cs.RO / 1 / 2603.11071
TinyNav: End-to-End TinyML for Real-Time Autonomous Navigation on Microcontrollers
TinyNav:用于微控制器的端到端 TinyML 实时自主导航
Abstract
Autonomous navigation typically relies on power-intensive processors, limiting accessibility in low-cost robotics. Although microcontrollers offer a resource-efficient alternative, they impose strict constraints on model complexity. We present TinyNav, an end-to-end TinyML system for real-time autonomous navigation on an ESP32 microcontroller. A custom-trained, quantized 2D convolutional neural network processes a 20-frame sliding window of depth data to predict steering and throttle commands. By avoiding 3D convolutions and recurrent layers, the 23k-parameter model achieves 30 ms inference latency. Correlation analysis and Grad-CAM validation indicate consistent spatial awareness and obstacle avoidance behavior. TinyNav demonstrates that responsive autonomous control can be deployed directly on highly constrained edge devices, reducing reliance on external compute resources.
Chinese Translation
自主导航通常依赖于高功耗处理器,这限制了低成本机器人技术的可及性。尽管微控制器提供了一种资源高效的替代方案,但它们对模型复杂性施加了严格的限制。我们提出了 TinyNav,这是一个在 ESP32 微控制器上实现实时自主导航的端到端 TinyML 系统。一个定制训练的量化二维卷积神经网络处理 20 帧深度数据的滑动窗口,以预测转向和油门指令。通过避免三维卷积和递归层,这个拥有 23,000 个参数的模型实现了 30 毫秒的推理延迟。相关性分析和 Grad-CAM 验证表明其具有一致的空间感知和避障行为。TinyNav 展示了响应式自主控制可以直接部署在高度受限的边缘设备上,从而减少对外部计算资源的依赖。
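The control loop described in the abstract, a 20-frame sliding window of depth data mapped to steering and throttle, can be sketched as follows. Only the windowing logic is taken from the abstract; the depth-asymmetry rule inside `_infer` is a hypothetical stand-in for the paper's quantized 23k-parameter CNN, whose weights are not given here.

```python
from collections import deque

WINDOW = 20  # frames per inference, as stated in the abstract

class SlidingDepthNav:
    """Buffer a 20-frame sliding window of depth rows and emit commands.

    The hand-written depth-asymmetry rule in _infer() is a hypothetical
    stand-in for TinyNav's quantized CNN.
    """

    def __init__(self, window=WINDOW):
        self.frames = deque(maxlen=window)

    def push(self, depth_row):
        """depth_row: depth values (metres) across the field of view."""
        self.frames.append(depth_row)
        if len(self.frames) < self.frames.maxlen:
            return None  # window not yet full
        return self._infer()

    def _infer(self):
        n = len(self.frames[0])
        left = sum(f[i] for f in self.frames for i in range(n // 2))
        right = sum(f[i] for f in self.frames for i in range(n // 2, n))
        # Steer toward the deeper (more open) side; slow down when close.
        steering = (right - left) / max(left + right, 1e-9)
        min_depth = min(min(f) for f in self.frames)
        throttle = max(0.0, min(1.0, min_depth / 2.0))
        return steering, throttle

nav = SlidingDepthNav()
cmd = None
for _ in range(WINDOW):
    # Obstacle on the left: shallow returns left, open space right.
    cmd = nav.push([0.5, 0.5, 3.0, 3.0])
steering, throttle = cmd
```

A `deque` with `maxlen` gives the constant-memory sliding window a microcontroller-class deployment would need.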
cs.RO / 2 / 2603.11072
OA-NBV: Occlusion-Aware Next-Best-View Planning for Human-Centered Active Perception on Mobile Robots
OA-NBV:面向移动机器人上以人为中心主动感知的遮挡感知下一最佳视角规划
Abstract
When our view is blocked, we naturally step sideways or lean to see around the obstacle, recovering a more informative observation. Enabling robots to make the same kind of viewpoint choice is critical for human-centered operations, including search, triage, and disaster response, where cluttered environments and partial visibility frequently degrade downstream perception. However, many Next-Best-View (NBV) methods primarily optimize generic exploration or long-horizon coverage, and do not explicitly target the immediate goal of obtaining a single usable observation of a partially occluded person under real motion constraints. We present Occlusion-Aware Next-Best-View Planning for Human-Centered Active Perception on Mobile Robots (OA-NBV), an occlusion-aware NBV pipeline that autonomously selects the next traversable viewpoint to obtain a more complete view of an occluded human. OA-NBV integrates perception and motion planning by scoring candidate viewpoints using a target-centric visibility model that accounts for occlusion, target scale, and target completeness, while restricting candidates to feasible robot poses. OA-NBV achieves over 90% success rate in both simulation and real-world trials, while baseline NBV methods degrade sharply under occlusion. Beyond success rate, OA-NBV improves observation quality: compared to the strongest baseline, it increases normalized target area by at least 81% and keypoint visibility by at least 58% across settings, making it a drop-in view-selection module for diverse human-centered downstream tasks.
Chinese Translation
当我们的视线被障碍物阻挡时,我们会自然地侧身或倾身以看向障碍物后方,从而获得信息量更大的观察。使机器人能够做出相同类型的视角选择对于以人为中心的操作至关重要,包括搜索、分诊和灾难响应等场景,在这些环境中,杂乱的环境和部分可见性常常会降低后续感知的效果。然而,许多下一最佳视角(Next-Best-View, NBV)方法主要优化通用探索或长视界覆盖,并未明确针对在实际运动约束下获取部分遮挡人类的单一可用观察这一直接目标。我们提出了面向人类中心主动感知的遮挡感知下一最佳视角规划(Occlusion-Aware Next-Best-View Planning for Human-Centered Active Perception on Mobile Robots, OA-NBV),这是一种遮挡感知的NBV管道,能够自主选择下一个可通行的视角,以获得被遮挡人类的更完整视图。OA-NBV通过使用考虑遮挡、目标尺度和目标完整性的以目标为中心的可见性模型对候选视角进行评分,同时将候选视角限制在可行的机器人姿态范围内,从而实现感知与运动规划的整合。OA-NBV在模拟和现实世界试验中均实现了超过90%的成功率,而基线NBV方法在遮挡情况下表现急剧下降。除了成功率外,OA-NBV还提高了观察质量:与最强基线相比,它在各种设置中将标准化目标区域增加至少81%,关键点可见性提高至少58%,使其成为多样化人类中心下游任务的即插即用视角选择模块。
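The viewpoint-selection step described above (score candidates with a target-centric visibility model, restrict to feasible poses, pick the best) reduces to a small scoring loop. The weights and the [0, 1] visibility/scale/completeness terms below are illustrative assumptions, not the paper's calibrated model.

```python
def viewpoint_score(c, w_vis=0.5, w_scale=0.2, w_comp=0.3):
    """Score a candidate viewpoint for observing an occluded person.

    c is a dict with target-centric terms in [0, 1]:
      visibility   - unoccluded fraction of the target from this pose
      scale        - projected target size relative to a desired size
      completeness - fraction of body keypoints expected in view
      feasible     - whether the pose is reachable by the robot
    The weights are illustrative, not the paper's calibrated values.
    """
    if not c["feasible"]:
        return float("-inf")  # infeasible poses are excluded outright
    return (w_vis * c["visibility"] + w_scale * c["scale"]
            + w_comp * c["completeness"])

def next_best_view(candidates):
    """Pick the highest-scoring traversable viewpoint."""
    return max(candidates, key=viewpoint_score)

candidates = [
    {"visibility": 0.9, "scale": 0.8, "completeness": 0.9, "feasible": False},
    {"visibility": 0.4, "scale": 0.9, "completeness": 0.5, "feasible": True},
    {"visibility": 0.8, "scale": 0.6, "completeness": 0.7, "feasible": True},
]
best = next_best_view(candidates)  # the well-placed *feasible* pose wins
```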
cs.RO / 3 / 2603.11074
DRAFTO: Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization for Robotic Manipulators
DRAFTO:面向机器人机械臂的解耦降维与自适应可行性修复轨迹优化
Abstract
This paper introduces a new algorithm for trajectory optimization, Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization (DRAFTO). It first constructs a constrained objective that accounts for smoothness, safety, joint limits, and task requirements. Then, it optimizes the coefficients, which are the coordinates of a set of basis functions for trajectory parameterization. To reduce the number of repeated constrained optimizations while handling joint-limit feasibility, the optimization is decoupled into a reduced-space Gauss-Newton (GN) descent for the main iterations and constrained quadratic programming for initialization and terminal feasibility repair. The two-phase acceptance rule with a non-monotone policy is applied to the GN model, which uses a hinge-squared penalty for inequality constraints, to ensure globalizability. The results of our benchmark tests against optimization-based planners, such as CHOMP, TrajOpt, GPMP2, and FACTO, and sampling-based planners, such as RRT-Connect, RRT*, and PRM, validate the high efficiency and reliability across diverse scenarios and tasks. The experiment involving grabbing an object from a drawer further demonstrates the potential for implementation in complex manipulation tasks. The supplemental video is available at https://youtu.be/XisFI37YyTQ.
Chinese Translation
本文介绍了一种新的轨迹优化算法——解耦的降维自适应可行性修复轨迹优化(DRAFTO)。该算法首先构建了一个考虑平滑性、安全性、关节限制和任务要求的约束目标。然后,它对系数进行优化,这些系数是用于轨迹参数化的一组基函数的坐标。为了在处理关节限制可行性时减少重复的约束优化次数,优化过程被解耦为主迭代的降维高斯-牛顿(Gauss-Newton, GN)下降和用于初始化及终端可行性修复的约束二次规划。对GN模型应用了具有非单调策略的两阶段接受规则,该模型使用铰链平方惩罚处理不等式约束,以确保全局收敛性。我们与基于优化的规划器(如CHOMP、TrajOpt、GPMP2和FACTO)以及基于采样的规划器(如RRT-Connect、RRT*和PRM)进行的基准测试结果验证了该算法在多种场景和任务中的高效性和可靠性。涉及从抽屉中抓取物体的实验进一步展示了其在复杂操控任务中实施的潜力。补充视频可在https://youtu.be/XisFI37YyTQ观看。
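The hinge-squared penalty mentioned above, used inside the Gauss-Newton model for inequality constraints g(x) ≤ 0, has a simple form: each violated constraint contributes quadratically, while satisfied ones contribute nothing. A minimal sketch with joint limits as the constraints; the weight `mu` is an illustrative choice, not a value from the paper.

```python
def hinge_squared_penalty(g_values, mu=10.0):
    """Penalty for inequality constraints g_i(x) <= 0.

    Each violated constraint (g_i > 0) contributes mu * g_i**2; satisfied
    constraints contribute nothing, so the penalty is C^1-smooth at the
    constraint boundary, which suits Gauss-Newton steps. mu is an
    illustrative weight.
    """
    return mu * sum(max(0.0, g) ** 2 for g in g_values)

def joint_limit_constraints(q, q_min, q_max):
    """Joint limits q_min <= q <= q_max rewritten as g(q) <= 0."""
    return ([lo - qi for qi, lo in zip(q, q_min)]
            + [qi - hi for qi, hi in zip(q, q_max)])

q = [0.5, 1.8]  # second joint exceeds its 1.5 rad upper limit by 0.3
g = joint_limit_constraints(q, q_min=[-1.5, -1.5], q_max=[1.5, 1.5])
p = hinge_squared_penalty(g)  # only the 0.3 rad violation is penalized
```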
cs.RO / 4 / 2603.11077
TATIC: Task-Aware Temporal Learning for Human Intent Inference from Physical Corrections in Human-Robot Collaboration
TATIC:基于任务感知的时间学习用于从人机协作中的物理修正推断人类意图
Abstract
In human-robot collaboration (HRC), robots must adapt online to dynamic task constraints and evolving human intent. While physical corrections provide a natural, low-latency channel for operators to convey motion-level adjustments, extracting task-level semantic intent from such brief interactions remains challenging. Existing foundation-model-based approaches primarily rely on vision and language inputs and lack mechanisms to interpret physical feedback. Meanwhile, traditional physical human-robot interaction (pHRI) methods leverage physical corrections for trajectory guidance but struggle to infer task-level semantics. To bridge this gap, we propose TATIC, a unified framework that utilizes torque-based contact force estimation and a task-aware Temporal Convolutional Network (TCN) to jointly infer discrete task-level intent and estimate continuous motion-level parameters from brief physical corrections. Task-aligned feature canonicalization ensures robust generalization across diverse layouts, while an intent-driven adaptation scheme translates inferred human intent into robot motion adaptations. Experiments achieve a 0.904 Macro-F1 score in intent recognition and demonstrate successful hardware validation in collaborative disassembly (see experimental video at https://youtu.be/xF8A52qwEc8).
Chinese Translation
在人机协作(HRC)中,机器人必须在线适应动态任务约束和不断变化的人类意图。虽然物理修正为操作员传达运动级别调整提供了一种自然、低延迟的渠道,但从这些简短的交互中提取任务级别的语义意图仍然具有挑战性。现有的基础模型方法主要依赖于视觉和语言输入,缺乏解释物理反馈的机制。同时,传统的物理人机交互(pHRI)方法利用物理修正进行轨迹指导,但在推断任务级别语义方面面临困难。为了解决这一问题,我们提出了TATIC,一个统一框架,利用基于扭矩的接触力估计和任务感知的时间卷积网络(TCN),共同推断离散的任务级别意图并从简短的物理修正中估计连续的运动级别参数。任务对齐特征标准化确保在多样化布局中的稳健泛化,而以意图驱动的适应方案则将推断的人类意图转化为机器人运动的适应。实验中在意图识别中达到了0.904的宏观F1分数,并在协作拆解中成功进行了硬件验证(请参见实验视频:https://youtu.be/xF8A52qwEc8)。
cs.RO / 5 / 2603.11080
SELF-VLA: A Skill Enhanced Agentic Vision-Language-Action Framework for Contact-Rich Disassembly
SELF-VLA:一种技能增强的自主视觉-语言-动作框架用于接触丰富的拆解
Abstract
Disassembly automation has long been pursued to address the growing demand for efficient and proper recovery of valuable components from end-of-life (EoL) electronic products. Existing approaches have demonstrated promising and regimented performance by decomposing the disassembly process into different subtasks. However, each subtask typically requires extensive data preparation, model training, and system management. Moreover, these approaches are often task- and component-specific, making them poorly suited to handle the variability and uncertainty of EoL products and limiting their generalization capabilities. All these factors restrict the practical deployment of current robotic disassembly systems and leave them highly reliant on human labor. With the recent development of foundation models in robotics, vision-language-action (VLA) models have shown impressive performance on standard robotic manipulation tasks, but their applicability to complex, contact-rich, and long-horizon industrial practices like disassembly, which requires sequential and precise manipulation, remains limited. To address this challenge, we propose SELF-VLA, an agentic VLA framework that integrates explicit disassembly skills. Experimental studies demonstrate that our framework significantly outperforms current state-of-the-art end-to-end VLA models on two contact-rich disassembly tasks. The video illustration can be found via https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/IROS-VLA-Video.mp4.
Chinese Translation
拆解自动化一直以来都是为了满足对高效、适当回收报废电子产品中有价值组件的日益增长的需求。现有方法通过将拆解过程分解为不同的子任务,展示了有希望且有条理的性能。然而,每个子任务通常需要大量的数据准备、模型训练和系统管理。此外,这些方法往往是特定于任务和组件的,使其难以处理报废产品的变异性和不确定性,限制了其推广能力。所有这些因素都限制了当前机器人拆解系统的实际部署,使其高度依赖人力。随着基础模型在机器人领域的最新发展,视觉-语言-动作(VLA)模型在标准机器人操作任务上展现了令人印象深刻的性能,但其在复杂、接触丰富和长时间跨度的工业实践(如拆解)中的适用性仍然有限,因为这些任务需要顺序和精确的操作。为了解决这一挑战,我们提出了SELF-VLA,一个集成了明确拆解技能的自主VLA框架。实验研究表明,我们的框架在两个接触丰富的拆解任务上显著优于当前最先进的端到端VLA模型。视频说明可通过以下链接查看:https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/IROS-VLA-Video.mp4。
cs.RO / 6 / 2603.11085
Edge-Assisted Multi-Robot Visual-Inertial SLAM with Efficient Communication
具有高效通信的边缘辅助多机器人视觉惯性同步定位与地图构建
Abstract
The integration of cloud and edge computing is an effective way to achieve globally consistent, real-time multi-robot Simultaneous Localization and Mapping (SLAM). Cloud computing alleviates the limited computing, communication, and storage capacity of terminal devices. However, limited bandwidth and extremely long communication links between terminal devices and the cloud cause serious performance degradation in multi-robot SLAM systems. To reduce the computational cost of feature tracking and improve the robot's real-time performance, we propose a lightweight SLAM method based on optical-flow tracking with pyramid IMU prediction. On this basis, we propose a centralized multi-robot SLAM system built on a robot-edge-cloud layered architecture to realize real-time collaborative SLAM, avoiding the limited on-board computing resources and low execution efficiency of a single robot. In this framework, only feature points and keyframe descriptors are transmitted, with lossless encoding and compression, to realize real-time remote information transmission under limited bandwidth. This design reduces the actual bandwidth occupied during data transmission without any loss of SLAM accuracy from compression. Experiments on the EuRoC dataset show that, compared with the state-of-the-art local feature compression method, our method transmits features at a lower data volume, and compared with advanced centralized multi-robot SLAM schemes, it achieves the same or better positioning accuracy under a low computational load.
Chinese Translation
云计算与边缘计算的结合是实现全局一致、实时的多机器人同步定位与地图构建(SLAM)的有效方式。云计算有效解决了终端设备计算、通信和存储能力有限的问题。然而,终端设备与云之间的带宽限制和极长的通信链路导致多机器人SLAM系统的性能严重下降。为了降低特征跟踪的计算成本并提高机器人的实时性能,提出了一种基于金字塔IMU预测的光流跟踪轻量级SLAM方法。在此基础上,提出了一种基于机器人-边缘-云分层架构的集中式多机器人SLAM系统,以实现实时协同SLAM。该系统避免了单个机器人计算资源有限和执行效率低下的问题。在该框架下,仅传输特征点和关键帧描述符,并进行无损编码和压缩,以实现有限带宽资源下的实时远程信息传输。该设计减少了数据传输过程中的实际带宽占用,并且不会因数据压缩而导致SLAM精度的损失。通过在EuRoC数据集上的实验验证,与当前最先进的局部特征压缩方法相比,我们的方法能够实现更低数据量的特征传输;与当前先进的集中式多机器人SLAM方案相比,在低计算负载下能够实现相同或更好的定位精度。
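The transmission scheme above, sending only feature points and keyframe descriptors with lossless encoding and compression, can be sketched with stdlib `zlib` standing in for the paper's actual codec. Because the coding is lossless, the descriptors decompress bit-exactly on the edge server, which is why compression costs no SLAM accuracy.

```python
import struct
import zlib

def pack_keyframe(points, descriptors):
    """Serialize (x, y) feature points plus 32-byte binary descriptors
    and compress the payload for robot-to-edge transmission.

    zlib stands in for the paper's encoder; any lossless codec restores
    the descriptors bit-exactly after decompression.
    """
    payload = b"".join(struct.pack("<ff", x, y) for x, y in points)
    payload += b"".join(descriptors)
    return zlib.compress(payload, level=9)

def unpack_keyframe(blob, n_points, desc_len=32):
    raw = zlib.decompress(blob)
    pts_end = n_points * 8
    points = [struct.unpack_from("<ff", raw, i * 8) for i in range(n_points)]
    descs = [raw[pts_end + i * desc_len: pts_end + (i + 1) * desc_len]
             for i in range(n_points)]
    return points, descs

points = [(10.0, 20.0), (30.5, 40.25)]
descs = [bytes([i % 4] * 32) for i in range(2)]  # toy ORB-like descriptors
blob = pack_keyframe(points, descs)
pts2, descs2 = unpack_keyframe(blob, n_points=2)
```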
cs.RO / 7 / 2603.11101
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
面向AI原生云端具身智能基础设施的千GPU大规模训练与优化方案
Zhou, Chen, Sun, Haoran, Yang, Hedan, Long, Jing, Xiong, Junwu, Wang, Luqiao, Luo, Mingxi, Yang, Qiming, Di, Shuai, Wang, Song, Zhao, Tianyun, Xu, Wanting, Huang, Wen, Bai, Xiaodong, Tian, Xiaomeng, Xiang, Xiaolong, Gong, Yicheng, Guo, Yongjian, Guo, Yucheng, Ma, Yunxuan, Wei, Yu, Guan, Zhong, Sun, Zhen
Abstract
Embodied intelligence is a key step towards Artificial General Intelligence (AGI), yet its development faces multiple challenges including data, frameworks, infrastructure, and evaluation systems. To address these issues, we have, for the first time in the industry, launched a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built upon the widely adopted LeRobot framework, and have systematically overcome bottlenecks across the entire pipeline. At the data layer, we have restructured the data pipeline to optimize the flow of embodied training data. In terms of training, for the GR00T-N1.5 model, utilizing thousand-GPU clusters and data at the scale of hundreds of millions, the single-round training time has been reduced from 15 hours to just 22 minutes, achieving a 40-fold speedup. At the model layer, by combining variable-length FlashAttention and Data Packing, we have moved from sample redundancy to sequence integration, resulting in a 188% speed increase; π-0.5 attention optimization has accelerated training by 165%; and FP8 quantization has delivered a 140% speedup. On the infrastructure side, relying on high-performance storage, a 3.2T RDMA network, and a Ray-driven elastic AI data lake, we have achieved deep synergy among data, storage, communication, and computation. We have also built an end-to-end evaluation system, creating a closed loop from training to simulation to assessment. This framework has already been fully validated on thousand-GPU clusters, laying a crucial technical foundation for the development and application of next-generation autonomous intelligent robots, and is expected to accelerate the arrival of the era of human-machine integration.
Chinese Translation
具身智能是实现人工通用智能(AGI)的关键步骤,但其发展面临数据、框架、基础设施和评估系统等多重挑战。为了解决这些问题,我们首次在行业内推出了基于云的千GPU分布式训练平台,专为具身智能而设计,基于广泛采用的LeRobot框架,系统性地克服了整个流程中的瓶颈。在数据层面,我们重构了数据管道,以优化具身训练数据的流动。在训练方面,对于GR00T-N1.5模型,利用千GPU集群和数亿规模的数据,单轮训练时间从15小时减少到仅22分钟,实现了40倍的加速。在模型层面,通过结合可变长度的FlashAttention和数据打包,我们从样本冗余转向序列整合,速度提升了188%;π-0.5注意力优化将训练加速了165%;FP8量化实现了140%的加速。在基础设施方面,依托高性能存储、3.2T RDMA网络和基于Ray的弹性AI数据湖,我们在数据、存储、通信和计算之间实现了深度协同。我们还建立了一个端到端的评估系统,形成了从训练到仿真再到评估的闭环。该框架已在千GPU集群上得到全面验证,为下一代自主智能机器人的开发和应用奠定了重要的技术基础,并有望加速人机融合时代的到来。
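The data-packing idea above, replacing per-sample padding redundancy with sequences packed back-to-back for variable-length FlashAttention, can be illustrated with a greedy first-fit packer. The token budget and sample lengths are made-up numbers.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of variable-length samples
    into bins of max_len tokens, so variable-length attention runs over
    concatenated real tokens instead of padding.
    Returns bins as lists of sample indices.
    """
    bins, free = [], []
    for idx, n in sorted(enumerate(lengths), key=lambda t: -t[1]):
        for b in range(len(bins)):
            if n <= free[b]:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            bins.append([idx])
            free.append(max_len - n)
    return bins

lengths = [700, 300, 512, 480, 60]  # token counts (made-up numbers)
bins = pack_sequences(lengths, max_len=1024)
# One sample per padded bin would waste 5*1024 - sum(lengths) = 3068 tokens;
# after packing, the waste is only the leftover space in each bin.
packed_waste = sum(1024 - sum(lengths[i] for i in b) for b in bins)
```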
cs.RO / 8 / 2603.11110
ResWM: Residual-Action World Model for Visual RL
ResWM:用于视觉强化学习的残差动作世界模型
Abstract
Learning predictive world models from raw visual observations is a central challenge in reinforcement learning (RL), especially for robotics and continuous control. Conventional model-based RL frameworks directly condition future predictions on absolute actions, which makes optimization unstable: the optimal action distributions are task-dependent, unknown a priori, and often lead to oscillatory or inefficient control. To address this, we introduce the Residual-Action World Model (ResWM), a new framework that reformulates the control variable from absolute actions to residual actions -- incremental adjustments relative to the previous step. This design aligns with the inherent smoothness of real-world control, reduces the effective search space, and stabilizes long-horizon planning. To further strengthen the representation, we propose an Observation Difference Encoder that explicitly models the changes between adjacent frames, yielding compact latent dynamics that are naturally coupled with residual actions. ResWM is integrated into a Dreamer-style latent dynamics model with minimal modifications and no extra hyperparameters. Both imagination rollouts and policy optimization are conducted in the residual-action space, enabling smoother exploration, lower control variance, and more reliable planning. Empirical results on the DeepMind Control Suite demonstrate that ResWM achieves consistent improvements in sample efficiency, asymptotic returns, and control smoothness, significantly surpassing strong baselines such as Dreamer and TD-MPC. Beyond performance, ResWM produces more stable and energy-efficient action trajectories, a property critical for robotic systems deployed in real-world environments. These findings suggest that residual action modeling provides a simple yet powerful principle for bridging algorithmic advances in RL with the practical requirements of robotics.
Chinese Translation
从原始视觉观测中学习预测世界模型是强化学习(RL)中的一个核心挑战,尤其是在机器人技术和连续控制领域。传统的基于模型的强化学习框架直接将未来预测与绝对动作相联系,这使得优化过程不稳定:最优动作分布依赖于任务,事先未知,且往往导致振荡或低效控制。为了解决这一问题,我们提出了残差动作世界模型(ResWM),这是一个新的框架,它将控制变量从绝对动作重新表述为残差动作——相对于上一步的增量调整。这一设计与现实世界控制的内在平滑性相一致,减少了有效搜索空间,并稳定了长时间规划。为了进一步增强表示能力,我们提出了一种观察差异编码器,明确建模相邻帧之间的变化,生成与残差动作自然耦合的紧凑潜在动态。ResWM被集成到一个Dreamer风格的潜在动态模型中,进行了最小修改且没有额外超参数。想象回滚和策略优化均在残差动作空间中进行,从而实现更平滑的探索、更低的控制方差和更可靠的规划。在DeepMind控制套件上的实证结果表明,ResWM在样本效率、渐近回报和控制平滑性方面取得了一致的改善,显著超越了Dreamer和TD-MPC等强基线。除了性能,ResWM还生成了更稳定和节能的动作轨迹,这对于在现实环境中部署的机器人系统至关重要。这些发现表明,残差动作建模为将强化学习中的算法进展与机器人技术的实际需求相结合提供了一个简单而强大的原则。
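The core reformulation above, a_t = clip(a_{t-1} + δ_t), is a small wrapper around any policy's output. A minimal sketch; the residual bound `max_delta` is an illustrative hyperparameter, not a value from the paper.

```python
class ResidualActionSpace:
    """Reformulate control from absolute to residual actions.

    The policy outputs an increment delta_t; the executed action is
    a_t = clip(a_{t-1} + clip(delta_t)). Bounding the increment encodes
    the smoothness prior; max_delta below is an illustrative value.
    """

    def __init__(self, dim, lo=-1.0, hi=1.0, max_delta=0.1):
        self.lo, self.hi, self.max_delta = lo, hi, max_delta
        self.prev = [0.0] * dim

    def step(self, delta):
        a = []
        for p, d in zip(self.prev, delta):
            d = max(-self.max_delta, min(self.max_delta, d))
            a.append(max(self.lo, min(self.hi, p + d)))
        self.prev = a
        return a

space = ResidualActionSpace(dim=2)
a1 = space.step([0.05, -0.3])  # second increment is clipped to -0.1
a2 = space.step([0.05, 0.0])   # actions accumulate smoothly
```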
cs.RO / 9 / 2603.11130
Robust Co-design Optimisation for Agile Fixed-Wing UAVs
面向敏捷固定翼无人机的鲁棒协同设计优化
Abstract
Co-design optimisation of autonomous systems has emerged as a powerful alternative to sequential approaches by jointly optimising physical design and control strategies. However, existing frameworks often neglect the robustness required for autonomous systems navigating unstructured, real-world environments. For agile Unmanned Aerial Vehicles (UAVs) operating at the edge of the flight envelope, this lack of robustness yields designs that are sensitive to perturbations and model mismatch. To address this, we propose a robust co-design framework for agile fixed-wing UAVs that integrates parametric uncertainty and wind disturbances directly into the concurrent optimisation process. Our bi-level approach optimises physical design in a high-level loop while discovering nominal solutions via a constrained trajectory planner and evaluating performance across a stochastic Monte Carlo ensemble using feedback LQR control. Validated across three agile flight missions, our strategy consistently outperforms deterministic baselines. The results demonstrate that our robust co-design strategy inherently tailors aerodynamic features, such as wing placement and aspect ratio, to achieve an optimal trade-off between mission performance and disturbance rejection.
Chinese Translation
自主系统的协同设计优化已成为一种强有力的替代方案,能够通过共同优化物理设计和控制策略来替代顺序方法。然而,现有框架往往忽视了自主系统在非结构化真实环境中导航所需的鲁棒性。对于在飞行包络边缘操作的敏捷无人机(UAV)而言,这种鲁棒性的缺失导致设计对扰动和模型不匹配敏感。为了解决这一问题,我们提出了一种针对敏捷固定翼无人机的鲁棒协同设计框架,该框架将参数不确定性和风扰动直接整合到并行优化过程中。我们的双层方法在高层循环中优化物理设计,同时通过受约束的轨迹规划器发现名义解,并利用反馈LQR控制在随机蒙特卡洛集上评估性能。通过对三次敏捷飞行任务的验证,我们的策略始终优于确定性基线。结果表明,我们的鲁棒协同设计策略本质上调整了气动特性,如机翼位置和展弦比,以实现任务性能与扰动抑制之间的最佳权衡。
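The inner evaluation of the bi-level loop, scoring one candidate design across a stochastic Monte Carlo ensemble of parameter perturbations and wind disturbances, can be sketched as below. The cost model and noise magnitudes are toy assumptions; the point is that robust averaging can reorder designs relative to their nominal cost.

```python
import random

def robust_cost(design, nominal_cost, n_samples=200, seed=0):
    """Average a design's cost over a Monte Carlo ensemble of parameter
    perturbations and wind gusts, as in the outer co-design loop.

    The toy model is an assumption for illustration: mass is uncertain
    by ~5%, and gust sensitivity grows with aspect ratio.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        mass_err = rng.gauss(0.0, 0.05)   # parametric uncertainty
        gust = abs(rng.gauss(0.0, 2.0))   # wind disturbance, m/s
        total += (nominal_cost(design) * (1.0 + mass_err)
                  + 0.3 * design["aspect_ratio"] * gust)
    return total / n_samples

cost = lambda d: 10.0 / d["aspect_ratio"]   # toy nominal mission cost
conservative = {"aspect_ratio": 4.0}
aggressive = {"aspect_ratio": 12.0}
```

Here the high-aspect-ratio design wins on nominal cost but loses once gust sensitivity is averaged in, which is exactly the performance/disturbance-rejection trade-off the entry describes.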
cs.RO / 10 / 2603.11290
A Causal Approach to Predicting and Improving Human Perceptions of Social Navigation Robots
一种因果方法用于预测和改善人类对社交导航机器人的感知
Abstract
As mobile robots are increasingly deployed in human environments, enabling them to predict how people perceive them is critical for socially adaptable navigation. Predicting perceptions is challenging for two main reasons: (1) HRI prediction models must learn from limited data, and (2) the obtained models must be interpretable to enable safe and effective interactions. Interpretability is particularly important when a robot is perceived as incompetent (e.g., when the robot suddenly stops or rotates away from the goal), as it allows the robot to explain its reasoning and identify controllable factors to improve performance, requiring causal rather than associative reasoning. To address these challenges, we propose a Causal Bayesian Network designed to predict how people perceive a mobile robot's competence and how they interpret its intent during navigation. Additionally, we introduce a novel method to improve perceived robot competence employing a combinatorial search, guided by the proposed causal model, to identify better navigation behaviors. Our method enhances interpretability and generates counterfactual robot motions while achieving comparable or superior predictive performance to state-of-the-art methods, reaching an F1-score of 0.78 and 0.75 for competence and intention on a binary scale. To further assess our method's ability to improve the perceived robot competence, we conducted an online evaluation in which users rated robot behaviors on a 5-point Likert scale. Our method statistically significantly increased the perceived competence of low-competent robot behavior by 83%.
Chinese Translation
随着移动机器人在人类环境中的部署日益增多,使其能够预测人们对它们的感知对于实现社会适应性导航至关重要。预测感知面临两个主要挑战:(1)人机交互(HRI)预测模型必须从有限的数据中学习,以及(2)所获得的模型必须具有可解释性,以便实现安全和有效的互动。当机器人被视为无能时(例如,当机器人突然停止或朝目标相反的方向旋转时),可解释性尤为重要,因为这使得机器人能够解释其推理过程并识别可控因素以改善性能,这需要因果而非关联的推理。为了解决这些挑战,我们提出了一种因果贝叶斯网络,旨在预测人们如何感知移动机器人的能力以及他们在导航过程中如何解读其意图。此外,我们引入了一种新方法,通过组合搜索,借助所提出的因果模型来识别更好的导航行为,从而提高感知的机器人能力。我们的方法增强了可解释性,并生成了反事实的机器人运动,同时在预测性能上达到了与最先进方法相当或更优的结果,在二元尺度上,能力和意图的F1分数分别达到了0.78和0.75。为了进一步评估我们方法改善感知机器人能力的能力,我们进行了在线评估,用户对机器人行为进行了5点李克特量表评分。我们的方法在统计上显著提高了低能力机器人行为的感知能力,提升幅度达83%。
cs.RO / 11 / 2603.11328
Distributed Kalman-Consensus Filtering with Adaptive Uncertainty Weighting for Multi-Object Tracking in Mobile Robot Networks
用于移动机器人网络中的多目标跟踪的自适应不确定性加权分布式卡尔曼共识滤波
Abstract
This paper presents an implementation and evaluation of a Distributed Kalman-Consensus Filter (DKCF) for Multi-Object Tracking (MOT) in mobile robot networks operating under partial observability and heterogeneous localization uncertainty. A key challenge in such systems is the fusion of information from agents with differing localization quality, where frame misalignment can lead to inconsistent estimates, track duplication, and ghost tracks. To address this issue, we build upon the MOTLEE framework and retain its frame-alignment methodology, which uses consistently tracked dynamic objects as transient landmarks to improve relative pose estimates between robots. On top of this framework, we propose an uncertainty-aware adaptive consensus weighting mechanism that dynamically adjusts the influence of neighbor information based on the covariance of the transmitted estimates, thereby reducing the impact of unreliable data during distributed fusion. Local tracking is performed using a Kalman Filter (KF) with a Constant Velocity Model (CVM) and Global Nearest Neighbor (GNN) data association. Simulation results demonstrate that adaptive weighting effectively protects local estimates from inconsistent data, yielding a MOTA improvement of 0.09 for agents suffering from localization drift, although system performance remains constrained by communication latency.
Chinese Translation
本文提出了一种分布式卡尔曼共识滤波器(DKCF)的实现与评估,旨在解决在部分可观测性和异构定位不确定性下的移动机器人网络中的多目标跟踪(MOT)问题。在此类系统中,一个关键挑战是融合来自定位质量不同的代理的信息,其中帧对齐不当可能导致估计不一致、轨迹重复和虚假轨迹。为了解决这一问题,我们基于MOTLEE框架,并保留其帧对齐方法,该方法利用持续跟踪的动态物体作为瞬时地标,以改善机器人之间的相对姿态估计。在此框架之上,我们提出了一种不确定性感知的自适应共识加权机制,该机制根据传输估计的协方差动态调整邻居信息的影响,从而在分布式融合过程中减少不可靠数据的影响。局部跟踪使用具有恒定速度模型(CVM)和全局最近邻(GNN)数据关联的卡尔曼滤波器(KF)进行。仿真结果表明,自适应加权有效地保护了局部估计免受不一致数据的影响,使得在遭受定位漂移的代理中,MOTA提高了0.09,尽管系统性能仍受限于通信延迟。
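The adaptive weighting idea above, scaling each neighbor's influence by the reported uncertainty of its estimate, can be sketched for a single state component by weighting estimates with the inverse of their covariance trace. The weight form is an illustrative choice, not the paper's exact consensus gain.

```python
def adaptive_fuse(local_est, local_tr, neighbor_msgs):
    """Covariance-weighted consensus fusion for one state component.

    Each message is (estimate, covariance_trace). Weighting by inverse
    covariance trace means a neighbor reporting high uncertainty - e.g.
    a robot suffering localization drift - barely moves the fused track.
    The inverse-trace weight is an illustrative choice.
    """
    ests = [(local_est, local_tr)] + list(neighbor_msgs)
    weights = [1.0 / max(tr, 1e-9) for _, tr in ests]
    return sum(w * e for w, (e, _) in zip(weights, ests)) / sum(weights)

# A confident neighbor (trace 0.1) and a drifting one (trace 10.0):
fused = adaptive_fuse(1.0, 0.5, [(1.2, 0.1), (5.0, 10.0)])
naive = (1.0 + 1.2 + 5.0) / 3.0  # unweighted average, pulled off-track
```

The drifting neighbor barely moves the fused estimate, whereas the naive average is dragged to about 2.4.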
cs.RO / 12 / 2603.11335
ADMM-based Continuous Trajectory Optimization in Graphs of Convex Sets
基于ADMM的凸集图中的连续轨迹优化
Abstract
This paper presents a numerical solver for computing continuous trajectories in non-convex environments. Our approach relies on a customized implementation of the Alternating Direction Method of Multipliers (ADMM) built upon two key components: First, we parameterize trajectories as polynomials, allowing the primal update to be computed in closed form as a minimum-control-effort problem. Second, we introduce the concept of a spatio-temporal allocation graph based on a mixed-integer formulation and pose the slack update as a shortest-path search. The combination of these ingredients results in a solver with several distinct advantages over the state of the art. By jointly optimizing over both discrete spatial and continuous temporal domains, our method accesses a larger search space than existing decoupled approaches, enabling the discovery of superior trajectories. Additionally, the solver's structural robustness ensures reliable convergence from naive initializations, removing the bottleneck of complex warm starting in non-convex environments.
Chinese Translation
本文提出了一种数值求解器,用于计算非凸环境中的连续轨迹。我们的方法依赖于交替方向乘子法(ADMM)的定制实现,基于两个关键组件:首先,我们将轨迹参数化为多项式,从而使原始更新可以以闭式形式计算,作为一个最小控制努力问题。其次,我们引入了基于混合整数形式的时空分配图的概念,并将松弛更新视为最短路径搜索。这些元素的结合使得我们的求解器在多个方面优于现有的最先进技术。通过在离散空间和连续时间域上联合优化,我们的方法能够访问比现有解耦方法更大的搜索空间,从而发现更优的轨迹。此外,求解器的结构鲁棒性确保了从简单初始化的可靠收敛,消除了在非凸环境中复杂热启动的瓶颈。
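The splitting described above can be miniaturized to one dimension: the primal step is a closed-form smoothness (minimum-effort) update per trajectory point, and the slack step projects onto the free-space regions, a stand-in for the paper's shortest-path slack update over the spatio-temporal allocation graph. Regions, ρ, and iteration count are toy choices.

```python
REGIONS = ((-1.0, 1.0), (2.0, 4.0))  # toy free space: union of two intervals

def admm_trajectory(n=8, start=0.0, goal=3.0, rho=1.0, iters=200):
    """Toy ADMM split for a 1-D trajectory with pinned endpoints.

    Primal step: Gauss-Seidel minimum-effort (smoothness) update per
    interior point, in closed form. Slack step: project each point onto
    the nearest free-space interval - a stand-in for the paper's
    shortest-path slack update.
    """
    x = [start + (goal - start) * i / (n - 1) for i in range(n)]
    z, u = list(x), [0.0] * n

    def project(v):
        best = None
        for lo, hi in REGIONS:
            p = min(max(v, lo), hi)
            if best is None or abs(p - v) < abs(best - v):
                best = p
        return best

    for _ in range(iters):
        for i in range(1, n - 1):  # primal: smoothness + augmented term
            x[i] = (x[i - 1] + x[i + 1] + rho * (z[i] - u[i])) / (2.0 + rho)
        z = [project(xi + ui) for xi, ui in zip(x, u)]  # slack: projection
        z[0], z[-1] = start, goal
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]  # dual update
    return z

traj = admm_trajectory()  # every returned point lies inside free space
```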
cs.RO / 13 / 2603.11351
Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning
通过混合大型语言模型(LLM)-符号规划和LLM引导的强化学习进行新奇适应
Abstract
In dynamic open-world environments, autonomous agents often encounter novelties that hinder their ability to find plans to achieve their goals. Specifically, traditional symbolic planners fail to generate plans when the robot's planning domain lacks the operators that enable it to interact appropriately with novel objects in the environment. We propose a neuro-symbolic architecture that integrates symbolic planning, reinforcement learning, and a large language model (LLM) to learn how to handle novel objects. In particular, we leverage the common sense reasoning capability of the LLM to identify missing operators, generate plans with the symbolic AI planner, and write reward functions to guide the reinforcement learning agent in learning control policies for newly identified operators. Our method outperforms the state-of-the-art methods in operator discovery as well as operator learning in continuous robotic domains.
Chinese Translation
在动态开放世界环境中,自主代理经常会遇到新奇事物,这些事物妨碍了它们找到实现目标的计划。具体而言,传统的符号规划器在机器人规划领域缺乏能够使其与环境中新奇物体适当互动的操作符时,无法生成计划。我们提出了一种神经符号架构,集成了符号规划、强化学习和大型语言模型(LLM),以学习如何处理新奇物体。特别地,我们利用LLM的常识推理能力来识别缺失的操作符,使用符号人工智能规划器生成计划,并编写奖励函数,以指导强化学习代理学习新识别操作符的控制策略。我们的方法在操作符发现和连续机器人领域的操作符学习方面优于现有的最先进方法。
cs.RO / 14 / 2603.11364
MirrorDrift: Actuated Mirror-Based Attacks on LiDAR SLAM
MirrorDrift:基于可动镜面的 LiDAR SLAM 攻击
Abstract
LiDAR SLAM provides high-accuracy localization but is fragile to point-cloud corruption because scan matching assumes geometric consistency. Prior physical attacks on LiDAR SLAM largely rely on LiDAR spoofing via external signal injection, which requires sensor-specific timing knowledge and is increasingly mitigated by modern defense mechanisms such as timing obfuscation and injection rejection. In this work, we show that specular reflection offers an injection-free alternative and demonstrate an attack, MirrorDrift, that uses an actuated planar mirror to cause ghost points in LiDAR scans and systematically bias scan-matching correspondences. MirrorDrift optimizes mirror placement, alignment, and actuation. In simulation, it increases the average pose error (APE) by 6.1x over random placement, degrading three SLAM systems to 2.29-3.31 m mean APE. In real-world experiments on a modern LiDAR with state-of-the-art interference mitigation, it induces localization errors of up to 6.03 m. To the best of our knowledge, this is the first successful SLAM-targeted attack against production-grade secure LiDARs.
Chinese Translation
LiDAR SLAM 提供高精度定位,但由于扫描匹配假设几何一致性,因此对点云损坏非常脆弱。之前对 LiDAR SLAM 的物理攻击主要依赖于通过外部信号注入进行的 LiDAR 欺骗,这需要传感器特定的时序知识,并且越来越受到现代防御机制(如时序模糊和注入拒绝)的缓解。在本研究中,我们展示了镜面反射提供了一种无注入的替代方案,并演示了一种攻击方法 MirrorDrift,该方法使用可动平面镜在 LiDAR 扫描中造成虚假点,并系统性地偏置扫描匹配对应关系。MirrorDrift 优化了镜子的放置、对齐和驱动。在仿真中,它使平均位姿误差(APE)比随机放置增加了 6.1 倍,使三个 SLAM 系统的平均 APE 恶化至 2.29-3.31 米。在对现代 LiDAR 进行的真实世界实验中,尽管采用了先进的干扰缓解措施,它仍然导致了高达 6.03 米的定位误差。据我们所知,这是针对生产级安全 LiDAR 的首次成功 SLAM 目标攻击。
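The ghost-point geometry MirrorDrift exploits is plain mirror reflection: a return that bounces off the mirror is registered at the reflection of the true surface point across the mirror plane. A minimal sketch (unit normal assumed):

```python
def reflect_point(p, mirror_point, normal):
    """Reflect a 3-D point across a planar mirror (unit normal assumed).

    A LiDAR return that bounces off the mirror is registered at the
    reflection of the true surface point across the mirror plane - the
    ghost-point geometry that biases scan-matching correspondences.
    """
    d = sum((pi - mi) * ni for pi, mi, ni in zip(p, mirror_point, normal))
    return tuple(pi - 2.0 * d * ni for pi, ni in zip(p, normal))

# Mirror in the plane x = 2 (normal +x): a wall point at x = 3 seen via
# the mirror produces a ghost at x = 1, inside supposedly free space.
ghost = reflect_point((3.0, 0.5, 0.2), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```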
cs.RO / 15 / 2603.11365
D-SLAMSpoof: An Environment-Agnostic LiDAR Spoofing Attack using Dynamic Point Cloud Injection
D-SLAMSpoof:一种环境无关的激光雷达欺骗攻击,采用动态点云注入
Abstract
In this work, we introduce Dynamic SLAMSpoof (D-SLAMSpoof), a novel attack that compromises LiDAR SLAM even in feature-rich environments. The attack leverages LiDAR spoofing, which injects spurious measurements into LiDAR scans through external laser interference. By designing both spatial injection shapes and temporally coordinated dynamic injection patterns guided by scan-matching principles, D-SLAMSpoof significantly improves attack success rates in real-world, feature-rich environments such as urban areas and indoor spaces, where conventional LiDAR spoofing methods often fail. Furthermore, we propose a practical defense method, ISD-SLAM, that relies solely on inertial dead reckoning signals commonly available in autonomous systems. We demonstrate that ISD-SLAM accurately detects LiDAR spoofing attacks, including D-SLAMSpoof, and effectively mitigates the resulting position drift. Our findings expose inherent vulnerabilities in LiDAR-based SLAM and introduce the first practical defense against LiDAR-based SLAM spoofing using only standard onboard sensors, providing critical insights for improving the security and reliability of autonomous systems.
Chinese Translation
在本研究中,我们介绍了动态SLAMSpoof(D-SLAMSpoof),这是一种新颖的攻击方法,能够在特征丰富的环境中破坏激光雷达SLAM。该攻击利用激光雷达欺骗,通过外部激光干扰将虚假测量注入激光雷达扫描中。通过设计空间注入形状和基于扫描匹配原理的时间协调动态注入模式,D-SLAMSpoof显著提高了在城市区域和室内空间等现实世界特征丰富环境中的攻击成功率,而传统的激光雷达欺骗方法在这些环境中往往失败。此外,我们提出了一种实用的防御方法ISD-SLAM,该方法仅依赖于自主系统中常见的惯性航迹推算信号。我们证明了ISD-SLAM能够准确检测激光雷达欺骗攻击,包括D-SLAMSpoof,并有效减轻由此产生的位置漂移。我们的研究揭示了基于激光雷达的SLAM的固有脆弱性,并首次提出了仅使用标准机载传感器防御基于激光雷达的SLAM欺骗的实用方法,为提高自主系统的安全性和可靠性提供了重要见解。
cs.RO / 16 / 2603.11383
Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics
基于视觉的手部动作跟随:通过逆运动学实现机器人操作
Abstract
Teleoperation of low-cost robotic manipulators remains challenging due to the complexity of mapping human hand articulations to robot joint commands. We present an offline hand-shadowing and retargeting pipeline from a single egocentric RGB-D camera mounted on 3D-printed glasses. The pipeline detects 21 hand landmarks per hand using MediaPipe Hands, deprojects them into 3D via depth sensing, transforms them into the robot coordinate frame, and solves a damped-least-squares inverse kinematics problem in PyBullet to produce joint commands for the 6-DOF SO-ARM101 robot. A gripper controller maps thumb-index finger geometry to grasp aperture with a four-level fallback hierarchy. Actions are first previewed in a physics simulation before replay on the physical robot through the LeRobot framework. We evaluate the IK retargeting pipeline on a structured pick-and-place benchmark (5-tile grid, 10 grasps per tile) achieving a 90% success rate, and compare it against four vision-language-action policies (ACT, SmolVLA, pi0.5, GR00T N1.5) trained on leader-follower teleoperation data. We also test the IK pipeline in unstructured real-world environments (grocery store, pharmacy), where hand occlusion by surrounding objects reduces success to 9.3% (N=75), highlighting both the promise and current limitations of marker-free analytical retargeting.
Chinese Translation
由于将人手关节动作映射到机器人关节指令的复杂性,低成本机器人操作臂的远程控制仍然面临挑战。我们提出了一种离线手部动作跟随与重定向管道,该管道基于安装在3D打印眼镜上的单个自我中心RGB-D相机。该管道使用MediaPipe Hands检测每只手的21个手部关键点,通过深度感知将其反投影到3D空间,转换为机器人坐标系,并在PyBullet中求解阻尼最小二乘逆向运动学问题,以生成6自由度SO-ARM101机器人的关节指令。夹爪控制器将拇指和食指的几何形状映射到抓取开口,并具有四级回退层级。在通过LeRobot框架在物理机器人上重放之前,首先在物理仿真中预览动作。我们在一个结构化的拾取和放置基准测试(5格网格,每格10次抓取)上评估了逆向运动学重定向管道,成功率达到90%,并将其与在领导-跟随远程操作数据上训练的四种视觉-语言-动作策略(ACT、SmolVLA、pi0.5、GR00T N1.5)进行了比较。我们还在非结构化的现实环境(杂货店、药店)中测试了逆向运动学管道,在这些环境中,周围物体对手部的遮挡使成功率降至9.3%(N=75),突显了无标记分析重定向的潜力和当前局限性。
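The damped-least-squares step at the heart of the pipeline, Δq = Jᵀ(JJᵀ + λ²I)⁻¹e, can be shown on a planar 2-link arm instead of the 6-DOF SO-ARM101, since the 2×2 system inverts in closed form. Link lengths, damping λ, and the target below are illustrative.

```python
import math

def fk(q, l1=1.0, l2=1.0):
    """Forward kinematics of a planar 2-link arm: end-effector (x, y)."""
    return (l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1]),
            l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1]))

def jacobian(q, l1=1.0, l2=1.0):
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [ l1 * c1 + l2 * c12,  l2 * c12]]

def dls_step(q, target, lam=0.1):
    """One damped-least-squares step: dq = J^T (J J^T + lam^2 I)^-1 e.

    The damping keeps updates bounded near singularities. The 2x2 system
    A = J J^T + lam^2 I is inverted in closed form here; the real
    pipeline solves the 6-DOF equivalent inside PyBullet.
    """
    x, y = fk(q)
    e = (target[0] - x, target[1] - y)
    J = jacobian(q)
    a = J[0][0] ** 2 + J[0][1] ** 2 + lam ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + lam ** 2
    det = a * d - b * b
    v = ((d * e[0] - b * e[1]) / det, (a * e[1] - b * e[0]) / det)
    dq = (J[0][0] * v[0] + J[1][0] * v[1], J[0][1] * v[0] + J[1][1] * v[1])
    return (q[0] + dq[0], q[1] + dq[1])

q, target = (0.3, 0.6), (1.2, 0.8)
for _ in range(100):
    q = dls_step(q, target)  # iterate until the end-effector reaches target
```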
cs.RO / 17 / 2603.11400
Deployment-Time Reliability of Learned Robot Policies
学习型机器人策略的部署时可靠性
Abstract
Recent advances in learning-based robot manipulation have produced policies with remarkable capabilities. Yet, reliability at deployment remains a fundamental barrier to real-world use, where distribution shift, compounding errors, and complex task dependencies collectively undermine system performance. This dissertation investigates how the reliability of learned robot policies can be improved at deployment time through mechanisms that operate around them. We develop three complementary classes of deployment-time mechanisms. First, we introduce runtime monitoring methods that detect impending failures by identifying inconsistencies in closed-loop policy behavior and deviations in task progress, without requiring failure data or task-specific supervision. Second, we propose a data-centric framework for policy interpretability that traces deployment-time successes and failures to influential training demonstrations using influence functions, enabling principled diagnosis and dataset curation. Third, we address reliable long-horizon task execution by formulating policy coordination as the problem of estimating and maximizing the success probability of behavior sequences, and we extend this formulation to open-ended, language-specified tasks through feasibility-aware task planning. By centering on core challenges of deployment, these contributions advance practical foundations for the reliable, real-world use of learned robot policies. Continued progress on these foundations will be essential for enabling trustworthy and scalable robot autonomy in the future.
Chinese Translation
近年来,基于学习的机器人操作技术取得了显著进展,产生了具有卓越能力的策略。然而,部署时的可靠性仍然是实际应用中的一个基本障碍,分布偏移、累积错误和复杂任务依赖性共同削弱了系统性能。本论文探讨了如何通过围绕学习型机器人策略的机制来提高其在部署时的可靠性。我们开发了三类互补的部署时机制。首先,我们引入了运行时监控方法,通过识别闭环策略行为中的不一致性和任务进展中的偏差来检测即将发生的故障,而无需依赖故障数据或任务特定的监督。其次,我们提出了一个以数据为中心的策略可解释性框架,该框架使用影响函数将部署时的成功与失败追溯到具有影响力的训练演示,从而实现原则性的诊断和数据集策划。第三,我们通过将策略协调表述为估计和最大化行为序列成功概率的问题,来解决可靠的长时间跨度任务执行问题,并通过考虑可行性的任务规划将这一表述扩展到开放式、语言指定的任务。通过聚焦于部署的核心挑战,这些贡献为学习型机器人策略的可靠实际应用奠定了实用基础。继续在这些基础上取得进展对于未来实现可信赖和可扩展的机器人自主性至关重要。
cs.RO / 18 / 2603.11404
Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization
通过进化优化实现基于实时渲染的手术器械跟踪
Abstract
Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots still remains challenging due to partial visibility and specialized articulation design of surgical instruments. Previous works in the field are usually prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational costs and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. Using batch rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
Chinese Translation
手术器械的准确高效跟踪是机器人辅助微创手术的基础。尽管基于视觉的机器人姿态估计已实现无标记校准,无需繁琐的物理设置,但由于手术器械的部分可见性和特殊的关节设计,手术机器人仍然面临可靠工具跟踪的挑战。该领域以往的研究通常在视觉质量下降和数据稀缺的情况下容易出现不可靠的特征检测,而基于渲染的方法则常常面临计算成本和次优收敛的问题。在本研究中,我们将 CMA-ES(协方差矩阵适应进化策略)这一进化优化策略融入一个多功能跟踪管道中,该管道共同估计手术器械的姿态和关节配置。通过批量渲染高效地并行评估多个姿态候选,该方法显著减少了推理时间并提高了收敛的鲁棒性。所提出的框架进一步推广到无关节角和双手操作的跟踪设置,使其适用于视觉反馈控制和在线手术视频校准。在合成和真实世界数据集上的大量实验表明,所提方法在准确性和运行时间上显著优于先前的方法。
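The batch-evaluated evolutionary search described above can be illustrated with a much-simplified (mu, lambda) evolution strategy. The annealed isotropic sampler and the distance objective below are stand-ins: full CMA-ES adapts a full covariance matrix, and the real objective is a batched rendering loss over instrument pose and joint candidates:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_loss(cands, target):
    """Stand-in for a batched rendering loss: here, distance of each
    candidate pose vector to a hidden target pose."""
    return np.linalg.norm(cands - target, axis=1)

def simple_es(loss_fn, dim, iters=100, pop=32, elite=8, sigma=0.5, decay=0.97):
    """A (mu, lambda) evolution strategy with an annealed isotropic
    sampler -- a much-simplified stand-in for full CMA-ES."""
    mean = np.zeros(dim)
    for _ in range(iters):
        cand = mean + sigma * rng.standard_normal((pop, dim))  # sample a batch
        order = np.argsort(loss_fn(cand))                      # rank candidates
        mean = cand[order[:elite]].mean(axis=0)                # recombine elites
        sigma *= decay                                         # anneal search radius
    return mean

target = np.array([0.4, -0.2, 0.9, 0.1, -0.5, 0.3])  # hidden 6-D "pose"
est = simple_es(lambda c: batch_loss(c, target), dim=6)
```

The key structural point the paper exploits is that the whole `cand` batch can be scored in one parallel rendering pass per generation, which is what makes derivative-free search affordable at runtime.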
cs.RO / 19 / 2603.11426
Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs
通过检索增强的视觉语言模型将机器人泛化溯源至训练数据
Abstract
Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generalization from the training distribution of a given policy. To work towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework for directly comparing test-time evaluation tasks to policy training data, to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, outputting interpretable analysis on how they compare along a variety of axes, and an overall classification of what type of policy generalization is required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step effectively identifies examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where we observe agreement with human-defined benchmark conditions from prior work. We provide demonstrations at radar-analysis.github.io.
Chinese Translation
近期在机器人操作方面的研究已推动策略在新场景中的泛化。然而,通常很难表征不同评估设置如何实际代表给定策略的训练分布中的泛化。为了更精确地评估机器人领域的泛化,我们提出了RADAR,一个可扩展的框架,用于直接比较测试时评估任务与策略训练数据,以确定所需的策略泛化形式。RADAR由一个两阶段的流程组成:首先,使用通用策略嵌入进行检索,以识别与给定评估任务相关的训练示例。接下来,视觉语言模型(VLMs)分析评估任务与检索到的数据,输出关于它们在多种维度上的比较的可解释分析,以及对所需策略泛化类型的整体分类。通过受控实验,我们证明了VLMs在分析泛化数据方面的有效性,并且我们的检索步骤有效地识别了与训练数据相关的示例,以便做出准确的分类。此外,我们将RADAR扩展到大规模数据集,并观察到与先前研究中人工定义的基准条件的一致性。我们在 radar-analysis.github.io 提供了演示。
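The first-stage retrieval step can be sketched as plain nearest-neighbor search over policy embeddings. The cosine-similarity scoring and the toy embedding vectors below are assumptions for illustration, not the paper's actual embedding model:

```python
import numpy as np

def retrieve_top_k(query_emb, train_embs, k=3):
    """Cosine-similarity retrieval of the k training examples most
    relevant to an evaluation task (first stage of the pipeline)."""
    q = query_emb / np.linalg.norm(query_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ q                    # cosine similarity to every training example
    idx = np.argsort(-sims)[:k]     # highest-similarity examples first
    return idx, sims[idx]

# Toy 2-D "policy embeddings": two near the query, two far from it.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
idx, sims = retrieve_top_k(np.array([1.0, 0.05]), train, k=2)
```

The retrieved subset is what the VLM then compares against the evaluation task along interpretable axes.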
cs.RO / 20 / 2603.11431
A Generalized Theory of Load Distribution in Redundantly-actuated Robotic Systems
冗余驱动机器人系统中载荷分配的广义理论
Abstract
This paper presents a generalized theory which describes how applied loads are distributed within rigid bodies handled by redundantly-actuated robotic systems composed of multiple independent closed-loop kinematic chains. The theory fully characterizes the feasible set of manipulating wrench distributions for a given resultant wrench applied to the rigid body and has important implications for the force-control of multifingered grippers, legged robots, cooperating robots, and other overconstrained mechanisms. We also derive explicit solutions to the wrench synthesis and wrench analysis problems. These solutions are computationally efficient and scale linearly with the number of applied wrenches, requiring neither numerical methods nor the inversion of large matrices. Finally, we identify significant shortcomings in current state-of-the-art approaches and propose corrections. These are supported by illustrative examples that demonstrate the advantages of the improved methods.
Chinese Translation
本文提出了一种广义理论,描述了在由多个独立闭环运动链组成的冗余驱动机器人系统中,施加载荷如何在刚体内部分配。该理论全面表征了针对施加于刚体的给定合成力矩的可行操作力矩分配集,并对多指抓手、腿式机器人、协作机器人及其他超约束机制的力控制具有重要意义。我们还推导了力矩合成和力矩分析问题的显式解。这些解在计算上高效,并且随着施加力矩数量的增加线性扩展,无需数值方法或大矩阵的求逆。最后,我们识别出现有最先进方法中的显著缺陷,并提出了修正方案。这些修正通过示例得到了支持,展示了改进方法的优势。
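A common baseline for the wrench-distribution problem above is the minimum-norm solution obtained from the pseudoinverse of the stacked grasp map. This generic sketch is not the paper's explicit linear-time solution (which avoids inverting large matrices); the random wrench bases are purely illustrative:

```python
import numpy as np

def least_norm_distribution(G_list, w_resultant):
    """Distribute a resultant wrench among k contacts.

    G_list      : list of (6, m_i) contact wrench bases
    w_resultant : (6,) resultant wrench applied to the rigid body
    returns     : per-contact coefficient vectors f_i with
                  sum_i G_i f_i = w_resultant (minimum-norm choice)
    """
    G = np.hstack(G_list)                  # (6, sum m_i) stacked grasp map
    f = np.linalg.pinv(G) @ w_resultant    # least-norm particular solution
    out, j = [], 0
    for Gi in G_list:
        out.append(f[j:j + Gi.shape[1]])
        j += Gi.shape[1]
    return out

rng = np.random.default_rng(1)
G_list = [rng.standard_normal((6, 6)) for _ in range(3)]   # 3 toy contacts
w_ext = rng.standard_normal(6)
forces = least_norm_distribution(G_list, w_ext)
recon = sum(G @ f for G, f in zip(G_list, forces))
```

Any element of the null space of the stacked map can be added to `f` without changing the resultant, which is exactly the feasible set the paper characterizes.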
cs.RO / 21 / 2603.11447
Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation
通过群体竞争学习增强轻量级视觉语言模型以实现社会合规导航
Abstract
Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model to achieve F1 scores of 0.968 and 0.914, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). However, with GCL, the 3B model outperforms the 8B baseline by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
Chinese Translation
社会机器人导航需要对场景语义和人类社会规范进行复杂的整合。扩大视觉语言模型(VLMs)的规模通常会提高社会合规导航的推理和决策能力。然而,模型规模的增加会带来显著的计算开销,限制了其在实时机器人部署中的适用性。相反,轻量级VLMs能够实现高效推理,但在社会复杂环境中往往表现出较弱的推理和决策性能。实现强大的推理能力与高效性之间的平衡仍然是一个开放的挑战。为了解决这一问题,我们提出了群体竞争学习(Group Competitive Learning, GCL),这是一种旨在增强轻量级VLM能力的策略。我们的策略引入了群体竞争目标(Group Competitive Objective, GCO),以协调全局语义与分布正则化,并结合不对称群体优化(Asymmetric Group Optimization, AGO)来探索模型性能的上限。在社会导航基准上的实证评估表明,GCL显著提升了VLM的性能。具体而言,GCL使Qwen2.5-VL-3B学习模型和Qwen3-VL-4B指导模型分别达到了0.968和0.914的F1分数,分别比传统的监督微调(Supervised Fine-Tuning, SFT)提高了40%和12%。值得注意的是,在传统SFT下,3B模型最初落后于8B模型(F1: 0.692 vs. 0.755)。然而,通过GCL,3B模型超越了8B基线模型(提高28%)。这些结果表明,GCL为在实际部署中实现高准确性与计算效率提供了有效的解决方案。
cs.RO / 22 / 2603.11461
CoViLLM: An Adaptive Human-Robot Collaborative Assembly Framework Using Large Language Models for Manufacturing
CoViLLM:一种面向制造的基于大语言模型的自适应人机协作装配框架
Abstract
With increasing demand for mass customization, traditional manufacturing robots that rely on rule-based operations lack the flexibility to accommodate customized or new product variants. Human-Robot Collaboration (HRC) has demonstrated potential to improve system adaptability by leveraging human versatility and decision-making capabilities. However, existing HRC frameworks typically depend on predefined perception-manipulation pipelines, limiting their ability to autonomously generate task plans for new product assembly. In this work, we propose CoViLLM, an adaptive human-robot collaborative assembly framework that supports the assembly of customized and previously unseen products. CoViLLM combines depth-camera-based localization for object position estimation, human operator classification for identifying new components, and a Large Language Model (LLM) for assembly task planning based on natural language instructions. The framework is validated on the NIST Assembly Task Board for known, customized, and new product cases. Experimental results show that the proposed framework enables flexible collaborative assembly by extending HRC beyond predefined product and task settings.
Chinese Translation
随着对大规模定制的需求不断增加,依赖于规则操作的传统制造机器人缺乏灵活性,无法适应定制或新产品变体。人机协作(HRC)已显示出通过利用人类的多样性和决策能力来提高系统适应性的潜力。然而,现有的HRC框架通常依赖于预定义的感知-操作管道,限制了它们为新产品装配自主生成任务计划的能力。在本研究中,我们提出了CoViLLM,一种支持定制和以前未见产品装配的自适应人机协作装配框架。CoViLLM结合了基于深度摄像头的定位用于物体位置估计、人类操作员分类用于识别新组件,以及基于自然语言指令的装配任务规划的大语言模型(LLM)。该框架在NIST装配任务板上对已知、定制和新产品案例进行了验证。实验结果表明,所提出的框架通过将HRC扩展到预定义产品和任务设置之外,实现了灵活的协作装配。
cs.RO / 23 / 2603.11470
NFPO: Stabilized Policy Optimization of Normalizing Flow for Robotic Policy Learning
NFPO:用于机器人策略学习的归一化流的稳定策略优化
Abstract
Deep Reinforcement Learning (DRL) has experienced significant advancements in recent years and has been widely used in many fields. In DRL-based robotic policy learning, however, current de facto policy parameterization is still multivariate Gaussian (with diagonal covariance matrix), which lacks the ability to model multi-modal distribution. In this work, we explore the adoption of a modern network architecture, i.e. Normalizing Flow (NF) as the policy parameterization for its ability of multi-modal modeling, closed form of log probability and low computation and memory overhead. However, naively training NF in online Reinforcement Learning (RL) usually leads to training instability. We provide a detailed analysis for this phenomenon and successfully address it via simple but effective technique. With extensive experiments in multiple simulation environments, we show our method, NFPO could obtain robust and strong performance in widely used robotic learning tasks and successfully transfer into real-world robots.
Chinese Translation
深度强化学习(DRL)近年来取得了显著进展,并在许多领域得到了广泛应用。然而,在基于DRL的机器人策略学习中,目前实际使用的策略参数化仍然是多元高斯分布(具有对角协方差矩阵),这缺乏建模多模态分布的能力。在本研究中,我们探索了采用现代网络架构,即归一化流(Normalizing Flow, NF)作为策略参数化的方法,因其具备多模态建模能力、对数概率的封闭形式以及较低的计算和内存开销。然而,在在线强化学习(Reinforcement Learning, RL)中,简单地训练NF通常会导致训练不稳定。我们对此现象进行了详细分析,并通过简单但有效的技术成功解决了这一问题。通过在多个仿真环境中的广泛实验,我们展示了我们的方法NFPO能够在广泛使用的机器人学习任务中获得稳健且强大的性能,并成功转移到现实世界的机器人中。
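The closed-form log-probability that motivates NF policies comes from invertible layers such as the affine coupling below. This toy numpy layer (fixed random weights, no training) only illustrates the exact inverse and log-determinant of a RealNVP-style coupling, not the paper's policy architecture or its stabilization technique:

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """RealNVP-style affine coupling: the first half of the vector
    conditions a scale/shift applied to the second half, giving an
    invertible map with a closed-form log-determinant."""
    def __init__(self, dim, hidden=16):
        self.d = dim // 2
        self.W1 = 0.1 * rng.standard_normal((hidden, self.d))
        self.W2 = 0.1 * rng.standard_normal((2 * (dim - self.d), hidden))

    def _scale_shift(self, x1):
        h = np.tanh(self.W1 @ x1)
        s, t = np.split(self.W2 @ h, 2)
        return np.tanh(s), t          # bound the log-scale for stability

    def forward(self, x):
        x1, x2 = x[:self.d], x[self.d:]
        s, t = self._scale_shift(x1)
        z2 = x2 * np.exp(s) + t
        return np.concatenate([x1, z2]), s.sum()   # log|det J| = sum(s)

    def inverse(self, z):
        z1, z2 = z[:self.d], z[self.d:]
        s, t = self._scale_shift(z1)
        return np.concatenate([z1, (z2 - t) * np.exp(-s)])

layer = AffineCoupling(dim=4)
x = rng.standard_normal(4)
z, logdet = layer.forward(x)
x_rec = layer.inverse(z)
```

Because the log-determinant is just `s.sum()`, the policy's log-probability under a Gaussian base distribution is exact and cheap, which is the property the abstract highlights over diagonal-Gaussian parameterizations.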
cs.RO / 24 / 2603.11480
SPARK: Skeleton-Parameter Aligned Retargeting on Humanoid Robots with Kinodynamic Trajectory Optimization
SPARK:基于骨架参数对齐的类人机器人重定向与运动动力学轨迹优化
Abstract
Human motion provides rich priors for training general-purpose humanoid control policies, but raw demonstrations are often incompatible with a robot's kinematics and dynamics, limiting their direct use. We present a two-stage pipeline for generating natural and dynamically feasible motion references from task-space human data. First, we convert human motion into a unified robot description format (URDF)-based skeleton representation and calibrate it to the target humanoid's dimensions. By aligning the underlying skeleton structure rather than heuristically modifying task-space targets, this step significantly reduces inverse kinematics error and tuning effort. Second, we refine the retargeted trajectories through progressive kinodynamic trajectory optimization (TO), solved in three stages: kinematic TO, inverse dynamics, and full kinodynamic TO, each warm-started from the previous solution. The final result yields dynamically consistent state trajectories and joint torque profiles, providing high-quality references for learning-based controllers. Together, skeleton calibration and kinodynamic TO enable the generation of natural, physically consistent motion references across diverse humanoid platforms.
Chinese Translation
人类运动为训练通用类人控制策略提供了丰富的先验知识,但原始示范往往与机器人的运动学和动力学不兼容,限制了其直接使用。我们提出了一种两阶段流程,从任务空间的人类数据中生成自然且动态可行的运动参考。首先,我们将人类运动转换为基于统一机器人描述格式(URDF)的骨架表示,并将其校准到目标类人机器人的尺寸。通过对齐底层骨架结构,而不是启发式地修改任务空间目标,这一步显著减少了逆运动学误差和调试工作。其次,我们通过渐进式运动动力学轨迹优化(TO)来优化重定向的轨迹,该过程分为三个阶段进行求解:运动学TO、逆动力学和完整的运动动力学TO,每个阶段都从前一个解决方案进行热启动。最终结果产生了动态一致的状态轨迹和关节扭矩曲线,为基于学习的控制器提供了高质量的参考。骨架校准和运动动力学TO共同使得在多样的类人机器人平台上生成自然且物理一致的运动参考成为可能。
cs.RO / 25 / 2603.11537
MiNI-Q: A Miniature, Wire-Free Quadruped with Unbounded, Independently Actuated Leg Joints
MiNI-Q:一种微型无线四足机器人,具有无限制的独立驱动腿关节
Abstract
Physical joint limits are common in legged robots and can restrict workspace, constrain gait design, and increase the risk of hardware damage. This paper introduces MiNI-Q^2, a miniature, wire-free quadruped robot with independently actuated, mechanically unbounded 2-DOF leg joints. We present the mechanical design, kinematic analysis, and experimental validation of the proposed robot. The leg mechanism enables both oscillatory gaits and rotary locomotion while allowing the robot to fold to a minimum height of 2.5 cm. Experimentally, MiNI-Q achieves speeds up to 0.46 m/s and demonstrates low-clearance crawling, stair climbing, inverted locomotion, jumping, and backflipping. The wire-free architecture extends our previous Q8bot design, improving assembly reliability at miniature scale. All mechanical and electrical design files are released open source to support reproducibility and further research.
Chinese Translation
物理关节限制在腿式机器人中很常见,可能限制工作空间、约束步态设计并增加硬件损坏的风险。本文介绍了MiNI-Q^2,一种微型无线四足机器人,具有独立驱动的机械无界2自由度(2-DOF)腿关节。我们展示了该机器人的机械设计、运动学分析和实验验证。该腿部机构既能实现振荡步态,也能实现旋转运动,同时使机器人可折叠至最低高度2.5厘米。在实验中,MiNI-Q的速度可达0.46米/秒,并展示了低间隙爬行、爬楼梯、倒立运动、跳跃和后空翻。无线架构扩展了我们之前的Q8bot设计,提高了微型尺度下的组装可靠性。所有机械和电气设计文件均已开源,以支持可重复性和进一步研究。
cs.RO / 26 / 2603.11558
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
RoboClaw:一种面向可扩展长时间跨度机器人任务的智能体框架
Abstract
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi-policy robustness. Experiments in real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long-horizon tasks and reducing human time investment by 53.7%.
Chinese Translation
视觉-语言-行动(VLA)系统在语言驱动的机器人操作中展现出强大的潜力。然而,将其扩展到长时间跨度的任务仍然具有挑战性。现有的流程通常将数据收集、策略学习和部署分开,导致对手动环境重置的高度依赖以及脆弱的多策略执行。我们提出了RoboClaw,一个智能机器人框架,将数据收集、策略学习和任务执行统一在一个VLM驱动的控制器下。在策略层面,RoboClaw引入了纠缠动作对(EAP),将前向操作行为与逆向恢复动作结合,以形成自重置循环,实现自主数据收集。该机制使得在最小人类干预的情况下,能够持续进行在线策略数据获取和迭代策略优化。在部署过程中,同一代理执行高层次推理,并动态协调学习到的策略原语,以完成长时间跨度的任务。通过在数据收集和执行之间保持一致的上下文语义,RoboClaw减少了两个阶段之间的不匹配,提高了多策略的鲁棒性。在真实世界的操作任务中的实验表明,与传统的开放循环流程相比,RoboClaw在稳定性和可扩展性上有所改善,同时显著减少了机器人生命周期中的人力投入,在长时间跨度任务上,相较于基线方法成功率提高了25%,人类时间投入减少了53.7%。
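The self-resetting loop formed by entangled forward/inverse actions can be sketched as a LIFO undo stack: each forward action pushes its recovery action, and a reset pops them in strict reverse causal order. The executor class and toy pick/place actions below are illustrative, not RoboClaw's actual interface:

```python
# Minimal sketch: each executed forward action pushes its inverse onto a
# stack; resetting the environment pops inverses in strict LIFO order.
class ResettableExecutor:
    def __init__(self):
        self._undo = []          # stack of inverse (recovery) actions
        self.log = []

    def execute(self, name, do, undo):
        do()
        self.log.append(("do", name))
        self._undo.append((name, undo))

    def reset(self):
        while self._undo:        # Last-In, First-Out causal order
            name, undo = self._undo.pop()
            undo()
            self.log.append(("undo", name))

state = []
ex = ResettableExecutor()
ex.execute("pick", lambda: state.append("held"), lambda: state.remove("held"))
ex.execute("place", lambda: state.append("placed"), lambda: state.remove("placed"))
ex.reset()
```

The LIFO discipline matters because later actions generally depend on earlier ones; undoing in reverse order guarantees each inverse runs against the state its forward action produced.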
cs.RO / 27 / 2603.11586
Unsupervised LiDAR-Based Multi-UAV Detection and Tracking Under Extreme Sparsity
极端稀疏条件下基于LiDAR的无监督多无人机检测与跟踪
Abstract
Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.
Chinese Translation
非重复固态LiDAR扫描导致在检测空中无人机时出现极其稀疏的测量状态:在10-25米的距离内,小型四旋翼通常每次扫描仅产生1-2个返回点,这远低于大多数现有检测方法假设的点密度,且不足以实现稳健的多目标数据关联。我们提出了一种无监督的仅基于LiDAR的流程,解决了检测和跟踪问题,而无需标记的训练数据。该检测器将范围自适应的DBSCAN聚类与三阶段时间一致性检查相结合,并在八种不同参数配置下对真实世界的空对空飞行数据进行了基准测试。最佳配置达到了0.891的精度、0.804的召回率和0.63米的均方根误差(RMSE),系统的minPts参数扫描验证了大多数扫描最多包含1-2个目标点,直接量化了稀疏状态。对于多目标跟踪,我们比较了确定性匈牙利分配与联合概率数据关联(JPDA),每种方法都与交互多模型滤波相结合,并在四个模糊程度逐渐增加的模拟场景中进行了比较。JPDA将身份切换减少了64%,对多目标跟踪精度(MOTA)的影响微乎其微,表明当无人机轨迹彼此接近时,概率关联具有优势。一种结合真实世界检测与RTK-GPS地面真值以及基于模拟的跟踪与身份标注地面真值的双环境评估策略,克服了在无人机间距离低于2米时仅依赖GNSS评估的局限性。
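The range-adaptive clustering idea can be sketched with a neighborhood radius that grows with sensor range, so that the 1-2 returns a distant quadrotor produces can still form a cluster. The connected-component clustering, the eps schedule, and the toy returns below are simplifications of the paper's DBSCAN-based detector:

```python
import numpy as np

def range_adaptive_eps(r, eps0=0.2, k=0.02):
    """Neighborhood radius grows with range: sparser returns at long
    range need a larger eps to stay connected (illustrative schedule)."""
    return eps0 + k * r

def cluster(points):
    """Connected-component clustering with a range-adaptive radius --
    a simplified stand-in for range-adaptive DBSCAN."""
    n = len(points)
    ranges = np.linalg.norm(points, axis=1)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            eps = range_adaptive_eps(max(ranges[i], ranges[j]))
            if np.linalg.norm(points[i] - points[j]) < eps:
                parent[find(i)] = find(j)   # merge the two components
    labels = np.array([find(i) for i in range(n)])
    _, labels = np.unique(labels, return_inverse=True)   # relabel 0..k-1
    return labels

# Two sparse "UAV" returns at ~20 m (close together) and one return at ~5 m.
pts = np.array([[20.0, 0.0, 3.0], [20.3, 0.1, 3.0], [5.0, 1.0, 1.0]])
labels = cluster(pts)
```

With a fixed eps the two long-range returns would fall in separate clusters; scaling eps with range keeps them together without over-merging nearby points.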
cs.RO / 28 / 2603.11634
Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets
可实际测量的多样性:一种快速的无模型机器人数据集多样性度量
Abstract
Robotics datasets for imitation learning typically consist of long-horizon trajectories of different lengths over states, actions, and high-dimensional observations (e.g., RGB video), making it non-trivial to quantify diversity in a way that respects the underlying trajectory structure and geometry. We extend Shannon and von Neumann entropy to this setting by defining signature transform-based entropy on the Gram matrix of a signature kernel over demonstrations, yielding entropy and diversity metrics that operate directly on the demonstration dataset. Building on these metrics, we study how dataset diversity affects generalization performance in robot imitation learning and propose a simple, model-free way to curate diverse demonstrations. We introduce FAKTUAL (FAst trajectory Kernel enTropy cUration for imitation Learning), a data curation algorithm that selects a subset of demonstrations maximizing entropy given a subset-size budget. FAKTUAL is fully model-free, requires no access to the imitation policy or rollouts, and adds negligible overhead relative to policy training. We evaluate our approach on image and state-based RoboMimic and MetaWorld benchmarks, as well as four real-world manipulation tasks. Across tasks and architectures, diversity-aware curation with FAKTUAL consistently improves downstream success rates over random selection, while being substantially more computationally efficient compared to recent robot data curation methods. Our results suggest that the entropy of demonstration datasets is a practical tool for understanding and improving dataset diversity in robot imitation learning.
Chinese Translation
模仿学习的机器人数据集通常由不同长度的长时间轨迹组成,这些轨迹涉及状态、动作和高维观测(例如,RGB视频),因此以尊重基础轨迹结构和几何的方式量化多样性并非易事。我们通过在演示的Gram矩阵上定义基于签名变换的熵,将香农熵和冯·诺依曼熵扩展到这一设置,从而产生直接作用于演示数据集的熵和多样性度量。在这些度量的基础上,我们研究了数据集多样性如何影响机器人模仿学习中的泛化性能,并提出了一种简单的无模型方式来策划多样化的演示。我们引入了FAKTUAL(快速轨迹核熵策划用于模仿学习),这是一种数据策划算法,选择一个演示子集以最大化在给定子集大小预算下的熵。FAKTUAL完全无模型,不需要访问模仿策略或回放,并且相对于策略训练增加的开销微乎其微。我们在基于图像和状态的RoboMimic和MetaWorld基准测试以及四个真实世界的操作任务上评估了我们的方法。在各种任务和架构中,使用FAKTUAL进行的多样性感知策划在随机选择的基础上始终提高了下游成功率,同时在计算效率上显著优于最近的机器人数据策划方法。我们的结果表明,演示数据集的熵是理解和改善机器人模仿学习中数据集多样性的实用工具。
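The von Neumann entropy over a Gram matrix, and greedy budgeted subset selection to maximize it, can be sketched directly. The RBF kernel over plain feature vectors below is a stand-in for the signature kernel over trajectories, and the greedy loop is one plausible reading of FAKTUAL's budgeted entropy maximization, not its actual implementation:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """RBF kernel Gram matrix -- a stand-in for the signature kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def von_neumann_entropy(K):
    """Entropy of the trace-normalized Gram matrix:
    -sum_i p_i log p_i over its eigenvalues."""
    p = np.linalg.eigvalsh(K / np.trace(K))
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

def greedy_curate(X, budget):
    """Greedily add the demonstration that most increases subset entropy."""
    chosen = []
    for _ in range(budget):
        best, best_h = None, -np.inf
        for i in range(len(X)):
            if i in chosen:
                continue
            h = von_neumann_entropy(rbf_gram(X[chosen + [i]]))
            if h > best_h:
                best, best_h = i, h
        chosen.append(best)
    return chosen

# Three near-duplicate demos plus one distinct demo (toy feature vectors).
X = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [3.0, 3.0]])
sel = greedy_curate(X, budget=2)
```

Near-duplicate demonstrations yield a nearly rank-one Gram matrix (entropy close to zero), so the greedy pass prefers the distinct demonstration, which is the diversity-seeking behavior the abstract describes.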
cs.RO / 29 / 2603.11638
Learn Structure, Adapt on the Fly: Multi-Scale Residual Learning and Online Adaptation for Aerial Manipulators
学习结构,动态适应:多尺度残差学习与空中操控器的在线适应
Abstract
Autonomous Aerial Manipulators (AAMs) are inherently coupled, nonlinear systems that exhibit nonstationary and multiscale residual dynamics, particularly during manipulator reconfiguration and abrupt payload variations. Conventional analytical dynamic models rely on fixed parametric structures, while static data-driven model assume stationary dynamics and degrade under configuration changes and payload variations. Moreover, existing learning architectures do not explicitly factorize cross-variable coupling and multi-scale temporal effects, conflating instantaneous inertial dynamics with long-horizon regime evolution. We propose a predictive-adaptive framework for real-time residual modeling and compensation in AAMs. The core of this framework is the Factorized Dynamics Transformer (FDT), which treats physical variables as independent tokens. This design enables explicit cross-variable attention while structurally separating short-horizon inertial dependencies from long-horizon aerodynamic effects. To address deployment-time distribution shifts, a Latent Residual Adapter (LRA) performs rapid linear adaptation in the latent space via Recursive Least Squares, preserving the offline nonlinear representation without prohibitive computational overhead. The adapted residual forecast is directly integrated into a residual-compensated adaptive controller. Real-world experiments on an aerial manipulator subjected to unseen payloads demonstrate higher prediction fidelity, accelerated disturbance attenuation, and superior closed-loop tracking precision compared to state-of-the-art learning baselines, all while maintaining strict real-time feasibility.
Chinese Translation
自主空中操控器(AAMs)是固有耦合的非线性系统,表现出非平稳和多尺度的残差动态,尤其是在操控器重新配置和突发载荷变化期间。传统的分析动态模型依赖于固定的参数结构,而静态数据驱动模型假设动态是平稳的,在配置变化和载荷变化下性能下降。此外,现有的学习架构没有明确分解跨变量耦合和多尺度时间效应,将瞬时惯性动态与长时间范围的演变混为一谈。我们提出了一种用于AAMs实时残差建模和补偿的预测适应框架。该框架的核心是因子化动态变换器(Factorized Dynamics Transformer, FDT),它将物理变量视为独立的标记。这种设计使得跨变量注意力的显式化成为可能,同时在结构上将短时间范围的惯性依赖与长时间范围的气动效应分离。为了应对部署时的分布变化,潜在残差适配器(Latent Residual Adapter, LRA)通过递归最小二乘法在潜在空间中执行快速线性适应,保持离线非线性表示而不产生过高的计算开销。经过适应的残差预测被直接整合到残差补偿自适应控制器中。在对未见载荷的空中操控器进行的真实世界实验中,与最先进的学习基线相比,展示了更高的预测精度、更快的干扰衰减和更优的闭环跟踪精度,同时保持严格的实时可行性。
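The Recursive Least Squares update used for latent-space adaptation is standard and can be sketched in a few lines. The dimensions, forgetting factor, and noiseless synthetic data below are illustrative; in the paper the regressor would be the frozen network's latent features rather than raw inputs:

```python
import numpy as np

class RecursiveLeastSquares:
    """Exponentially-forgetting RLS for a linear map y ~ W phi(x),
    sketching rapid deployment-time adaptation in a fixed latent space."""
    def __init__(self, dim_in, dim_out, lam=0.99):
        self.W = np.zeros((dim_out, dim_in))
        self.P = 1e3 * np.eye(dim_in)   # inverse-covariance estimate
        self.lam = lam                  # forgetting factor

    def update(self, phi, y):
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)   # gain vector
        err = y - self.W @ phi               # innovation
        self.W += np.outer(err, k)           # rank-1 weight correction
        self.P = (self.P - np.outer(k, Pphi)) / self.lam

rng = np.random.default_rng(0)
W_true = rng.standard_normal((2, 4))        # hidden residual map
rls = RecursiveLeastSquares(4, 2)
for _ in range(300):
    phi = rng.standard_normal(4)
    rls.update(phi, W_true @ phi)           # stream noiseless samples
```

Each update is O(d^2) with no matrix inversion, which is what makes this style of adaptation feasible under the strict real-time budget the abstract mentions.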
cs.RO / 30 / 2603.11642
Chunk-Boundary Artifact in Action-Chunked Generative Policies: A Noise-Sensitive Failure Mechanism
动作块生成策略中的块边界伪影:一种对噪声敏感的失效机制
Abstract
Action chunking has become a central design choice for generative visuomotor policies, yet the execution discontinuities that arise at chunk boundaries remain poorly understood. In a frozen pretrained action-chunked policy, we identify chunk-boundary artifact as a noise-sensitive failure mechanism. First, artifact is strongly associated with task failure (p < 1e-4, permutation test) and emerges during the rollout rather than only as a post-hoc symptom. Second, under a fixed observation context, changing only latent noise systematically modulates artifact magnitude. Third, by identifying artifact-related directions in noise space and applying trajectory-level steering, we reliably alter artifact magnitude across all evaluated tasks. In hard-task settings with remaining outcome headroom, the success/failure distribution shifts accordingly; on near-ceiling tasks, positive gains are compressed by policy saturation, while the negative causal effect remains visible. Overall, we recast boundary discontinuity from an unavoidable execution nuisance into an analyzable, noise-dominated, and intervenable failure mechanism.
Chinese Translation
动作块化已成为生成视觉运动策略的核心设计选择,但在块边界处出现的执行不连续性仍未得到充分理解。在一个冻结的预训练动作块策略中,我们识别出块边界伪影作为一种对噪声敏感的失效机制。首先,伪影与任务失败之间存在强关联(p < 1e-4,置换检验),并且在执行过程中出现,而不仅仅是事后症状。其次,在固定的观察上下文下,仅改变潜在噪声系统性地调节伪影的幅度。第三,通过识别噪声空间中的伪影相关方向并应用轨迹级别的引导,我们可靠地改变了所有评估任务中的伪影幅度。在具有剩余结果余地的困难任务设置中,成功/失败分布相应地发生变化;在接近上限的任务中,正向增益因策略饱和而被压缩,而负向因果效应仍然可见。总体而言,我们将边界不连续性从一种不可避免的执行麻烦,重新定义为一种可分析的、以噪声为主导的、可干预的失效机制。
cs.RO / 31 / 2603.11649
A Hybrid Neural-Assisted Unscented Kalman Filter for Unmanned Ground Vehicle Navigation
一种用于无人地面车辆导航的混合神经辅助无迹卡尔曼滤波器
Abstract
Modern autonomous navigation for unmanned ground vehicles relies on different estimators to fuse inertial sensors and GNSS measurements. However, constant noise covariance matrices often fail to account for dynamic real-world conditions. In this work, we propose a hybrid estimation framework that bridges classical state estimation foundations with modern deep learning approaches. Instead of altering the fundamental unscented Kalman filter equations, a dedicated deep neural network is developed to predict the process and measurement noise uncertainty directly from raw inertial and GNSS measurements. We present a sim2real approach, with training performed only on simulated data. In this manner, we obtain perfect ground truth data and relieve the burden of extensive data recordings. To evaluate our proposed approach and examine its generalization capabilities, we employed a 160-minute test set drawn from three datasets, each with a different type of vehicle (off-road vehicle, passenger car, and mobile robot), inertial sensors, road surface, and environmental conditions. Across the three datasets, we demonstrate a position improvement of $12.7\%$ compared to the adaptive model-based approach, offering a scalable and more robust solution for unmanned ground vehicle navigation tasks.
Chinese Translation
现代无人地面车辆的自主导航依赖于不同的估计器来融合惯性传感器和全球导航卫星系统(GNSS)测量。然而,恒定的噪声协方差矩阵往往难以应对动态的现实世界条件。在本研究中,我们提出了一种混合估计框架,将经典状态估计基础与现代深度学习方法相结合。我们并未改变基本的无迹卡尔曼滤波器方程,而是开发了一个专门的深度神经网络,直接从原始的惯性和GNSS测量中预测过程和测量噪声的不确定性。我们提出了一种从模拟到现实(sim2real)的方法,训练仅在模拟数据上进行。通过这种方式,我们获得了完美的地面真值数据,并减轻了大量数据记录的负担。为了评估我们提出的方法并检验其泛化能力,我们使用了来自三个数据集的160分钟测试集,每个数据集包含不同类型的车辆(越野车、乘用车和移动机器人)、惯性传感器、路面和环境条件。我们在三个数据集中展示了与基于模型的自适应方法相比,位置精度提升了$12.7\%$。因此,为无人地面车辆导航任务提供了一种可扩展且更为稳健的解决方案。
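The core idea, a filter whose per-step noise covariance comes from a learned predictor, can be sketched with a scalar Kalman filter and a stub in place of the deep network. Everything below (the context-to-std map, the static state, the toy data) is an assumption for illustration, not the paper's UKF implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def predicted_meas_std(context):
    """Stub for the paper's deep network: a hand-made map from a
    'context' feature (e.g. vibration level) to measurement noise std."""
    return 0.2 + 0.8 * context      # std in [0.2, 1.0]

def adaptive_kf(z, contexts):
    """Scalar Kalman filter for a static state, with per-step measurement
    noise R_t supplied by the (stub) noise predictor."""
    x, P = 0.0, 1e3                 # diffuse prior
    for zt, ct in zip(z, contexts):
        R = predicted_meas_std(ct) ** 2
        K = P / (P + R)             # Kalman gain: trusts low-noise steps more
        x = x + K * (zt - x)
        P = (1 - K) * P
    return x

true_pos = 1.0
contexts = rng.uniform(0, 1, 500)
noise = np.array([predicted_meas_std(c) for c in contexts]) * rng.standard_normal(500)
z = true_pos + noise
est = adaptive_kf(z, contexts)
```

Because R_t varies per step, measurements taken in benign conditions are weighted more heavily than noisy ones, which is the behavior a fixed covariance cannot reproduce.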
cs.RO / 32 / 2603.11655
Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks
并发抓握与非抓握操作:多阶段灵巧任务的实用方法
Abstract
Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
Chinese Translation
灵巧手能够实现并发的抓握与非抓握操作,例如在与一个物体互动的同时握住另一个物体,这一能力对于日常任务至关重要,但在机器人领域尚未得到充分探索。学习这种长时间跨度、接触丰富的多阶段行为面临挑战,因为收集演示样本的成本高昂,而端到端策略需要大量数据以便在不同物体几何形状和放置方式之间进行泛化。我们提出了DexMulti,这是一种高效的样本利用方法,旨在实现现实世界中的灵巧多任务操作,它将演示分解为具有明确时间边界的以物体为中心的技能。我们的方法不是学习单体式策略,而是根据当前物体的几何形状检索演示的技能,利用跟踪质心和偏航角的不确定性感知估计器将其对齐到观察到的物体状态,并通过检索-对齐-执行的范式来执行这些技能。我们在三个需要并发操作的多阶段任务(抓握 + 拉动、抓握 + 打开和抓握 + 抓握)上进行了评估,使用了两只灵巧手(Allegro 和 LEAP),进行了超过1000次的真实世界试验。我们的方法在训练物体上实现了66%的平均成功率,每个物体仅需3-4次演示,超越了扩散策略基线2-3倍,同时所需的演示数量远少于基线。结果表明,该方法对未见物体和空间变化(高达±25厘米)具有强大的泛化能力。
cs.RO / 33 / 2603.11658
Coupling Tensor Trains with Graph of Convex Sets: Effective Compression, Exploration, and Planning in the C-Space
将张量列与凸集图结合:在配置空间中的有效压缩、探索与规划
Abstract
We present TANGO (Tensor ANd Graph Optimization), a novel motion planning framework that integrates tensor-based compression with structured graph optimization to enable efficient and scalable trajectory generation. While optimization-based planners such as the Graph of Convex Sets (GCS) offer powerful tools for generating smooth, optimal trajectories, they typically rely on a predefined convex characterization of the high-dimensional configuration space, a requirement that is often intractable for general robotic tasks. TANGO addresses this by using Tensor Train decomposition to approximate the feasible configuration space in a compressed form, enabling rapid discovery and estimation of task-relevant regions. These regions are then embedded into a GCS-like structure, allowing for geometry-aware motion planning that respects both system constraints and environmental complexity. By coupling tensor-based compression with structured graph reasoning, TANGO enables efficient, geometry-aware motion planning and lays the groundwork for more expressive and scalable representations of configuration space in future robotic systems. Rigorous simulation studies on planar and real robots reinforce our claims of effective compression and higher quality trajectories.
Chinese Translation
我们提出了TANGO(张量与图优化),这是一个新颖的运动规划框架,结合了基于张量的压缩与结构化图优化,以实现高效且可扩展的轨迹生成。尽管基于优化的规划器,如凸集图(GCS),为生成平滑、最优的轨迹提供了强大的工具,但它们通常依赖于高维配置空间的预定义凸特征,这一要求对于一般机器人任务往往难以实现。TANGO进一步利用张量列分解以压缩形式近似可行的配置空间,从而快速发现和估计与任务相关的区域。这些区域随后嵌入到类似GCS的结构中,允许进行尊重系统约束和环境复杂性的几何感知运动规划。通过将基于张量的压缩与结构化图推理相结合,TANGO实现了高效的几何感知运动规划,并为未来机器人系统中配置空间的更具表现力和可扩展的表示奠定了基础。在平面和真实机器人上的严格仿真研究进一步验证了我们关于有效压缩和更高质量轨迹的主张。
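Tensor Train compression itself can be sketched with the classic TT-SVD algorithm. The tensor sizes and rank cap below are illustrative, and this is generic TT-SVD rather than TANGO's configuration-space pipeline:

```python
import numpy as np

def tt_decompose(T, max_rank=8):
    """TT-SVD: factor a d-way tensor into a 'train' of 3-way cores
    G_k of shape (r_{k-1}, n_k, r_k) via sequential truncated SVDs."""
    dims = T.shape
    cores, r = [], 1
    M = T.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        M = M.reshape(r * dims[k], -1)           # unfold the current mode
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        rk = min(max_rank, int((s > 1e-12).sum()))
        cores.append(U[:, :rk].reshape(r, dims[k], rk))
        M = s[:rk, None] * Vt[:rk]               # carry the remainder forward
        r = rk
    cores.append(M.reshape(r, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    out = cores[0]
    for C in cores[1:]:
        out = np.tensordot(out, C, axes=([-1], [0]))  # contract shared rank
    return out[0, ..., 0]                             # drop boundary ranks

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4, 4))
cores = tt_decompose(T)
T_hat = tt_reconstruct(cores)
```

Storage drops from the product of all mode sizes to a sum of small core sizes, which is what makes querying a compressed feasibility map over a high-dimensional C-space tractable.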
cs.RO / 34 / 2603.11811
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset
RADAR:通过语义规划和自主因果环境重置的闭环机器人数据生成
Abstract
The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR's exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
Chinese Translation
获取大规模物理交互数据是现代机器人学习的关键前提,但由于人类参与收集模式的高昂成本和可扩展性限制,这一过程受到严重瓶颈。为了解决这一问题,我们提出了鲁棒自主数据采集系统(RADAR),这是一个完全自主的闭环数据生成引擎,完全消除了收集周期中的人类干预。RADAR优雅地将认知负载分为四个模块管道。以2-5个3D人类演示作为几何先验,视觉-语言模型(Vision-Language Model, VLM)首先通过精确的语义对象定位和技能检索来协调与场景相关的任务生成。接下来,图神经网络(Graph Neural Network)策略通过上下文模仿学习将这些子任务转化为物理动作。执行后,VLM使用结构化的视觉问答管道进行自动成功评估。最后,为了打破手动重置的瓶颈,有限状态机(Finite State Machine)协调自主环境重置和非对称数据路由机制。在严格的后进先出因果序列驱动下,系统通过同时的前向-反向规划无缝恢复非结构化工作空间,并稳健地从执行失败中恢复。这种持续的脑-小脑协同将数据收集转变为自我维持的过程。广泛的评估显示RADAR的卓越多样性。在仿真中,我们的框架在复杂的长时间任务上实现了高达90%的成功率,轻松解决了传统基准几乎降至零性能的挑战。在实际部署中,该系统通过少量样本适应可靠地执行多样的、接触丰富的技能(例如,可变形物体操作),无需领域特定的微调,为机器人数据采集提供了高度可扩展的范式。
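The Finite State Machine reset described above follows a strict Last-In, First-Out causal sequence: workspace-changing actions are undone in reverse order of execution. A minimal sketch of that bookkeeping, with hypothetical string action labels and a record/reset interface that is not RADAR's actual API:

```python
# Minimal sketch of a Last-In, First-Out causal reset in the spirit of
# RADAR's autonomous environment reset. The string action labels and the
# record/reset interface are hypothetical, not RADAR's actual API.
class CausalResetStack:
    def __init__(self):
        self._actions = []  # (forward action, inverse action), oldest first

    def record(self, action, inverse):
        """Log a workspace-changing action together with its inverse."""
        self._actions.append((action, inverse))

    def reset_plan(self):
        """Inverse actions in strict LIFO order: last action undone first."""
        return [inverse for _, inverse in reversed(self._actions)]

stack = CausalResetStack()
stack.record("pick(cup)", "place(cup, table)")
stack.record("open(drawer)", "close(drawer)")
plan = stack.reset_plan()  # ["close(drawer)", "place(cup, table)"]
```

Popping in reverse guarantees an action is undone only after every action that depended on its effects has already been undone.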
cs.RO / 35 / 2603.11963
Energy Prediction on Sloping Ground for Quadruped Robots
四足机器人在倾斜地面上的能量预测
Abstract
Energy management is a fundamental challenge for legged robots in outdoor environments. Endurance directly constrains mission success, while efficient resource use reduces ecological impact. This paper investigates how terrain slope and heading orientation influence the energetic cost of quadruped locomotion. We introduce a simple energy model that relies solely on standard onboard sensors, avoids specialized instrumentation, and remains applicable in previously unexplored environments. The model is identified from field runs on a commercial quadruped and expressed as a compact function of slope angle and heading. Field validation on natural terrain shows near-linear trends of force-equivalent cost with slope angle, consistently higher lateral costs, and additive behavior across trajectory segments, supporting path-level energy prediction for planning-oriented evaluation.
Chinese Translation
能量管理是腿式机器人在户外环境中的一项基本挑战。耐力直接限制了任务的成功,而高效的资源利用则减少了生态影响。本文研究了地形坡度和航向方向如何影响四足运动的能量成本。我们提出了一种简单的能量模型,该模型仅依赖于标准的机载传感器,避免了专门的仪器,并且适用于以前未探索的环境。该模型通过对一款商业四足机器人的实地运行进行识别,并以坡度角和航向的紧凑函数形式表达。对自然地形的实地验证显示,力等效成本与坡度角之间呈近线性趋势,侧向成本始终较高,并且在轨迹段之间表现出加法行为,从而支持了面向规划评估的路径级能量预测。
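The three empirical findings reported above (near-linear slope dependence, consistently higher lateral cost, additive segments) suggest a compact cost function of slope and heading. The sketch below illustrates the shape of such a model; the coefficients are placeholders, not the paper's identified parameters:

```python
import math

# Illustrative sketch of a slope-and-heading energy model with the three
# properties reported above: near-linear in slope, extra lateral cost, and
# additive across segments. Coefficients are placeholders, not the paper's
# identified parameters.
def segment_cost(distance_m, slope_deg, heading_deg,
                 c0=1.0, c_slope=0.05, c_lateral=0.3):
    lateral = c_lateral * abs(math.sin(math.radians(heading_deg)))
    per_meter = c0 + c_slope * slope_deg + lateral  # force-equivalent cost
    return per_meter * distance_m

def path_cost(segments):
    # Additivity across segments enables path-level energy prediction.
    return sum(segment_cost(*seg) for seg in segments)

# 10 m straight uphill at 5 deg, then 8 m sideways on a -2 deg slope.
cost = path_cost([(10.0, 5.0, 0.0), (8.0, -2.0, 90.0)])
```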
cs.RO / 36 / 2603.11980
Learning Visuomotor Policy for Multi-Robot Laser Tag Game
多机器人激光标记游戏的视觉运动策略学习
Abstract
In this paper, we study multi-robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches to these tasks face challenges such as limited observability and reliance on depth mapping and inter-robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high-performing teacher policy with multi-agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.
Chinese Translation
在本文中,我们研究了多机器人激光标记,这是一种简化但实用的射击游戏任务。经典的模块化方法在这些任务中面临诸如可观测性有限以及对深度映射和机器人间通信的依赖等挑战。为了解决这些问题,我们提出了一种端到端的视觉运动策略,该策略将图像直接映射到机器人动作。我们通过多智能体强化学习训练了一个高性能的教师策略,并将其知识提炼为基于视觉的学生策略。技术设计方面,包括一个置换不变的特征提取器和深度热图输入,提升了性能,超越了标准架构。我们的策略在命中准确率上比经典方法提高了16.7%,在碰撞避免上提高了6%,并成功部署在真实机器人上。代码将公开发布。
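A permutation-invariant feature extractor of the kind mentioned above can be built from a shared per-entity embedding followed by symmetric pooling, so the output does not depend on the order in which teammates and opponents are listed. A toy sketch with hypothetical shapes and weights (the abstract does not specify the actual architecture):

```python
import numpy as np

# Toy permutation-invariant extractor: a shared per-entity embedding
# followed by symmetric (max) pooling. Shapes and weights are hypothetical;
# the paper's actual architecture is not specified in the abstract.
def perm_invariant_features(entity_feats, weights):
    embedded = np.tanh(entity_feats @ weights)  # same map for every entity
    return embedded.max(axis=0)                 # order-independent pooling

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 8))
entities = rng.standard_normal((3, 4))          # 3 observed robots, 4 features
out_orig = perm_invariant_features(entities, weights)
out_perm = perm_invariant_features(entities[[2, 0, 1]], weights)
```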
cs.RO / 37 / 2603.12020
Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application
应用于水下对接的深度强化学习的仿真到现实适应
Abstract
Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the "sim-to-real" gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking using the Girona Autonomous Underwater Vehicle (AUV) by leveraging a high-fidelity digital twin environment. We adapted the Stonefish simulator into a multiprocessing RL framework to significantly accelerate the learning process while incorporating realistic AUV dynamics, collision models, and sensor noise. Using the Proximal Policy Optimization (PPO) algorithm, we developed a 6-DoF control policy trained in a headless environment with randomized starting positions to ensure generalized performance. Our reward structure accounts for distance, orientation, action smoothness, and adaptive collision penalties to facilitate soft docking. Experimental results demonstrate that the agent achieved a success rate of over 90% in simulation. Furthermore, successful validation in a physical test tank confirmed the efficacy of the sim-to-reality adaptation, with the DRL controller exhibiting emergent behaviors such as pitch-based braking and yaw oscillations to assist in mechanical alignment.
Chinese Translation
深度强化学习(DRL)为自主水下对接提供了一种强有力的替代传统控制方法的方案,特别是在适应不可预测的环境条件方面。然而,弥合“仿真到现实”的差距以及管理高训练延迟仍然是实际部署中的重大瓶颈。本文提出了一种系统的方法,通过利用高保真数字双胞胎环境,使用赫罗纳自主水下航行器(AUV)进行自主对接。我们将石鱼(Stonefish)模拟器适配为多进程强化学习框架,以显著加速学习过程,同时融入现实的AUV动态、碰撞模型和传感器噪声。使用近端策略优化(Proximal Policy Optimization, PPO)算法,我们开发了一种在无头环境中训练的6自由度控制策略,随机起始位置以确保泛化性能。我们的奖励结构考虑了距离、方向、动作平滑性和自适应碰撞惩罚,以促进软对接。实验结果表明,代理在仿真中达到了超过90%的成功率。此外,在物理测试水槽中的成功验证确认了仿真到现实适应的有效性,DRL控制器表现出如基于俯仰的制动和偏航振荡等涌现行为,以帮助机械对齐。
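The reward structure described above combines distance, orientation, action smoothness, and an adaptive collision penalty. A minimal sketch of such a shape, where the weights and the speed-scaled penalty are illustrative assumptions rather than the paper's tuned values:

```python
import numpy as np

# Illustrative sketch of the reward terms listed above: distance,
# orientation, action smoothness, and an adaptive collision penalty that
# grows with impact speed to encourage soft docking. Weights are assumed.
def docking_reward(dist, ang_err, action, prev_action,
                   collided, impact_speed,
                   w_d=1.0, w_a=0.5, w_s=0.1, w_c=10.0):
    r = -w_d * dist - w_a * abs(ang_err)
    r -= w_s * float(np.sum((action - prev_action) ** 2))  # smoothness
    if collided:
        r -= w_c * (1.0 + impact_speed)  # adaptive: harder hits cost more
    return r

a = np.zeros(6)  # 6-DoF action
r_soft = docking_reward(0.2, 0.1, a, a, collided=True, impact_speed=0.05)
r_hard = docking_reward(0.2, 0.1, a, a, collided=True, impact_speed=1.0)
```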
cs.RO / 38 / 2603.12059
Flight through Narrow Gaps with Morphing-Wing Drones
通过变形翼无人机穿越狭窄间隙
Abstract
The size of a narrow gap traversable by a fixed-wing drone is limited by its wingspan. Inspired by birds, here, we enable the traversal of a gap of sub-wingspan width and height using a morphing-wing drone capable of temporarily sweeping in its wings mid-flight. This maneuver poses control challenges due to sudden lift loss during gap-passage at low flight speeds and the need for precisely timed wing-sweep actuation ahead of the gap. To address these challenges, we first develop an aerodynamic model for general wing-sweep morphing drone flight including low flight speeds and post-stall angles of attack. We integrate longitudinal drone dynamics into an optimal reference trajectory generation and Nonlinear Model Predictive Control framework with runtime adaptive costs and constraints. Validated on a 130 g wing-sweep-morphing drone, our method achieves an average altitude error of 5 cm during narrow-gap passage at forward speeds between 5 and 7 m/s, whilst enforcing fully swept wings near the gap across variable threshold distances. Trajectory analysis shows that the drone can compensate for lift loss during gap-passage by accelerating and pitching upwards ahead of the gap to an extent that differs between reference trajectory optimization objectives. We show that our strategy also allows for accurate gap passage on hardware whilst maintaining a constant forward flight speed reference and near-constant altitude.
Chinese Translation
固定翼无人机可穿越的狭窄间隙的大小受到其翼展的限制。受到鸟类的启发,我们使得一种能够在飞行中暂时收缩机翼的变形翼无人机能够穿越宽度和高度均小于翼展的间隙。这一机动在低飞行速度下穿越间隙时会面临控制挑战,因为在穿越过程中会突然失去升力,并且需要在接近间隙之前精确时机地进行机翼收缩。为了解决这些挑战,我们首先开发了一种适用于一般机翼收缩变形无人机飞行的气动模型,包括低飞行速度和失速攻角。我们将纵向无人机动力学整合到一个具有运行时自适应成本和约束的最优参考轨迹生成和非线性模型预测控制框架中。在一架130克的机翼收缩变形无人机上进行验证,我们的方法在5到7米/秒的前进速度下,穿越狭窄间隙时实现了平均高度误差为5厘米,同时在可变阈值距离下保持机翼完全收缩。轨迹分析表明,无人机可以通过在接近间隙之前加速并向上俯仰来补偿穿越过程中的升力损失,而这一程度在不同的参考轨迹优化目标之间有所不同。我们展示了我们的策略还允许在硬件上实现准确的间隙穿越,同时保持恒定的前进飞行速度参考和近乎恒定的高度。
cs.RO / 39 / 2603.12075
Decentralized Cooperative Localization for Multi-Robot Systems with Asynchronous Sensor Fusion
基于异步传感器融合的多机器人系统去中心化协作定位
Abstract
Decentralized cooperative localization (DCL) is a promising approach for nonholonomic mobile robots operating in GPS-denied environments with limited communication infrastructure. This paper presents a DCL framework in which each robot performs localization locally using an Extended Kalman Filter, while sharing measurement information during update stages only when communication links are available and companion robots are successfully detected by LiDAR. The framework preserves cross-correlation consistency among robot state estimates while handling asynchronous sensor data with heterogeneous sampling rates and accommodating accelerations during dynamic maneuvers. Unlike methods that require pre-aligned coordinate systems, the proposed approach allows robots to initialize with arbitrary reference-frame orientations and achieves automatic alignment through transformation matrices in both the prediction and update stages. To improve robustness in feature-sparse environments, we introduce a dual-landmark evaluation framework that exploits both static environmental features and mobile robots as dynamic landmarks. The proposed framework enables reliable detection and feature extraction during sharp turns, while prediction accuracy is improved through information sharing from mutual observations. Experimental results in both Gazebo simulation and real-world basement environments show that DCL outperforms centralized cooperative localization (CCL), achieving a 34% reduction in RMSE, while the dual-landmark variant yields an improvement of 56%. These results demonstrate the applicability of DCL to challenging domains such as enclosed spaces, underwater environments, and feature-sparse terrains where conventional localization methods are ineffective.
Chinese Translation
去中心化协作定位(DCL)是一种有前景的方法,适用于在GPS信号缺失且通信基础设施有限的环境中操作的非完整移动机器人。本文提出了一种DCL框架,其中每个机器人使用扩展卡尔曼滤波器进行本地定位,并在更新阶段仅在通信链路可用且通过激光雷达成功检测到伴随机器人时共享测量信息。该框架在处理异步传感器数据和异构采样率时,保持机器人状态估计之间的交叉相关一致性,并在动态机动过程中适应加速度。与需要预对齐坐标系的方法不同,所提出的方法允许机器人以任意参考框架方向初始化,并通过预测和更新阶段的变换矩阵实现自动对齐。为了提高在特征稀疏环境中的鲁棒性,我们引入了一种双地标评估框架,利用静态环境特征和移动机器人作为动态地标。所提出的框架能够在急转弯时可靠地进行检测和特征提取,同时通过相互观察的信息共享提高预测精度。在Gazebo仿真和真实地下环境中的实验结果表明,DCL优于集中式协作定位(CCL),RMSE减少了34%,而双地标变体的改进幅度达到56%。这些结果证明了DCL在封闭空间、水下环境和特征稀疏地形等传统定位方法无效的挑战性领域中的适用性。
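The automatic frame alignment described above hinges on composing rotations between arbitrarily oriented initial reference frames. A minimal 2D sketch (hypothetical function names, planar simplification) of mapping a companion robot's shared position estimate into the local frame:

```python
import numpy as np

# Minimal 2D sketch of aligning arbitrarily oriented reference frames via
# rotation matrices, as the prediction/update alignment above requires.
# Function names and the planar simplification are assumptions.
def rot2d(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def to_local_frame(pos_in_other, other_frame_yaw, local_frame_yaw):
    # Rotate by the relative yaw between the two robots' initial frames.
    return rot2d(other_frame_yaw - local_frame_yaw) @ pos_in_other

# A companion reports (1, 0) in a frame rotated 90 degrees from ours.
p_local = to_local_frame(np.array([1.0, 0.0]), np.pi / 2, 0.0)
```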
cs.RO / 40 / 2603.12099
Towards Dynamic Model Identification and Gravity Compensation for the dVRK-Si Patient Side Manipulator
面向 dVRK-Si 患者侧操控器的动态模型识别与重力补偿
Abstract
The da Vinci Research Kit (dVRK) is widely used for research in robot-assisted surgery, but most modeling and control methods target the first-generation dVRK Classic. The recently introduced dVRK-Si, built from da Vinci Si hardware, features a redesigned Patient Side Manipulator (PSM) with substantially larger gravity loading, which can degrade control if unmodeled. This paper presents the first complete kinematic and dynamic modeling framework for the dVRK-Si PSM. We derive a modified DH kinematic model that captures the closed-chain parallelogram mechanism, formulate dynamics via the Euler-Lagrange method, and express inverse dynamics in a linear-in-parameters regressor form. Dynamic parameters are identified from data collected on a periodic excitation trajectory optimized for numerical conditioning and estimated by convex optimization with physical feasibility constraints. Using the identified model, we implement real-time gravity compensation and computed-torque feedforward in the dVRK control stack. Experiments on a physical dVRK-Si show that the gravity compensation reduces steady-state joint errors by 68-84% and decreases end-effector tip drift during static holds from 4.2 mm to 0.7 mm. Computed-torque feedforward further improves transient and position tracking accuracy. For sinusoidal trajectory tracking, computed-torque feedforward reduces position errors by 35% versus gravity-only feedforward and by 40% versus PID-only. The proposed pipeline supports reliable control, high-fidelity simulation, and learning-based automation on the dVRK-Si.
Chinese Translation
达芬奇研究工具包(dVRK)广泛应用于机器人辅助外科手术的研究,但大多数建模和控制方法针对的是第一代 dVRK Classic。最近推出的 dVRK-Si 基于达芬奇 Si 硬件,具有重新设计的患者侧操控器(PSM),其重力负载显著增大,若未建模可能会降低控制效果。本文提出了 dVRK-Si PSM 的第一个完整的运动学和动力学建模框架。我们推导出了一种修改的 DH 运动学模型,以捕捉闭链平行四边形机制,通过欧拉-拉格朗日方法构建动力学,并以线性参数回归形式表达逆动力学。动态参数通过在优化数值条件的周期激励轨迹上收集的数据进行识别,并通过具有物理可行性约束的凸优化进行估计。利用识别出的模型,我们在 dVRK 控制堆栈中实现了实时重力补偿和计算扭矩前馈。在物理 dVRK-Si 上的实验表明,重力补偿将稳态关节误差降低了 68-84%,并将静态保持期间末端执行器尖端漂移从 4.2 mm 降低到 0.7 mm。计算扭矩前馈进一步提高了瞬态和位置跟踪精度。在正弦轨迹跟踪中,计算扭矩前馈相比仅重力前馈减少了 35% 的位置误差,相比仅 PID 前馈减少了 40%。所提出的流程支持在 dVRK-Si 上的可靠控制、高保真仿真和基于学习的自动化。
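The linear-in-parameters regressor form mentioned above, tau = Y(q, qd, qdd) @ theta, reduces identification to a linear estimation problem. The toy sketch below uses a random regressor and plain least squares; the paper instead collects data on an excitation trajectory optimized for conditioning and estimates with physical-feasibility constraints via convex optimization:

```python
import numpy as np

# Toy illustration of linear-in-parameters identification: with inverse
# dynamics tau = Y(q, qd, qdd) @ theta, stacking regressor rows over a
# trajectory turns identification into least squares. The random regressor
# and plain lstsq are simplifications of the paper's constrained pipeline.
rng = np.random.default_rng(1)
theta_true = np.array([2.0, -0.5, 1.5])                 # unknown parameters
Y = rng.standard_normal((200, 3))                       # stacked regressor
tau = Y @ theta_true + 0.01 * rng.standard_normal(200)  # noisy torques
theta_hat, *_ = np.linalg.lstsq(Y, tau, rcond=None)
```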
cs.RO / 41 / 2603.12120
CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
CRAFT:一种具有混合硬软顺应性的腱驱动手
Abstract
We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rolling-contact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the fingers drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with vision-based teleoperation and simulation integration. Project page: http://craft-hand.github.io/
Chinese Translation
我们介绍了CRAFT手,这是一种腱驱动的人形手,具有混合硬软顺应性,适用于接触丰富的操作。设计基于一个简单的理念:手部的接触并不均匀。冲击集中在关节上,而连杆承载大部分负载。CRAFT在关节处使用软材料,并保持连杆的刚性,同时采用滚动接触关节表面以保持屈曲在可重复的运动路径上。安装在手指上的十五个电机通过腱驱动手部,保持了紧凑的外形和轻便的手指。在结构测试中,CRAFT提高了强度和耐久性,同时保持了可比的重复性。在远程操作中,CRAFT改善了对易碎和低摩擦物品的处理,并在Feix分类法中覆盖了33/33种抓取方式。完整设计成本低于600美元,并将以开源形式发布,包含基于视觉的远程操作和仿真集成。项目页面:http://craft-hand.github.io/
cs.RO / 42 / 2603.12185
ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control
ComFree-Sim:一种用于可扩展接触丰富机器人仿真与控制的GPU并行分析接触物理引擎
Abstract
Physics simulation for contact-rich robotics is often bottlenecked by contact resolution: mainstream engines enforce non-penetration and Coulomb friction via complementarity constraints or constrained optimization, requiring per-step iterative solves whose cost grows superlinearly with contact density. We present ComFree-Sim, a GPU-parallelized analytical contact physics engine built on complementarity-free contact modeling. ComFree-Sim computes contact impulses in closed form via an impedance-style prediction--correction update in the dual cone of Coulomb friction. Contact computation decouples across contact pairs and becomes separable across cone facets, mapping naturally to GPU kernels and yielding near-linear runtime scaling with the number of contacts. We further extend the formulation to a unified 6D contact model capturing tangential, torsional, and rolling friction, and introduce a practical dual-cone impedance heuristic. ComFree-Sim is implemented in Warp and exposed through a MuJoCo-compatible interface as a drop-in backend alternative to MuJoCo Warp (MJWarp). Experiments benchmark penetration, friction behaviors, stability, and simulation runtime scaling against MJWarp, demonstrating near-linear scaling and 2--3 times higher throughput in dense contact scenes with comparable physical fidelity. We deploy ComFree-Sim in real-time MPC for in-hand dexterous manipulation on a real-world multi-fingered LEAP hand and in dynamics-aware motion retargeting, demonstrating that low-latency simulation yields higher closed-loop success rates and enables practical high-frequency control in contact-rich tasks.
Chinese Translation
接触丰富机器人物理仿真常常受到接触解析的瓶颈限制:主流引擎通过互补约束或约束优化强制执行非穿透和库仑摩擦,这需要每一步的迭代求解,其成本随着接触密度的增加而超线性增长。我们提出了ComFree-Sim,一种基于无互补接触建模的GPU并行分析接触物理引擎。ComFree-Sim通过在库仑摩擦的对偶锥中进行阻抗式预测-修正更新,以封闭形式计算接触冲量。接触计算在接触对之间解耦,并在锥面之间可分离,自然映射到GPU内核,并在接触数量增加时实现近线性运行时扩展。我们进一步扩展了该公式,形成一个统一的6D接触模型,捕捉切向、扭转和滚动摩擦,并引入了一种实用的双锥阻抗启发式方法。ComFree-Sim在Warp中实现,并通过与MuJoCo兼容的接口暴露,作为MuJoCo Warp (MJWarp) 的替代后端。实验对比了与MJWarp的穿透、摩擦行为、稳定性和仿真运行时扩展,展示了在密集接触场景中近线性扩展和2-3倍更高的吞吐量,同时保持相当的物理逼真度。我们在真实的多指LEAP手上进行实时模型预测控制(MPC)以实现手中灵巧操作,并在动态感知的运动重定向中部署ComFree-Sim,证明低延迟仿真能够提高闭环成功率,并在接触丰富任务中实现实用的高频控制。
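The key structural property claimed above is that contact impulses are computed in closed form from local quantities and decouple across contact pairs. A heavily simplified sketch of why that maps to one-thread-per-contact parallelism; the impedance law and scalar Coulomb clamp below are placeholders for the paper's dual-cone prediction-correction update:

```python
import numpy as np

# Heavily simplified sketch of per-contact decoupling: each contact's
# impulse is computed in closed form from local quantities only, so
# contacts can be processed independently (one GPU thread per contact in
# the real engine). The impedance law and scalar Coulomb clamp are
# placeholders for the paper's dual-cone formulation.
def resolve_contact(penetration, v_normal, v_tangent, k=1e3, d=10.0, mu=0.5):
    f_n = max(k * penetration - d * v_normal, 0.0)  # predicted normal impulse
    f_t = float(np.clip(-v_tangent, -mu * f_n, mu * f_n))  # friction clamp
    return f_n, f_t

contacts = [(0.01, -0.1, 0.2), (0.0, 0.5, -1.0)]
impulses = [resolve_contact(*c) for c in contacts]  # embarrassingly parallel
```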
cs.RO / 43 / 2603.12193
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
SaPaVe:面向机器人视觉-语言-动作模型中的主动感知与操控
Abstract
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(\pi_0\), achieving up to 31.25\% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
Chinese Translation
主动感知与操控对于机器人与复杂场景的交互至关重要。现有方法在将语义驱动的主动感知与稳健的、视角不变的执行统一起来时面临挑战。我们提出了SaPaVe,一个端到端的框架,以数据高效的方式共同学习这些能力。我们的方法将相机和操控动作解耦,而不是将它们置于共享的动作空间,并采用自下而上的训练策略:我们首先在大规模数据集上训练语义相机控制,然后使用混合数据共同优化这两种动作类型。为了支持这一框架,我们引入了ActiveViewPose-200K,一个包含20万个图像-语言-相机运动对的数据集,用于语义相机运动学习,以及一个3D几何感知模块,以提高在动态视角下的执行稳健性。我们还提出了ActiveManip-Bench,这是第一个用于评估超越固定视角设置的主动操控的基准。大量在模拟和现实环境中的实验表明,SaPaVe在现实任务中超越了最近的视觉-语言-动作模型,如GR00T N1和$\pi_0$,成功率提高了多达31.25%。这些结果表明,当采用解耦但协调的策略进行训练时,紧密耦合的感知与执行能够实现高效且具有可推广性的主动操控。项目页面:https://lmzpai.github.io/SaPaVe
cs.RO / 44 / 2603.12243
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
HandelBot:通过快速适应灵巧机器人策略实现现实世界的钢琴演奏
Abstract
Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8 and requires only 30 minutes of physical interaction data.
Chinese Translation
掌握多指手的灵巧操作一直是机器人技术中的一项重大挑战。尽管其潜力巨大,但收集高质量数据的难度仍然是高精度任务的主要瓶颈。虽然强化学习和模拟到现实世界的转移提供了一个有前景的替代方案,但转移后的策略往往在需要毫米级精度的任务中失败,例如双手钢琴演奏。在本研究中,我们介绍了HandelBot,一个结合了模拟策略和通过两阶段流程快速适应的框架。我们首先从一个经过模拟训练的策略开始,应用结构化的精细化阶段,通过基于物理模拟的调整来修正空间对齐,调整侧向手指关节。接下来,我们使用残差强化学习自主学习细粒度的纠正动作。通过对五首知名歌曲进行广泛的硬件实验,我们证明了HandelBot能够成功执行精确的双手钢琴演奏。我们的系统在直接模拟部署中表现出1.8倍的优越性,并且只需30分钟的物理交互数据。
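Residual reinforcement learning as described above keeps the simulation-trained policy as the base and learns only a small corrective term on top of it. A minimal sketch, with hypothetical policies and an assumed bounded-residual parameterization:

```python
import numpy as np

# Minimal sketch of residual reinforcement learning: the deployed action
# is the simulation-trained base policy's output plus a small, bounded
# learned correction. The stand-in policies and the tanh bounding are
# hypothetical, not the paper's exact design.
def deployed_action(obs, base_policy, residual_policy, residual_scale=0.05):
    base = base_policy(obs)
    residual = residual_scale * np.tanh(residual_policy(obs))  # bounded
    return base + residual

base = lambda obs: np.array([0.3, -0.1])      # stand-in base policy
residual = lambda obs: np.array([10.0, 0.0])  # saturates to ~+1 via tanh
action = deployed_action(None, base, residual)
```

Bounding the residual keeps corrections at the fine-grained scale while the base policy supplies the bulk of the motion.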
cs.RO / 45 / 2603.12260
HumDex: Humanoid Dexterous Manipulation Made Easy
HumDex:人形机器人灵巧操作的简化实现
Abstract
This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applicability to complex whole-body tasks. To address these challenges, we introduce HumDex, a portable teleoperation system designed for humanoid whole-body dexterous manipulation. Our system leverages IMU-based motion tracking to address the portability-precision trade-off, enabling accurate full-body tracking while remaining easy to deploy. For dexterous hand control, we further introduce a learning-based retargeting method that generates smooth and natural hand motions without manual parameter tuning. Beyond teleoperation, HumDex enables efficient collection of human motion data. Building on this capability, we propose a two-stage imitation learning framework that first pre-trains on diverse human motion data to learn generalizable priors, and then fine-tunes on robot data to bridge the embodiment gap for precise execution. We demonstrate that this approach significantly improves generalization to new configurations, objects, and backgrounds with minimal data acquisition costs. The entire system is fully reproducible and open-sourced at https://github.com/physical-superintelligence-lab/HumDex.
Chinese Translation
本文研究了人形机器人全身灵巧操作,其中高质量演示数据的高效收集仍然是一个主要瓶颈。现有的遥操作系统往往受到便携性、遮挡或精度不足的限制,这妨碍了它们在复杂全身任务中的应用。为了解决这些挑战,我们引入了HumDex,一个专为人形机器人全身灵巧操作设计的便携式遥操作系统。我们的系统利用基于惯性测量单元(IMU)的运动追踪来解决便携性与精度之间的权衡,使得在保持易于部署的同时实现准确的全身追踪。为了实现灵巧的手部控制,我们进一步提出了一种基于学习的重定向方法,该方法能够生成平滑自然的手部动作,而无需手动调整参数。除了遥操作,HumDex还能够高效地收集人类运动数据。在此基础上,我们提出了一个两阶段的模仿学习框架,首先在多样的人类运动数据上进行预训练,以学习可泛化的先验知识,然后在机器人数据上进行微调,以弥合体现差距,实现精确执行。我们展示了这种方法显著提高了对新配置、物体和背景的泛化能力,同时数据获取成本最低。整个系统完全可复现,并已开源于 https://github.com/physical-superintelligence-lab/HumDex。
cs.RO / 46 / 2603.12263
$\Psi_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
$\Psi_0$: 一个面向通用类人机器人运动操控的开放基础模型
Abstract
We introduce $\Psi_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $\Psi_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: First, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations. Then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $\Psi_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10$\times$ as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
Chinese Translation
我们介绍了$\Psi_0$(Psi-Zero),这是一个开放基础模型,旨在解决具有挑战性的类人机器人运动操控任务。虽然现有方法通常试图通过在大量多样的人类和类人数据上进行共同训练来解决这一基本问题,但我们认为这种策略由于人类与类人机器人之间的基本运动学和运动差异而并不理想。因此,尽管数据量可观,数据效率和模型性能仍然不尽如人意。为了解决这一挑战,$\Psi_0$将学习过程解耦,以最大化异构数据源的效用。具体而言,我们提出了一种分阶段的训练范式,具有不同的学习目标:首先,我们在大规模自我中心人类视频上自回归地预训练一个视觉语言模型(VLM)主干,以获取可泛化的视觉-动作表示。然后,我们在高质量的类人机器人数据上进行后训练,以学习精确的机器人关节控制。我们的研究进一步识别出一个关键但常被忽视的数据配方:与依赖于嘈杂的互联网剪辑或异构跨体现机器人数据集的方法相比,我们证明了在高质量自我中心人类操控数据上进行预训练,然后在特定领域的真实类人轨迹上进行后训练,能够获得更优的性能。大量的真实世界实验表明,$\Psi_0$在仅使用约800小时的人类视频数据和30小时的真实机器人数据的情况下,达到了最佳性能,在多个任务中超越了在超过10倍数据上进行预训练的基线,整体成功率提高了40%以上。我们将向社区开源整个生态系统,包括数据处理和训练管道、类人基础模型以及实时动作推理引擎。
cs.CV / 1 / 2603.11106
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation
RC-NF:用于机器人操作实时异常检测的机器人条件化归一化流
Abstract
Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
Chinese Translation
近年来,视觉-语言-动作(VLA)模型的进展使得机器人能够执行越来越复杂的任务。然而,通过模仿学习训练的VLA模型在动态环境中往往难以可靠运行,并且在分布外(OOD)条件下常常失败。为了解决这一问题,我们提出了机器人条件化归一化流(RC-NF),这是一种用于机器人异常检测和干预的实时监测模型,确保机器人状态与物体运动轨迹与任务相一致。RC-NF在归一化流中解耦了任务感知的机器人和物体状态的处理。它仅需正样本进行无监督训练,并通过概率密度函数在推理过程中计算准确的机器人异常分数。我们进一步提出了LIBERO-Anomaly-10,这是一个包含三类机器人异常的基准,用于仿真评估。与之前的监测机器人任务的方法相比,RC-NF在所有异常类型上均实现了最先进的性能。现实世界的实验表明,RC-NF作为VLA模型(例如,pi0)的即插即用模块运行,提供实时的OOD信号,在必要时能够实现状态级回滚或任务级重新规划,响应延迟低于100毫秒。这些结果表明,RC-NF显著增强了基于VLA的机器人系统在动态环境中的鲁棒性和适应性。
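The core mechanism described above, training on positive samples only and scoring anomalies from the learned density, can be illustrated with a toy one-dimensional affine flow standing in for the full robot-conditioned model:

```python
import numpy as np

# Toy one-dimensional affine flow standing in for RC-NF: fit on positive
# (normal) samples only, then score new states by negative log-density.
# The affine map z = (x - mu) / sigma is the simplest normalizing flow;
# the real model conditions on robot and object states.
class ToyAffineFlow:
    def fit(self, x):
        self.mu, self.sigma = x.mean(), x.std()
        return self

    def log_prob(self, x):
        z = (x - self.mu) / self.sigma
        # Change of variables: standard-normal log-density minus log|dx/dz|.
        return -0.5 * (z ** 2 + np.log(2 * np.pi)) - np.log(self.sigma)

rng = np.random.default_rng(2)
flow = ToyAffineFlow().fit(rng.normal(0.0, 1.0, 5000))
score_normal = -flow.log_prob(0.1)  # in-distribution: low anomaly score
score_ood = -flow.log_prob(6.0)     # far out-of-distribution: high score
```

Thresholding the score yields the real-time OOD signal used to trigger rollback or replanning.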
cs.CV / 2 / 2603.11174
GGPT: Geometry Grounded Point Transformer
GGPT:基于几何的点变换器
Abstract
Recent feed-forward networks have achieved remarkable progress in sparse-view 3D reconstruction by predicting dense point maps directly from RGB images. However, they often suffer from geometric inconsistencies and limited fine-grained accuracy due to the absence of explicit multi-view constraints. We introduce the Geometry-Grounded Point Transformer (GGPT), a framework that augments feed-forward reconstruction with reliable sparse geometric guidance. We first propose an improved Structure-from-Motion pipeline based on dense feature matching and lightweight geometric optimisation to efficiently estimate accurate camera poses and partial 3D point clouds from sparse input views. Building on this foundation, we propose a geometry-guided 3D point transformer that refines dense point maps under explicit partial-geometry supervision using an optimised guidance encoding. Extensive experiments demonstrate that our method provides a principled mechanism for integrating geometric priors with dense feed-forward predictions, producing reconstructions that are both geometrically consistent and spatially complete, recovering fine structures and filling gaps in textureless areas. Trained solely on ScanNet++ with VGGT predictions, GGPT generalises across architectures and datasets, substantially outperforming state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings.
Chinese Translation
最近的前馈网络在稀疏视图的三维重建中取得了显著进展,通过直接从RGB图像预测密集点图。然而,由于缺乏明确的多视图约束,它们往往面临几何不一致和有限的细粒度精度的问题。我们提出了几何基础点变换器(Geometry-Grounded Point Transformer,GGPT),这是一个通过可靠的稀疏几何指导增强前馈重建的框架。我们首先提出了一种改进的运动结构(Structure-from-Motion)流程,基于密集特征匹配和轻量级几何优化,来高效估计准确的相机位姿和来自稀疏输入视图的部分三维点云。在此基础上,我们提出了一种几何引导的三维点变换器,利用优化的引导编码在明确的部分几何监督下细化密集点图。大量实验表明,我们的方法为将几何先验与密集前馈预测相结合提供了一种原则性机制,生成的重建在几何上是一致的,并且在空间上是完整的,能够恢复细微结构并填补无纹理区域的空白。GGPT仅在ScanNet++和VGGT预测上进行训练,能够跨架构和数据集进行泛化,在领域内和领域外的设置中显著超越了最先进的前馈三维重建模型。
cs.CV / 3 / 2603.11206
Evidential learning driven Breast Tumor Segmentation with Stage-divided Vision-Language Interaction
基于证据学习的阶段划分视觉-语言交互驱动的乳腺肿瘤分割
Abstract
Breast cancer is one of the most common causes of death among women worldwide, with millions of fatalities annually. Magnetic Resonance Imaging (MRI) provides various sequences for characterizing tumor morphology and internal patterns, and has become an effective tool for the detection and diagnosis of breast tumors. However, previous deep-learning-based tumor segmentation methods have limited accuracy in locating tumor contours, owing to the low contrast between cancerous and normal areas and to blurred boundaries. Leveraging text prompt information holds promise for improving tumor segmentation by delineating segmentation regions. Inspired by this, we propose a text-guided Breast Tumor Segmentation model (TextBCS) with stage-divided vision-language interaction and evidential learning. Specifically, the proposed stage-divided vision-language interaction facilitates mutual information exchange between visual and text features at each down-sampling stage, further exploiting the advantages of text prompts to assist in locating lesion areas in low-contrast scenarios. Moreover, evidential learning is adopted to quantify the model's segmentation uncertainty at blurred boundaries. It utilizes a variational Dirichlet distribution to characterize the distribution of segmentation probabilities, addressing the segmentation uncertainty of the boundaries. Extensive experiments validate the superiority of TextBCS over other segmentation networks, showcasing the best breast tumor segmentation performance on publicly available datasets.
Chinese Translation
乳腺癌是全球女性最常见的死亡原因之一,每年导致数百万例死亡。磁共振成像(MRI)能够提供多种序列以表征肿瘤形态和内部特征,成为检测和诊断乳腺肿瘤的有效工具。然而,以往基于深度学习的肿瘤分割方法在准确定位肿瘤轮廓方面存在局限性,主要由于癌症与正常区域之间的低对比度和模糊边界的挑战。利用文本提示信息有望通过描绘分割区域来改善肿瘤分割效果。受到这一启发,我们提出了文本引导的乳腺肿瘤分割模型(TextBCS),其具有阶段划分的视觉-语言交互和证据学习。具体而言,所提出的阶段划分视觉-语言交互在每个下采样阶段促进视觉特征与文本特征之间的信息互通,进一步发挥文本提示的优势,帮助在低对比度场景中定位病变区域。此外,采用证据学习来量化模型在模糊边界下的分割不确定性。该方法利用变分狄利克雷(variational Dirichlet)来表征分割概率的分布,解决边界分割的不确定性。大量实验验证了我们的TextBCS相较于其他分割网络的优越性,展示了在公开数据集上最佳的乳腺肿瘤分割性能。
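The variational-Dirichlet uncertainty described above has a simple closed form in standard evidential learning: class evidence e gives Dirichlet parameters alpha = e + 1, belief masses e_k / S, and vacuity uncertainty K / S. A per-pixel sketch under that standard formulation, which may differ in detail from TextBCS:

```python
import numpy as np

# Per-pixel sketch of evidential uncertainty under the standard
# evidential-learning formulation (possibly differing in detail from
# TextBCS): evidence e gives Dirichlet parameters alpha = e + 1, belief
# masses e_k / S, and vacuity uncertainty u = K / S, so low-evidence
# pixels (e.g. blurred boundaries) score high.
def dirichlet_uncertainty(evidence):
    alpha = evidence + 1.0
    S = alpha.sum()              # Dirichlet strength
    K = alpha.shape[0]           # number of classes
    return evidence / S, K / S   # (belief masses, uncertainty)

_, u_confident = dirichlet_uncertainty(np.array([50.0, 1.0]))  # clear pixel
_, u_boundary = dirichlet_uncertainty(np.array([1.0, 1.0]))    # ambiguous
```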
cs.CV / 4 / 2603.11211
A Simple and Efficient Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters
通过具有非线性多适配器的视觉-语言模型实现简单高效的增量学习框架
Abstract
Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).
Chinese Translation
增量学习(IL)旨在在保留先前获得知识的同时学习新任务。将预训练视觉-语言模型的零样本学习能力整合到增量学习方法中已标志着一个重要的进展。然而,这些方法面临三个主要挑战:(1)需要提高训练效率;(2)依赖于内存库存储先前数据;(3)需要强大的主干网络以增强模型的能力。本文提出了SimE,一个简单高效的框架,采用专门为增量学习任务设计的适配器的视觉-语言模型。我们报告了一个显著现象:适配器连接数量与模型的增量学习能力之间存在非线性相关性。虽然在变换器块之间增加适配器连接可以提高模型性能,但在较小的增量步骤中,增加变换器块内的适配器连接并不会增强模型的增量学习能力,甚至可能降低其能力。大量实验结果表明,SimE在TinyImageNet上超越传统方法9.6%,在CIFAR-100上优于其他基于CLIP的方法5.3%。此外,我们还进行了系统研究,以增强CLIP的零样本能力的利用。我们建议用在更大数据集(例如LAION2B)和更强架构(例如ViT-L/14)上训练的CLIP模型替换SimE的编码器。
cs.CV / 5 / 2603.11219
Senna-2: Aligning VLM and End-to-End Driving Policy for Consistent Decision Making and Planning
Senna-2:对齐视觉语言模型与端到端驾驶策略以实现一致的决策和规划
Abstract
Vision-language models (VLMs) enhance the planning capability of end-to-end (E2E) driving policies by leveraging high-level semantic reasoning. However, existing approaches often overlook the dual-system consistency between the VLM's high-level decisions and the E2E policy's low-level planning. As a result, the generated trajectories may misalign with the intended driving decisions, weakening the system's top-down guidance and decision-following ability. To address this issue, we propose Senna-2, an advanced VLM-E2E driving policy that explicitly aligns the two systems for consistent decision-making and planning. Our method follows a consistency-oriented three-stage training paradigm. In the first stage, we conduct driving pre-training to achieve preliminary decision-making and planning, with a decision adapter transmitting VLM decisions to the E2E policy in the form of implicit embeddings. In the second stage, we align the VLM and the E2E policy in an open-loop setting. In the third stage, we perform closed-loop alignment via bottom-up Hierarchical Reinforcement Learning in 3DGS environments to reinforce safety and efficiency. Extensive experiments demonstrate that Senna-2 achieves superior dual-system consistency (19.3% F1 score improvement) and significantly enhances driving safety in both open-loop (5.7% FDE reduction) and closed-loop settings (30.6% AF-CR reduction).
Chinese Translation
视觉语言模型(VLM)通过利用高层语义推理增强了端到端(E2E)驾驶策略的规划能力。然而,现有方法往往忽视了VLM的高层决策与E2E的低层规划之间的双系统一致性。因此,生成的轨迹可能与预期的驾驶决策不一致,从而削弱了系统的自上而下指导和决策跟随能力。为了解决这一问题,我们提出了Senna-2,一种先进的VLM-E2E驾驶策略,明确对齐这两个系统以实现一致的决策和规划。我们的方法遵循以一致性为导向的三阶段训练范式。在第一阶段,我们进行驾驶预训练,以实现初步的决策和规划,决策适配器以隐式嵌入的形式将VLM决策传递给E2E策略。在第二阶段,我们在开放环路设置中对齐VLM和E2E策略。在第三阶段,我们通过在3DGS环境中进行自下而上的层次强化学习实现闭环对齐,以增强安全性和效率。大量实验表明,Senna-2在双系统一致性方面取得了优越的表现(F1得分提高19.3%),并显著提高了开放环路(FDE减少5.7%)和闭环设置(AF-CR减少30.6%)下的驾驶安全性。
cs.CV / 6 / 2603.11220
Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
频率调制视觉恢复用于套娃大型多模态模型
Abstract
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling the number of visual tokens to be elastically adjusted during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be released.
Chinese Translation
大型多模态模型(LMMs)因视觉标记数量众多而难以适应不同的计算预算。之前的方法试图在大型语言模型(LLMs)之前或之中减少视觉标记的数量。然而,这些策略不可避免地导致视觉语义的丧失。为了解决这些问题,我们提出了FMVR,一种即插即用且极其简单的频率调制视觉恢复策略,以增强LMMs在视觉标记减少情况下的推理能力。具体而言,FMVR通过AvgPool和MaxPool将较少的视觉标记的视觉表示解耦为低频和高频成分。随后,利用轻量级可学习参数对导出的频率进行调制。来自AvgPool的高频成分作为显著性滤波器,以增强显著性视觉语义,而来自MaxPool的低频成分则作为反显著性滤波器,以增强弱视觉语义。它使得由少量视觉标记主导的视觉语义得以保留,并恢复稀释的视觉语义。此外,我们将FMVR注入到套娃表示学习中,以学习粗到细的视觉标记集,从而在推理过程中灵活调整视觉标记的数量,同时保持相当的性能。在10个基于图像和4个基于视频的基准测试中的实验表明,FMVR-LLaVA将LLaVA-1.5-7B的FLOPs减少了89%,同时保持几乎100%的原始准确率。代码将公开。
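The pooling-based frequency split can be sketched generically. This is not FMVR's documented implementation: the mean/residual decomposition and the scalar gains below are stand-ins for its AvgPool/MaxPool branches and learnable modulation parameters:

```python
import numpy as np

def frequency_modulate(tokens, alpha=1.0, beta=1.0):
    """Hypothetical sketch of a frequency-style split of token features:
    a pooled, slowly-varying component plus the residual detail component,
    each rescaled by a learnable gain (here plain scalars). FMVR's actual
    AvgPool/MaxPool design is not specified in the abstract."""
    low = tokens.mean(axis=0, keepdims=True)   # pooled, smooth component
    high = tokens - low                        # residual, detail component
    return alpha * low + beta * high

x = np.random.default_rng(0).normal(size=(64, 256))   # 64 tokens, dim 256
y = frequency_modulate(x, alpha=1.0, beta=1.0)
assert np.allclose(y, x)                  # unit gains reconstruct the input
boosted = frequency_modulate(x, alpha=1.0, beta=1.5)  # amplify fine detail
assert boosted.shape == x.shape
```

The key property this illustrates is losslessness at unit gains: because the two bands sum back to the input, the learnable gains can only re-weight semantics, never destroy them, which matches the "restoration" framing of the abstract.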
cs.CV / 7 / 2603.11246
When Slots Compete: Slot Merging in Object-Centric Learning
当槽位竞争:面向对象学习中的槽位合并
Abstract
Slot-based object-centric learning represents an image as a set of latent slots with a decoder that combines them into an image or features. The decoder specifies how slots are combined into an output, but the slot set is typically fixed: the number of slots is chosen upfront and slots are only refined. This can lead to multiple slots competing for overlapping regions of the same entity rather than focusing on distinct regions. We introduce slot merging: a drop-in, lightweight operation on the slot set that merges overlapping slots during training. We quantify overlap with a Soft-IoU score between slot-attention maps and combine selected pairs via a barycentric update that preserves gradient flow. Merging follows a fixed policy, with the decision threshold inferred from overlap statistics, requiring no additional learnable modules. Integrated into the established feature-reconstruction pipeline of DINOSAUR, the proposed method improves object factorization and mask quality, surpassing other adaptive methods in object discovery and segmentation benchmarks.
Chinese Translation
基于槽位的面向对象学习将图像表示为一组潜在槽位,并通过解码器将其组合成图像或特征。解码器指定了槽位如何组合成输出,但槽位集通常是固定的:槽位的数量在前期选择,并且槽位仅进行细化。这可能导致多个槽位竞争同一实体的重叠区域,而不是专注于不同的区域。我们引入了槽位合并:一种在训练过程中对槽位集进行的轻量级操作,能够合并重叠的槽位。我们通过槽位注意力图之间的Soft-IoU分数量化重叠,并通过保持梯度流的重心更新组合选定的槽位对。合并遵循固定策略,决策阈值由重叠统计推断而来,无需额外的可学习模块。将该方法集成到DINOSAUR的既定特征重建流程中,显著改善了对象因子化和掩膜质量,在对象发现和分割基准测试中超越了其他自适应方法。
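The Soft-IoU test and barycentric update translate directly into a small routine. The greedy single-pass policy and the fixed threshold below are simplifications for illustration; the paper infers its decision threshold from overlap statistics:

```python
import numpy as np

def soft_iou(a, b, eps=1e-8):
    """Soft-IoU between two nonnegative attention maps of equal shape."""
    return np.minimum(a, b).sum() / (np.maximum(a, b).sum() + eps)

def merge_overlapping_slots(slots, attn, threshold=0.5):
    """Greedy sketch of slot merging: any pair of slots whose attention
    maps overlap above `threshold` is replaced by a barycentric
    (attention-mass-weighted) combination, and their maps are summed."""
    slots, attn = list(slots), list(attn)
    i = 0
    while i < len(slots):
        j = i + 1
        while j < len(slots):
            if soft_iou(attn[i], attn[j]) > threshold:
                wi, wj = attn[i].sum(), attn[j].sum()
                slots[i] = (wi * slots[i] + wj * slots[j]) / (wi + wj)
                attn[i] = attn[i] + attn[j]
                del slots[j], attn[j]
            else:
                j += 1
        i += 1
    return np.stack(slots), np.stack(attn)

# Three slots over four spatial positions: two compete for the same
# region, one attends to a disjoint region.
attn = np.array([[0.9, 0.8, 0.0, 0.0],
                 [0.8, 0.9, 0.1, 0.0],
                 [0.0, 0.0, 0.9, 0.9]])
slots = np.eye(3)
merged_slots, merged_attn = merge_overlapping_slots(slots, attn)
assert merged_slots.shape[0] == 2   # the competing pair collapsed into one
```

In training, the barycentric average (rather than discarding one slot) is what preserves gradient flow to both parents of the merged slot.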
cs.CV / 8 / 2603.11252
Radiometric fingerprinting of object surfaces using mobile laser scanning and semantic 3D road space models
利用移动激光扫描和语义3D道路空间模型对物体表面进行辐射特征识别
Abstract
Although semantic 3D city models are internationally available and becoming increasingly detailed, the incorporation of material information remains largely unexplored. However, a structured representation of materials and their physical properties could substantially broaden the application spectrum and analytical capabilities of urban digital twins. At the same time, the growing number of repeated mobile laser scans of cities and their street spaces yields a wealth of observations influenced by the material characteristics of the corresponding surfaces. To leverage this information, we propose radiometric fingerprints of object surfaces, formed by grouping LiDAR observations reflected from the same semantic object under varying distances, incident angles, environmental conditions, sensors, and scanning campaigns. Our study demonstrates how 312.4 million individual beams acquired across four campaigns using five LiDAR sensors on the Audi Autonomous Driving Dataset (A2D2) vehicle can be automatically associated with 6368 individual objects of the semantic 3D city model. The model comprises a comprehensive and semantic representation of four inner-city streets at Level of Detail (LOD) 3 with centimeter-level accuracy. It is based on the CityGML 3.0 standard and enables fine-grained sub-differentiation of objects. The extracted radiometric fingerprints for object surfaces reveal recurring intra-class patterns that indicate class-dominant materials. The semantic model, the method implementations, and the developed geodatabase solution 3DSensorDB are released at: https://github.com/tum-gis/sensordb
Chinese Translation
尽管语义3D城市模型在国际上已广泛可用且日益详细,但材料信息的纳入仍然未被充分利用。然而,材料及其物理特性的结构化表示可以显著拓宽城市数字双胞胎的应用范围和分析能力。同时,城市及其街道空间的重复移动激光扫描数量不断增加,产生了大量受相应表面材料特性影响的观测数据。为了利用这些信息,我们提出通过将来自同一语义对象的LiDAR观测按不同距离、入射角、环境条件、传感器和扫描活动进行分组,来生成物体表面的辐射特征。我们的研究展示了如何将通过五个LiDAR传感器在Audi Autonomous Driving Dataset (A2D2)车辆上进行的四次扫描活动中获取的3.124亿个单独光束,自动关联到6368个语义3D城市模型的个体对象。该模型包含了四条市内街道的全面且语义化的表示,具有厘米级精度,符合CityGML 3.0标准,并允许对对象进行细粒度的子区分。提取的物体表面辐射特征揭示了表内类模式的重复出现,指示出类主导材料。语义模型、方法实现以及开发的地理数据库解决方案3DSensorDB已发布于:https://github.com/tum-gis/sensordb
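The grouping step can be sketched as follows; the fingerprint here only bins return intensity by incidence angle, whereas the paper additionally stratifies by distance, sensor, environmental conditions, and campaign:

```python
import numpy as np

def radiometric_fingerprints(object_ids, incidence_deg, intensity):
    """Sketch: group LiDAR returns by the semantic object they were
    reflected from and summarize mean intensity per incidence-angle bin.
    Bin edges and the angle-only stratification are illustrative choices."""
    edges = np.array([30.0, 60.0])   # three bins: <30, 30-60, >=60 degrees
    fingerprints = {}
    for oid in np.unique(object_ids):
        m = object_ids == oid
        b = np.digitize(incidence_deg[m], edges)
        fingerprints[oid] = np.array(
            [intensity[m][b == k].mean() if np.any(b == k) else np.nan
             for k in range(3)])
    return fingerprints

# Five returns hitting two semantic objects:
ids = np.array([1, 1, 1, 2, 2])
ang = np.array([10.0, 15.0, 70.0, 40.0, 50.0])
inten = np.array([0.8, 0.6, 0.2, 0.5, 0.7])
fp = radiometric_fingerprints(ids, ang, inten)
assert np.isclose(fp[1][0], 0.7)  # object 1, near-normal bin: mean(0.8, 0.6)
assert np.isclose(fp[2][1], 0.6)  # object 2, mid-angle bin: mean(0.5, 0.7)
```

Recurring intra-class patterns then correspond to distinct objects of the same semantic class producing similar fingerprint vectors.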
cs.CV / 9 / 2603.11257
Towards Automated Initial Probe Placement in Transthoracic Teleultrasound Using Human Mesh and Skeleton Recovery
基于人类网格和骨骼恢复的经胸远程超声自动初始探头放置研究
Abstract
Cardiac and lung ultrasound are technically demanding because operators must identify patient-specific intercostal acoustic windows and then navigate between standard views by adjusting probe position, rotation, and force across different imaging planes. These challenges are amplified in teleultrasound when a novice or robot faces the difficult task of first placing the probe on the patient without in-person expert assistance. We present a framework for automating Patient registration and anatomy-informed Initial Probe placement Guidance (PIPG) using only RGB images from a calibrated camera. The novice first captures the patient using the camera on a mixed reality (MR) head-mounted display (HMD). An edge server then infers a patient-specific body-surface and skeleton model, with spatial smoothing across multiple views. Using bony landmarks from the predicted skeleton, we estimate the intercostal region and project the guidance back onto the reconstructed body surface. To validate the framework, we overlaid the reconstructed body mesh and the virtual probe pose guidance across multiple transthoracic echocardiography scan planes in situ and measured the quantitative placement error. Pilot experiments with healthy volunteers suggest that the proposed probe placement prediction and MR guidance yield consistent initial placement within anatomical variability acceptable for teleultrasound setup.
Chinese Translation
心脏和肺部超声技术要求高,因为操作人员必须识别特定于患者的肋间声学窗口,然后通过在不同成像平面上调整探头的位置、旋转和施加的力来在标准视图之间导航。当新手或机器人在没有现场专家协助的情况下首次将探头放置在患者身上时,这些挑战在远程超声中更加突出。我们提出了一种框架,用于自动化患者注册和基于解剖信息的初始探头放置指导(Patient registration and anatomy-informed Initial Probe placement Guidance, PIPG),仅使用来自校准相机的RGB图像。新手首先使用混合现实(Mixed Reality, MR)头戴显示器(HMD)上的相机捕捉患者。然后,边缘服务器推断出特定于患者的体表和骨骼模型,并在多个视图之间进行空间平滑。利用预测骨骼中的骨性标志点,我们估计肋间区域并将指导投影回重建的体表。为了验证该框架,我们在多个经胸超声心动图扫描平面上叠加了重建的体网格和虚拟探头姿态指导,并测量了定量放置误差。对健康志愿者的初步实验表明,所提出的探头放置预测和MR指导在解剖变异性可接受的范围内实现了一致的初始放置,适用于远程超声设置。
cs.CV / 10 / 2603.11298
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction
InstantHDR:用于高动态范围三维重建的单次前向高斯点云渲染
Abstract
High dynamic range (HDR) novel view synthesis (NVS) aims to reconstruct HDR scenes from multi-exposure low dynamic range (LDR) images. Existing HDR pipelines heavily rely on known camera poses, well-initialized dense point clouds, and time-consuming per-scene optimization. Current feed-forward alternatives overlook the HDR problem by assuming exposure-invariant appearance. To bridge this gap, we propose InstantHDR, a feed-forward network that reconstructs 3D HDR scenes from uncalibrated multi-exposure LDR collections in a single forward pass. Specifically, we design a geometry-guided appearance modeling for multi-exposure fusion, and a meta-network for generalizable scene-specific tone mapping. Due to the lack of HDR scene data, we build a pre-training dataset, called HDR-Pretrain, for generalizable feed-forward HDR models, featuring 168 Blender-rendered scenes, diverse lighting types, and multiple camera response functions. Comprehensive experiments show that our InstantHDR delivers comparable synthesis performance to the state-of-the-art optimization-based HDR methods while enjoying $\sim700\times$ and $\sim20\times$ reconstruction speed improvement with our single-forward and post-optimization settings. All code, models, and datasets will be released after the review process.
Chinese Translation
高动态范围(HDR)新视角合成(NVS)旨在从多曝光低动态范围(LDR)图像中重建HDR场景。现有的HDR处理流程严重依赖已知的相机位姿、良好初始化的稠密点云以及耗时的逐场景优化。当前的前馈替代方案通过假设曝光不变外观来忽视HDR问题。为了解决这一问题,我们提出了InstantHDR,一种前馈网络,可以在单次前向传递中从未校准的多曝光LDR集合中重建三维HDR场景。具体而言,我们设计了一种基于几何引导的外观建模用于多曝光融合,以及一个用于可泛化场景特定色调映射的元网络。由于缺乏HDR场景数据,我们构建了一个名为HDR-Pretrain的预训练数据集,旨在为可泛化的前馈HDR模型提供支持,数据集包含168个Blender渲染的场景、多样的光照类型和多种相机响应函数。全面的实验表明,我们的InstantHDR在合成性能上与最先进的基于优化的HDR方法相当,同时在我们的单次前向和后优化设置中实现了约700倍和约20倍的重建速度提升。所有代码、模型和数据集将在审稿过程结束后发布。
cs.CV / 11 / 2603.11306
Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild
针对野外环境中鲁棒的多模态动作单元检测的层次粒度对齐与状态空间建模
Abstract
Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local active patches. Furthermore, we overcome the receptive field limitations of conventional temporal convolutional networks by introducing a Vision-Mamba architecture. This approach enables temporal modeling with O(N) linear complexity, effectively capturing ultra-long-range dynamics without performance degradation. A novel asymmetric cross-attention mechanism is also introduced to deeply synchronize paralinguistic audio cues with subtle visual movements. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate that our approach significantly outperforms existing baselines, achieving state-of-the-art performance. Notably, this framework secured top rankings in the AU Detection track of the 10th Affective Behavior Analysis in-the-wild Competition.
Chinese Translation
在野外环境中,面部动作单元(AU)检测仍然是一项艰巨的挑战,主要由于严重的时空异质性、无约束的姿态以及复杂的音视频依赖关系。尽管最近的多模态方法取得了一定进展,但它们通常依赖于容量有限的编码器和浅层融合机制,无法捕捉细粒度的语义变化和超长时间上下文。为了解决这一问题,我们提出了一种新颖的多模态框架,基于层次粒度对齐和状态空间模型。具体而言,我们利用强大的基础模型,即 DINOv2 和 WavLM,提取鲁棒且高保真的视觉和音频表示,有效替代传统特征提取器。为了处理极端的面部变化,我们的层次粒度对齐模块动态地将全局面部语义与细粒度的局部活跃补丁对齐。此外,我们通过引入 Vision-Mamba 架构克服了传统时间卷积网络的感受野限制。这种方法以 O(N) 的线性复杂度实现时间建模,有效捕捉超长范围的动态而不降低性能。我们还引入了一种新颖的非对称交叉注意力机制,以深入同步副语言音频线索与细微的视觉运动。在具有挑战性的 Aff-Wild2 数据集上的大量实验表明,我们的方法显著优于现有基准,达到了最先进的性能。值得注意的是,该框架在第十届野外情感行为分析竞赛的 AU 检测赛道中获得了最高排名。
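The asymmetric cross-attention can be illustrated with plain scaled dot-product attention. The direction shown (visual tokens querying audio frames) and all shapes are assumptions for illustration, since the abstract does not fix the module layout:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

# "Asymmetric" here means one modality queries the other without a
# symmetric reverse path: visual tokens pull paralinguistic cues from
# the audio stream via a residual update.
rng = np.random.default_rng(0)
visual = rng.normal(size=(49, 128))    # e.g. 7x7 visual patch tokens
audio = rng.normal(size=(100, 128))    # e.g. 100 audio-frame embeddings
fused_visual = visual + cross_attention(visual, audio, audio)
assert fused_visual.shape == (49, 128)
```

Note that the fused sequence keeps the visual token count, so downstream per-AU heads are unaffected by the (possibly much longer) audio sequence length.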
cs.CV / 12 / 2603.11320
UniCompress: Token Compression for Unified Vision-Language Understanding and Generation
UniCompress:用于统一视觉-语言理解与生成的令牌压缩
Abstract
Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and cross-modal synergy, which facilitates shared parameterization, consistent training objectives, and seamless transfer between modalities. However, the large number of visual tokens required by such models introduces substantial computation and memory overhead, and this inefficiency directly hinders deployment in resource constrained scenarios such as embodied AI systems. In this work, we propose a unified token compression algorithm UniCompress that significantly reduces visual token count while preserving performance on both image understanding and generation tasks. Our method introduces a plug-in compression and decompression mechanism guided with learnable global meta tokens. The framework is lightweight and modular, enabling efficient integration into existing models without full retraining. Experimental results show that our approach reduces image tokens by up to 4 times, achieves substantial gains in inference latency and training cost, and incurs only minimal performance degradation, which demonstrates the promise of token-efficient unified modeling for real world multimodal applications.
Chinese Translation
统一模型旨在通过将图像编码为离散令牌,并在单一自回归框架内与文本一起处理,从而支持理解和生成。这种统一设计提供了架构上的简洁性和跨模态的协同效应,促进了共享参数化、一致的训练目标以及模态之间的无缝转移。然而,这类模型所需的大量视觉令牌引入了显著的计算和内存开销,这种低效直接阻碍了在资源受限场景(如具身人工智能系统)中的部署。在本研究中,我们提出了一种统一的令牌压缩算法UniCompress,该算法在保持图像理解和生成任务性能的同时显著减少了视觉令牌的数量。我们的方法引入了一种插件式的压缩和解压机制,采用可学习的全局元令牌进行引导。该框架轻量且模块化,能够高效地集成到现有模型中,而无需进行全面的重新训练。实验结果表明,我们的方法将图像令牌数量减少了多达4倍,在推理延迟和训练成本上取得了显著的提升,并且仅产生了最小的性能下降,这展示了令牌高效统一建模在现实世界多模态应用中的潜力。
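The meta-token mechanism can be sketched as two cross-attention passes. The shapes, the untrained random "weights", and the token counts are illustrative; only the 4x ratio echoes the abstract's best-case reduction:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

# Compression: M learnable meta tokens attend over N visual tokens to
# produce an M-token summary. Decompression: N learnable queries attend
# over the summary to restore a full-length sequence. Both modules are
# plug-ins around a frozen backbone in this sketch.
rng = np.random.default_rng(0)
N, M, D = 576, 144, 64                 # 4x token reduction
visual_tokens = rng.normal(size=(N, D))
meta_tokens = rng.normal(size=(M, D))  # learnable global meta tokens
pos_queries = rng.normal(size=(N, D))  # learnable decompression queries

compressed = attend(meta_tokens, visual_tokens, visual_tokens)   # (M, D)
restored = attend(pos_queries, compressed, compressed)           # (N, D)
assert compressed.shape == (M, D) and restored.shape == (N, D)
assert N // compressed.shape[0] == 4   # the LLM only ever sees M tokens
```

The LLM's attention cost then scales with M rather than N, which is where the reported latency and training-cost gains would come from.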
cs.CV / 13 / 2603.11323
UNet-AF: An alias-free UNet for image restoration
UNet-AF:一种无别名的 UNet 用于图像恢复
Abstract
The simplicity and effectiveness of the UNet architecture make it ubiquitous in image restoration, image segmentation, and diffusion models. UNets are often assumed to be equivariant to translations, yet they traditionally consist of layers that are known to be prone to aliasing, which hinders their equivariance in practice. To overcome this limitation, we propose a new alias-free UNet designed from a careful selection of state-of-the-art translation-equivariant layers. We evaluate the proposed equivariant architecture against non-equivariant baselines on image restoration tasks and observe competitive performance with a significant increase in measured equivariance. Through extensive ablation studies, we also demonstrate that each change is crucial for its empirical equivariance. Our implementation is available at https://github.com/jscanvic/UNet-AF
Chinese Translation
UNet 架构的简单性和有效性使其在图像恢复、图像分割和扩散模型中广泛应用。尽管通常假设其对平移具有等变性,但其传统结构中包含的层往往容易产生别名现象,这在实践中妨碍了其等变性。为克服这一限制,我们提出了一种新的无别名 UNet,旨在通过精心选择最先进的平移等变层来设计。我们在图像恢复任务中将所提等变架构与非等变基线进行评估,观察到其具有竞争力的性能,并在测量的等变性上显著提升。通过广泛的消融研究,我们还证明了每一项改动对其实证等变性至关重要。我们的实现可在 https://github.com/jscanvic/UNet-AF 获取。
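The equivariance that UNet-AF targets is directly measurable: apply an operator before and after a translation and compare. A sketch with a circularly padded blur (equivariant) and a naive stride-2 downsample (where aliasing breaks equivariance); this mirrors the spirit of the paper's measured-equivariance evaluation, not its exact metric:

```python
import numpy as np

def circular_blur(img, k=3):
    """A translation-equivariant reference op: k x k box blur with
    circular padding, implemented via shifted sums."""
    out = np.zeros_like(img)
    for dy in range(-(k // 2), k // 2 + 1):
        for dx in range(-(k // 2), k // 2 + 1):
            out += np.roll(img, (dy, dx), axis=(0, 1))
    return out / k ** 2

def equivariance_error(op, img, shift=(5, 3)):
    """max |op(shift(x)) - shift(op(x))|: zero iff op commutes with the shift."""
    a = op(np.roll(img, shift, axis=(0, 1)))
    b = np.roll(op(img), shift, axis=(0, 1))
    return float(np.abs(a - b).max())

img = np.random.default_rng(0).normal(size=(32, 32))
assert equivariance_error(circular_blur, img) < 1e-9   # conv commutes with shifts

# A stride-2 subsample (followed by nearest-neighbor upsampling back to the
# input size, so shapes match) does not commute with odd shifts -- the
# aliasing failure mode that alias-free layers are designed to avoid:
downsample = lambda x: np.kron(x[::2, ::2], np.ones((2, 2)))
assert equivariance_error(downsample, img) > 0.1
```

Under circular shifts this check is exact; real images need boundary handling, which is one reason measured equivariance in practice falls short of the theoretical ideal.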
cs.CV / 14 / 2603.11325
Towards Trustworthy Selective Generation: Reliability-Guided Diffusion for Ultra-Low-Field to High-Field MRI Synthesis
迈向可信的选择性生成:用于超低场到高场MRI合成的可靠性引导扩散
Abstract
Low-field to high-field MRI synthesis has emerged as a cost-effective strategy to enhance image quality under hardware and acquisition constraints, particularly in scenarios where access to high-field scanners is limited or impractical. Despite recent progress in diffusion models, diffusion-based approaches often struggle to balance fine-detail recovery and structural fidelity. In particular, the uncontrolled generation of high-resolution details in structurally ambiguous regions may introduce anatomically inconsistent patterns, such as spurious edges or artificial texture variations. These artifacts can bias downstream quantitative analysis. For example, they may cause inaccurate tissue boundary delineation or erroneous volumetric estimation, ultimately reducing clinical trust in synthesized images. These limitations highlight the need for generative models that are not only visually accurate but also spatially reliable and anatomically consistent. To address this issue, we propose a reliability-aware diffusion framework (ReDiff) that improves synthesis robustness at both the sampling and post-generation stages. Specifically, we introduce a reliability-guided sampling strategy to suppress unreliable responses during the denoising process. We further develop an uncertainty-aware multi-candidate selection scheme to enhance the reliability of the final prediction. Experiments on multi-center MRI datasets demonstrate improved structural fidelity and reduced artifacts compared with state-of-the-art methods.
Chinese Translation
超低场到高场MRI合成已成为在硬件和采集限制下提高图像质量的一种经济有效的策略,尤其是在高场扫描仪的获取受限或不切实际的情况下。尽管扩散模型近期取得了进展,但基于扩散的方法往往难以平衡细节恢复与结构保真度。特别是在结构模糊区域中,高分辨率细节的无控制生成可能引入解剖不一致的模式,例如虚假的边缘或人工纹理变化。这些伪影可能会偏倚后续的定量分析。例如,它们可能导致组织边界的错误描绘或体积估计的错误,最终降低临床对合成图像的信任。这些局限性突显了需要不仅在视觉上准确,而且在空间上可靠和解剖上一致的生成模型。为了解决这个问题,我们提出了一种可靠性感知的扩散框架(ReDiff),在采样和生成后阶段提高合成的鲁棒性。具体而言,我们引入了一种可靠性引导的采样策略,以抑制去噪过程中不可靠的响应。我们进一步开发了一种不确定性感知的多候选选择方案,以增强最终预测的可靠性。在多中心MRI数据集上的实验表明,与最先进的方法相比,结构保真度有所提高,伪影减少。
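One plausible reading of the uncertainty-aware multi-candidate selection, sketched with pixelwise variance across diffusion samples as the uncertainty map; this is an assumption for illustration, not ReDiff's documented rule:

```python
import numpy as np

def select_reliable_candidate(candidates):
    """Sketch: treat the pixelwise variance across K sampled outputs as an
    uncertainty map, then return the candidate closest to the ensemble
    mean in the low-uncertainty regions (high-variance pixels are
    downweighted, since agreement there is not expected)."""
    stack = np.stack(candidates)               # (K, H, W)
    mean = stack.mean(axis=0)
    var = stack.var(axis=0)                    # per-pixel uncertainty
    w = 1.0 / (1.0 + var)                      # downweight unreliable pixels
    scores = [(w * (c - mean) ** 2).mean() for c in stack]
    return int(np.argmin(scores)), var

rng = np.random.default_rng(0)
base = rng.normal(size=(16, 16))
cands = [base + 0.01 * rng.normal(size=base.shape) for _ in range(4)]
cands.append(base + 0.5 * rng.normal(size=base.shape))  # one outlier sample
idx, uncertainty = select_reliable_candidate(cands)
assert idx != 4                      # the outlier candidate is not selected
assert uncertainty.shape == (16, 16)
```

Selecting an actual sample, rather than averaging the candidates, avoids the blurring that a pixelwise mean would introduce into anatomical boundaries.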
cs.CV / 15 / 2603.11346
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
学习辅助:基于物理的人际控制通过多智能体强化学习
Abstract
Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner-policy initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and a contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.
Chinese Translation
类人机器人在日常服务和护理应用中具有巨大的变革潜力。尽管最近在物理引擎中的一般运动跟踪(GMT)方面的进展使得虚拟角色和类人机器人能够重现广泛的人类运动,但这些行为主要限于无接触的社交互动或孤立的运动。相比之下,辅助场景需要对人类伙伴的持续关注以及对其不断变化的姿势和动态的快速适应。在本文中,我们将紧密互动、力交换的人际运动序列的模仿形式化为一个多智能体强化学习问题。我们在物理模拟器中共同训练支持者(助手)智能体和接收者智能体的伙伴感知策略,以跟踪辅助运动参考。为了使这个问题可处理,我们引入了一种伙伴策略初始化方案,该方案从单人运动跟踪控制器中转移先验知识,大大改善了探索效果。我们进一步提出了动态参考重定向和促进接触的奖励,这些方法将助手的参考运动适应于接收者的实时姿势,并鼓励物理上有意义的支持。我们展示了AssistMimic是第一种能够成功跟踪辅助交互运动的算法,且在已建立的基准测试中表现出色,证明了多智能体强化学习在物理基础和社会意识的类人控制中的优势。
cs.CV / 16 / 2603.11380
DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding
DriveXQA:用于不良驾驶场景理解的跨模态视觉问答
Abstract
Fusing sensors with complementary modalities is crucial for maintaining a stable and comprehensive understanding of abnormal driving scenes. However, Multimodal Large Language Models (MLLMs) are underexplored for leveraging multi-sensor information to understand adverse driving scenarios in autonomous vehicles. To address this gap, we propose the DriveXQA, a multimodal dataset for autonomous driving VQA. In addition to four visual modalities, five sensor failure cases, and five weather conditions, it includes $102,505$ QA pairs categorized into three types: global scene level, allocentric level, and ego-vehicle centric level. Since no existing MLLM framework adopts multiple complementary visual modalities as input, we design MVX-LLM, a token-efficient architecture with a Dual Cross-Attention (DCA) projector that fuses the modalities to alleviate information redundancy. Experiments demonstrate that our DCA achieves improved performance under challenging conditions such as foggy (GPTScore: $53.5$ vs. $25.1$ for the baseline). The established dataset and source code will be made publicly available.
Chinese Translation
融合具有互补模态的传感器对于维持对异常驾驶场景的稳定和全面理解至关重要。然而,针对利用多传感器信息理解自动驾驶车辆中的不良驾驶场景,现有的多模态大语言模型(MLLMs)尚未得到充分探索。为了解决这一问题,我们提出了DriveXQA,一个用于自动驾驶视觉问答(VQA)的多模态数据集。该数据集除了包含四种视觉模态、五种传感器故障案例和五种天气条件外,还包括$102,505$个问答对,分为三类:全局场景级、分配中心级和自车中心级。由于现有的MLLM框架没有采用多种互补视觉模态作为输入,我们设计了MVX-LLM,这是一种高效的令牌架构,配备了双重交叉注意力(DCA)投影器,融合这些模态以减轻信息冗余。实验表明,在雾天等挑战性条件下,我们的DCA实现了性能的提升(GPTScore: $53.5$ 对比基线的 $25.1$)。建立的数据集和源代码将公开发布。
cs.CV / 17 / 2603.11389
High-Precision 6DOF Pose Estimation via Global Phase Retrieval in Fringe Projection Profilometry for 3D Mapping
通过全局相位恢复实现高精度六自由度位姿估计:用于三维映射的条纹投影轮廓测量
Abstract
Digital fringe projection (DFP) enables micrometer-level 3D reconstruction, yet extending it to large-scale mapping remains challenging because six-degree-of-freedom pose estimation often cannot match the reconstruction's precision. Conventional iterative closest point (ICP) registration becomes inefficient on multi-million-point clouds and typically relies on downsampling or feature-based selection, which can reduce local detail and degrade pose precision. Drift-correction methods improve long-term consistency but do not resolve sampling sensitivity in dense DFP point clouds. We propose a high-precision pose estimation method that augments a moving DFP system with a fixed, intrinsically calibrated global projector. Using the global projector's phase-derived pixel constraints and a PnP-style reprojection objective, the method estimates the DFP system pose in a fixed reference frame without relying on deterministic feature extraction, and we experimentally demonstrate sampling invariance under coordinate-preserving subsampling. Experiments demonstrate sub-millimeter pose accuracy against a reference with quantified uncertainty bounds, high repeatability under aggressive subsampling, robust operation on homogeneous surfaces and low-overlap views, and reduced error accumulation when used to correct ICP-based trajectories. The method extends DFP toward accurate 3D mapping in quasi-static scenarios such as inspection and metrology, with the trade-off of time-multiplexed acquisition for the additional projector measurements.
Chinese Translation
数字条纹投影(DFP)能够实现微米级的三维重建,但将其扩展到大规模映射仍然面临挑战,因为六自由度位姿估计往往无法与重建精度相匹配。传统的迭代最近点(ICP)配准在数百万点云上变得低效,通常依赖于下采样或基于特征的选择,这可能会减少局部细节并降低位姿精度。漂移校正方法提高了长期一致性,但并未解决密集DFP点云中的采样敏感性。我们提出了一种高精度位姿估计方法,该方法通过固定的、内在标定的全局投影仪增强移动DFP系统。该方法利用全局投影仪的相位导出的像素约束和PnP风格的重投影目标,在不依赖于确定性特征提取的情况下估计DFP系统在固定参考框架中的位姿,并通过实验验证了在坐标保持子采样下的采样不变性。实验结果表明,与具有量化不确定性边界的参考相比,该方法实现了亚毫米级的位姿精度,在激进下采样下具有高重复性,在均匀表面和低重叠视图下表现出强大的操作能力,并在用于校正基于ICP的轨迹时减少了误差累积。该方法将DFP扩展到准静态场景下的准确三维映射,如检测和计量,权衡了时间复用采集与额外投影仪测量之间的关系。
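The PnP-style reprojection objective at the core of such a method is standard and can be sketched directly; the intrinsics, pose, and 3-D points below are made-up values for illustration:

```python
import numpy as np

def project(points_world, R, t, K):
    """Pinhole projection of 3-D world points into pixel coordinates
    for a camera with rotation R, translation t, and intrinsics K."""
    cam = (R @ points_world.T).T + t          # world frame -> camera frame
    uv = (K @ cam.T).T                        # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]             # perspective division

def reprojection_residual(uv_observed, points_world, R, t, K):
    """Mean reprojection error of a candidate pose (R, t) against the
    phase-derived pixel observations; a PnP-style solver minimizes this
    over R and t."""
    err = project(points_world, R, t, K) - uv_observed
    return float(np.linalg.norm(err, axis=1).mean())

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R_true, t_true = np.eye(3), np.zeros(3)
pts = np.array([[0.1, 0.0, 2.0], [-0.2, 0.1, 3.0],
                [0.0, -0.1, 2.5], [0.3, 0.2, 4.0]])
uv = project(pts, R_true, t_true, K)
assert reprojection_residual(uv, pts, R_true, t_true, K) < 1e-9  # zero at truth
# A 5 cm lateral pose error produces a clearly measurable pixel residual:
assert reprojection_residual(uv, pts, R_true, t_true + np.array([0.05, 0, 0]), K) > 1.0
```

Because every projector pixel with a decoded phase contributes a constraint, the objective stays dense and sampling-invariant, unlike feature-based correspondence selection.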
cs.CV / 18 / 2603.11403
DeepHistoViT: An Interpretable Vision Transformer Framework for Histopathological Cancer Classification
DeepHistoViT:一种可解释的视觉变换器框架用于组织病理学癌症分类
Abstract
Histopathology remains the gold standard for cancer diagnosis because it provides detailed cellular-level assessment of tissue morphology. However, manual histopathological examination is time-consuming, labour-intensive, and subject to inter-observer variability, creating a demand for reliable computer-assisted diagnostic tools. Recent advances in deep learning, particularly transformer-based architectures, have shown strong potential for modelling complex spatial dependencies in medical images. In this work, we propose DeepHistoViT, a transformer-based framework for automated classification of histopathological images. The model employs a customized Vision Transformer architecture with an integrated attention mechanism designed to capture fine-grained cellular structures while improving interpretability through attention-based localization of diagnostically relevant regions. The framework is evaluated on three publicly available histopathology datasets covering lung cancer, colon cancer, and acute lymphoblastic leukaemia. Experimental results demonstrate state-of-the-art performance across all datasets, with classification accuracy, precision, recall, F1-score, and ROC-AUC reaching 100 percent on the lung and colon cancer datasets, and 99.85 percent, 99.84 percent, 99.86 percent, 99.85 percent, and 99.99 percent respectively on the acute lymphoblastic leukaemia dataset. All performance metrics are reported with 95 percent confidence intervals. These results highlight the effectiveness of transformer-based architectures for histopathological image analysis and demonstrate the potential of DeepHistoViT as an interpretable computer-assisted diagnostic tool to support pathologists in clinical decision-making.
Chinese Translation
组织病理学仍然是癌症诊断的金标准,因为它提供了对组织形态的详细细胞级评估。然而,手动组织病理学检查耗时、劳动强度大,并且受到观察者间变异性的影响,这就产生了对可靠计算机辅助诊断工具的需求。最近在深度学习方面的进展,特别是基于变换器的架构,显示出在医学图像中建模复杂空间依赖关系的强大潜力。在本研究中,我们提出了DeepHistoViT,一种基于变换器的框架用于自动分类组织病理学图像。该模型采用定制的视觉变换器架构,集成了注意力机制,旨在捕捉细粒度的细胞结构,同时通过基于注意力的诊断相关区域定位提高可解释性。该框架在三个公开可用的组织病理学数据集上进行了评估,涵盖肺癌、结肠癌和急性淋巴细胞白血病。实验结果显示在所有数据集上均达到最先进的性能,其中肺癌和结肠癌数据集的分类准确率、精确度、召回率、F1-score和ROC-AUC均达到100%,而急性淋巴细胞白血病数据集的相应指标分别为99.85%、99.84%、99.86%、99.85%和99.99%。所有性能指标均报告了95%的置信区间。这些结果突显了基于变换器的架构在组织病理学图像分析中的有效性,并展示了DeepHistoViT作为可解释的计算机辅助诊断工具在支持病理学家临床决策中的潜力。
cs.CV / 19 / 2603.11410
Seeing Isn't Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs
看并不等于定向:一个基于认知的基准揭示了多模态大语言模型中的系统性定向失败
Abstract
Humans learn object orientation progressively, from recognizing which way an object faces, to mentally rotating it, to reasoning about orientations between objects. Current vision-language benchmarks largely conflate orientation with position and general scene understanding. We introduce Discriminative Orientation Reasoning Intelligence (DORI), a cognitively grounded hierarchical benchmark that makes object orientation the primary target. Inspired by stages of human orientation cognition, DORI decomposes orientation into four dimensions, each evaluated at coarse (categorical) and granular (metric) levels. Composed from 13,652 images across 14 sources, DORI provides 33,656 multiple-choice questions covering 67 object categories in real-world and synthetic settings. Its coarse-to-granular design isolates orientation from confounds such as object recognition difficulty, scene clutter, and linguistic ambiguity via bounding-box isolation, standardized spatial reference frames, and structured prompts. Evaluating 24 state-of-the-art vision-language models shows a clear pattern: models that perform well on general spatial benchmarks are near-random on object-centric orientation tasks. The best models reach only 54.2% on coarse and 45.0% on granular judgments, with largest failures on compound rotations and shifts in inter-object reference frames. Large coarse-to-granular gaps reveal reliance on categorical heuristics rather than geometric reasoning, a limitation hidden by existing benchmarks. These results identify orientation understanding as an unsolved challenge for multimodal systems, with implications for robotic manipulation, 3D scene reconstruction, and human-AI interaction.
Chinese Translation
人类逐步学习物体的定向,从识别物体的朝向,到进行心理旋转,再到推理物体之间的定向。目前的视觉-语言基准在很大程度上将定向与位置和一般场景理解混为一谈。我们引入了区分性定向推理智能(Discriminative Orientation Reasoning Intelligence, DORI),这是一个基于认知的分层基准,将物体定向作为主要目标。DORI受到人类定向认知阶段的启发,将定向分解为四个维度,每个维度在粗略(分类)和细粒度(度量)水平上进行评估。DORI由来自14个来源的13,652张图像组成,提供了33,656个多项选择题,涵盖67个物体类别,涉及现实世界和合成环境。其粗到细的设计通过边界框隔离、标准化空间参考框架和结构化提示,将定向与物体识别难度、场景杂乱和语言模糊等混淆因素隔离开来。对24个最先进的视觉-语言模型的评估显示出明显的模式:在一般空间基准上表现良好的模型在以物体为中心的定向任务上表现接近随机。表现最好的模型在粗略判断上仅达到54.2%,在细粒度判断上仅达到45.0%,在复合旋转和物体间参考框架的变化上表现最差。粗到细的差距揭示了模型依赖于分类启发式而非几何推理,这一局限性被现有基准所掩盖。这些结果表明,定向理解是多模态系统面临的一个未解决的挑战,对机器人操控、3D场景重建和人机交互具有重要影响。
cs.CV / 20 / 2603.11417
Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations
端到端自动驾驶中的零样本跨城市泛化:自监督与监督表示的比较
Abstract
End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.
Chinese Translation
端到端自动驾驶模型通常在多城市数据集上使用监督的 ImageNet 预训练骨干网络进行训练,但它们对未见城市的泛化能力仍然很大程度上未被检验。当训练和评估数据在地理上混合时,模型可能会隐式依赖于城市特定的线索,从而掩盖在向新位置泛化时可能发生的失败模式。在本研究中,我们探讨了端到端轨迹规划中的零样本跨城市泛化,并探究自监督视觉表示是否能改善跨城市的迁移。我们通过将自监督骨干网络(I-JEPA、DINOv2 和 MAE)整合到规划框架中进行全面研究。我们在开环设置下对 nuScenes 进行严格的地理拆分评估,并在闭环评估协议下对 NAVSIM 进行评估。我们的实验揭示了在不同道路拓扑和驾驶习惯的城市之间迁移依赖于传统监督骨干网络的模型时,存在显著的泛化差距,特别是在从右侧驾驶环境迁移到左侧驾驶环境时。自监督表示学习减少了这一差距。在开环评估中,监督骨干网络在从波士顿迁移到新加坡时表现出严重的膨胀(L2 位移比为 9.77 倍,碰撞比为 19.43 倍),而领域特定的自监督预训练将其分别减少到 1.20 倍和 0.75 倍。在闭环评估中,自监督预训练使所有单城市训练城市的 PDMS 提高了最多 4%。这些结果表明,表示学习对跨城市规划的鲁棒性有着重要影响,并确立了零样本地理迁移作为评估端到端自动驾驶系统的必要测试。
cs.CV / 21 / 2603.11421
ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation
ShotVerse:推动基于文本的多镜头视频创作中的电影摄像机控制
Abstract
Text-driven video generation has democratized film creation, but camera control in cinematic multi-shot scenarios remains a significant roadblock. Implicit textual prompts lack precision, while explicit trajectory conditioning imposes prohibitive manual overhead and often triggers execution failures in current models. To overcome this bottleneck, we propose a data-centric paradigm shift, positing that aligned (Caption, Trajectory, Video) triplets form an inherent joint distribution that can connect automated plotting and precise execution. Guided by this insight, we present ShotVerse, a "Plan-then-Control" framework that decouples generation into two collaborative agents: a VLM (Vision-Language Model)-based Planner that leverages spatial priors to obtain cinematic, globally aligned trajectories from text, and a Controller that renders these trajectories into multi-shot video content via a camera adapter. Central to our approach is the construction of a data foundation: we design an automated multi-shot camera calibration pipeline that aligns disjoint single-shot trajectories into a unified global coordinate system. This facilitates the curation of ShotVerse-Bench, a high-fidelity cinematic dataset with a three-track evaluation protocol that serves as the bedrock for our framework. Extensive experiments demonstrate that ShotVerse effectively bridges the gap between unreliable textual control and labor-intensive manual plotting, achieving superior cinematic aesthetics and generating multi-shot videos that are both camera-accurate and cross-shot consistent.
Chinese Translation
基于文本的视频生成使电影创作变得更加民主化,但在电影多镜头场景中的摄像机控制仍然是一个重大障碍。隐式文本提示缺乏精确性,而显式轨迹条件则带来了高昂的手动开销,并且常常导致当前模型的执行失败。为了解决这一瓶颈,我们提出了一种以数据为中心的范式转变,认为对齐的(Caption, Trajectory, Video)三元组形成了一个固有的联合分布,可以连接自动绘制和精确执行。在这一洞察的指导下,我们提出了ShotVerse,一个“先规划再控制”的框架,将生成过程解耦为两个协作代理:一个基于VLM(视觉-语言模型)的规划器,利用空间先验从文本中获取电影化的、全局对齐的轨迹;一个控制器,通过摄像机适配器将这些轨迹渲染成多镜头视频内容。我们方法的核心是构建一个数据基础:我们设计了一条自动化的多镜头摄像机标定管道,将不相交的单镜头轨迹对齐到统一的全局坐标系。这促进了ShotVerse-Bench的构建,这是一个高保真度的电影数据集,具有三轨评估协议,为我们的框架提供了基础。大量实验表明,ShotVerse有效地弥合了不可靠的文本控制与劳动密集型手动绘制之间的差距,达到了卓越的电影美学,并生成了既具摄像机精度又跨镜头一致的多镜头视频。
cs.CV / 22 / 2603.11423
Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
超越单样本:可靠的多样本蒸馏用于视频理解
Abstract
Traditional black-box distillation for Large Vision-Language Models (LVLMs) typically relies on a single teacher response per input, which often yields high-variance responses and format inconsistencies in multimodal or temporal scenarios. To mitigate this unreliable supervision, we propose R-MSD (Reliable Multi-Sample Distillation), a framework that explicitly models teacher sampling variance to enhance distillation stability. Rather than relying on a single teacher response, our approach leverages a task-adaptive teacher pool to provide robust supervision tailored to both closed-ended and open-ended reasoning. By integrating quality-aware signal matching with an adversarial distillation objective, our approach effectively filters teacher noise while maximizing knowledge transfer. Extensive evaluations across comprehensive video understanding benchmarks demonstrate that R-MSD consistently outperforms single-sample distillation methods. We additionally include an original SFT+RL 4B baseline under the same training budget, which shows only marginal gains, while our method achieves significant improvements. With a 4B student model, our approach delivers gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%).
Chinese Translation
传统的黑箱蒸馏方法对于大型视觉语言模型(Large Vision-Language Models, LVLMs)通常依赖于每个输入的单一教师响应,这往往导致高方差的响应和多模态或时间场景中的格式不一致。为了减轻这种不可靠的监督,我们提出了R-MSD(可靠的多样本蒸馏)框架,该框架明确建模教师采样方差,以增强蒸馏的稳定性。我们的方法不再依赖于单一教师响应,而是利用任务自适应的教师池提供针对封闭式和开放式推理的强健监督。通过将质量感知信号匹配与对抗蒸馏目标相结合,我们的方法有效过滤教师噪声,同时最大化知识转移。在全面的视频理解基准测试中,广泛评估表明R-MSD始终优于单样本蒸馏方法。我们还在相同的训练预算下包含了一个原始的SFT+RL 4B基线,该基线仅显示出边际增益,而我们的方法则实现了显著的改进。使用4B学生模型,我们的方法在VideoMME上提升了1.5%,在Video-MMMU上提升了3.2%,在MathVerse上提升了3.6%。
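The multi-sample idea at the core of R-MSD can be illustrated with a small sketch. The scoring function, threshold, and aggregation below are illustrative assumptions, not the paper's actual objective: sample several teacher responses, filter out low-quality ones, and form a quality-weighted distillation target.

```python
# Illustrative sketch of multi-sample distillation targets (the quality
# scores, threshold, and weighted average are assumptions for exposition,
# not R-MSD's actual formulation).

def build_target(samples, quality, min_q=0.5):
    """samples: teacher response vectors; quality: score per sample in
    [0, 1]. Returns a quality-weighted average of accepted samples."""
    kept = [(s, q) for s, q in zip(samples, quality) if q >= min_q]
    if not kept:  # fall back to the single best sample
        kept = [max(zip(samples, quality), key=lambda x: x[1])]
    total = sum(q for _, q in kept)
    dim = len(kept[0][0])
    return [sum(q * s[i] for s, q in kept) / total for i in range(dim)]

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
quality = [0.9, 0.2, 0.9]          # the middle response is filtered out
print(build_target(samples, quality))  # [1.0, 0.5]
```

Averaging only over high-quality survivors is one simple way to suppress the high-variance responses that single-sample distillation inherits from the teacher.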
cs.CV / 23 / 2603.11439
Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
保持在你的领域内:具有重叠抑制损失的角色特定查询用于密集视频字幕生成
Abstract
Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them with natural language. While query-based frameworks enable the simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant multi-task interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose utilizing role-specific queries that separate localization and captioning into independent components, allowing each to exclusively learn its role. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism in which mutual temporal overlaps across queries are penalized to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance semantic richness in captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on major DVC benchmarks YouCook2 and ActivityNet Captions.
Chinese Translation
密集视频字幕生成(DVC)是一项具有挑战性的多模态任务,涉及在视频中对多个事件进行时间上的定位并用自然语言进行描述。虽然基于查询的框架能够实现定位和字幕生成的同时端到端处理,但它们对共享查询的依赖往往导致这两项任务之间显著的多任务干扰,以及定位中的时间冗余。在本文中,我们提出利用角色特定查询,将定位和字幕生成分离为独立的组件,使每个组件能够专注于学习其特定角色。随后,我们采用对比对齐来强制相应输出之间的语义一致性,确保分离查询之间的行为一致。此外,我们设计了一种新颖的抑制机制,对查询之间的相互时间重叠进行惩罚,以解决时间冗余问题,指导模型学习不同的、非重叠的事件区域,从而实现更精确的定位。此外,我们引入了一个轻量级模块,捕捉核心事件概念,通过概念级表示进一步增强字幕的语义丰富性。我们通过在主要的DVC基准数据集YouCook2和ActivityNet Captions上进行广泛实验,证明了我们方法的有效性。
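The overlap-suppression idea above can be sketched as a simple pairwise penalty; the segment representation and normalization here are illustrative assumptions, and the paper's actual loss may differ.

```python
# Hypothetical sketch of an overlap-suppression penalty: predicted event
# segments are (start, end) pairs, and every pair of distinct segments is
# penalized by its temporal intersection-over-union (tIoU).

def t_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def overlap_suppression_loss(segments):
    """Mean pairwise tIoU over all distinct segment pairs; 0 means the
    queries cover disjoint event regions."""
    n = len(segments)
    if n < 2:
        return 0.0
    total = sum(t_iou(segments[i], segments[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

# Disjoint segments incur no penalty; coincident ones are fully penalized.
print(overlap_suppression_loss([(0.0, 0.3), (0.4, 0.7)]))  # 0.0
print(overlap_suppression_loss([(0.0, 0.5), (0.0, 0.5)]))  # 1.0
```

Minimizing such a term drives queries toward distinct, non-overlapping event regions, which is the behavior the paper supervises for.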
cs.CV / 24 / 2603.11441
Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection
实时检测任何物体:从单提示分割到多类别检测
Abstract
Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass. Detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared between all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weight. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.
Chinese Translation
最近在视觉-语言建模方面的进展产生了可提示的检测和分割系统,这些系统在推理时接受任意自然语言查询。在这些系统中,SAM3通过结合ViT-H/14主干网络、跨模态变换器解码和学习的物体查询,实现了最先进的准确性。然而,SAM3每次前向传播仅处理一个文本提示。检测N个类别需要N次独立执行,每次执行都由439M参数的主干网络主导。我们提出了实时检测任何物体(DART),这是一个无训练框架,通过利用结构不变性将SAM3转换为实时多类别检测器:视觉主干网络与类别无关,生成与文本提示无关的图像特征。这使得主干计算可以在所有类别之间共享,将其成本从O(N)降低到O(1)。结合批量多类别解码、仅检测推理和TensorRT FP16部署,这些优化在3个类别下实现了5.6倍的累计加速,在80个类别下扩展至25倍,而无需修改任何模型权重。在COCO val2017(5000张图像,80个类别)上,DART在单个RTX 4080上以15.8 FPS(4个类别,1008x1008)达到了55.8 AP,超过了在数百万个框注释上训练的专用开放词汇检测器。对于极低延迟目标,使用冻结的编码器-解码器进行适配器蒸馏达到了38.7 AP,主干网络延迟为13.9毫秒。代码和模型可在https://github.com/mkturkcan/DART获取。
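The structural invariant DART exploits, a prompt-independent visual backbone, can be illustrated with a minimal sketch; the function names below are stand-ins for exposition, not SAM3's real API.

```python
# Sketch of class-agnostic backbone sharing (illustrative stand-ins, not
# SAM3's actual interfaces): the image is encoded once and the cached
# features are reused for every class prompt.

def encode_image(image):
    # Stand-in for the heavy, prompt-independent visual backbone.
    return {"features": sum(image)}

def decode(features, prompt):
    # Stand-in for the lightweight cross-modal decoder.
    return f"{prompt}:{features['features']}"

def detect_naive(image, prompts):
    # O(N) backbone calls: one full forward pass per class.
    return [decode(encode_image(image), p) for p in prompts]

def detect_shared(image, prompts):
    # O(1) backbone calls: encode once, decode all classes from the cache.
    feats = encode_image(image)
    return [decode(feats, p) for p in prompts]

prompts = ["cat", "dog", "car"]
assert detect_naive([1, 2, 3], prompts) == detect_shared([1, 2, 3], prompts)
```

Because the backbone dominates the per-pass cost, sharing it across N prompts is where the reported multi-class speedups come from, with batched decoding handling the remaining per-class work.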
cs.CV / 25 / 2603.11460
Follow the Saliency: Supervised Saliency for Retrieval-augmented Dense Video Captioning
跟随显著性:用于检索增强的密集视频字幕的监督显著性
Abstract
Existing retrieval-augmented approaches for Dense Video Captioning (DVC) often fail to achieve accurate temporal segmentation aligned with true event boundaries, as they rely on heuristic strategies that overlook ground truth event boundaries. The proposed framework, STaRC, overcomes this limitation by supervising frame-level saliency through a highlight detection module. Note that the highlight detection module is trained on binary labels derived directly from DVC ground truth annotations without the need for additional annotation. We also propose to utilize the saliency scores as a unified temporal signal that drives retrieval via saliency-guided segmentation and informs caption generation through explicit Saliency Prompts injected into the decoder. By enforcing saliency-constrained segmentation, our method produces temporally coherent segments that align closely with actual event transitions, leading to more accurate retrieval and contextually grounded caption generation. We conduct comprehensive evaluations on the YouCook2 and ViTT benchmarks, where STaRC achieves state-of-the-art performance across most of the metrics. Our code is available at https://github.com/ermitaju1/STaRC
Chinese Translation
现有的检索增强密集视频字幕(DVC)方法往往无法实现与真实事件边界对齐的准确时间分割,因为它们依赖于忽略真实事件边界的启发式策略。我们提出的框架 STaRC 通过高亮检测模块对帧级显著性进行监督,从而克服了这一限制。值得注意的是,高亮检测模块是基于直接来自DVC真实标注的二元标签进行训练的,无需额外的标注。我们还建议利用显著性分数作为统一的时间信号,通过显著性引导的分割驱动检索,并通过注入解码器的显著性提示(Saliency Prompts)来指导字幕生成。通过强制显著性约束的分割,我们的方法生成的时间段与实际事件过渡紧密对齐,从而实现更准确的检索和上下文相关的字幕生成。我们在YouCook2和ViTT基准上进行了全面评估,STaRC在大多数指标上达到了最先进的性能。我们的代码可在https://github.com/ermitaju1/STaRC获取。
cs.CV / 26 / 2603.11481
INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs
INFACT:视频大语言模型中诱发的可信性和事实性幻觉的诊断基准
Abstract
Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.
Chinese Translation
尽管取得了快速进展,视频大语言模型(Video-LLMs)仍然不可靠,主要由于幻觉现象,即输出与视频证据(可信性)或可验证的世界知识(事实性)相矛盾。现有基准对事实性幻觉的覆盖有限,且主要在干净环境中评估模型。我们提出了 INFACT,这是一个诊断基准,包含9,800个问答实例,具有细粒度的可信性和事实性分类,涵盖真实和合成视频。INFACT 在四种模式下评估模型:基础模式(Base,干净)、视觉退化、证据腐蚀和时间干预(针对顺序敏感项目)。在诱发模式下的可靠性通过抵抗率(Resist Rate, RR)和时间敏感性评分(Temporal Sensitivity Score, TSS)进行量化。在14个代表性的视频大语言模型上进行的实验表明,基础模式的更高准确性并不可靠地转化为诱发模式下的更高可靠性,证据腐蚀降低了稳定性,而时间干预导致了最大的退化。值得注意的是,许多开源基准在事实性方面的TSS接近零,表明在顺序敏感问题上存在显著的时间惯性。
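One plausible reading of the Resist Rate can be sketched as follows; this is an assumption about the metric's exact definition, which the abstract does not spell out: among items answered correctly in the clean Base mode, the fraction that remain correct under the induced perturbation.

```python
# Hedged sketch of a Resist Rate (RR) computation, assuming RR is the
# survival rate of Base-mode-correct answers under an induced mode (the
# paper's precise definition may differ).

def resist_rate(base_correct, induced_correct):
    """base_correct, induced_correct: parallel lists of booleans, one
    pair per benchmark item."""
    survived = resisted = 0
    for b, i in zip(base_correct, induced_correct):
        if b:
            survived += 1
            if i:
                resisted += 1
    return resisted / survived if survived else 0.0

# 3 of 4 base-correct answers survive the perturbation -> RR = 0.75.
print(resist_rate([True, True, True, True, False],
                  [True, False, True, True, False]))  # 0.75
```

Conditioning on Base-mode correctness is what separates robustness from raw accuracy, matching the paper's observation that the two need not track each other.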
cs.CV / 27 / 2603.11492
SPEGC: Continual Test-Time Adaptation via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation
SPEGC:通过语义提示增强图聚类实现医疗图像分割的持续测试时间适应
Abstract
In medical image segmentation tasks, the domain gap caused by the difference in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to solving this problem. However, existing CTTA methods often rely on unreliable supervisory signals, triggering a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA method based on Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined and high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring predictions are consistent at the cluster level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at https://github.com/Jwei-Z/SPEGC-for-MIS.
Chinese Translation
在医疗图像分割任务中,由于训练数据与测试数据在数据收集上的差异所造成的领域差距严重阻碍了预训练模型在临床实践中的部署。持续测试时间适应(CTTA)旨在使预训练模型能够适应不断变化的无标签领域,为解决这一问题提供了一种有效的方法。然而,现有的CTTA方法往往依赖于不可靠的监督信号,导致错误累积的自我强化循环,最终导致性能的灾难性下降。为克服这些挑战,我们提出了一种基于语义提示增强图聚类(SPEGC)的CTTA方法用于医疗图像分割。首先,我们设计了一种语义提示特征增强机制,利用解耦的共性和异质性提示池将全局上下文信息注入局部特征,从而减轻其在领域转移下对噪声干扰的敏感性。其次,基于这些增强特征,我们设计了一个可微分的图聚类求解器。该求解器将全局边缘稀疏化重新构建为一个最优运输问题,使其能够以端到端的方式将原始相似度矩阵提炼为精细的高阶结构表示。最后,这种稳健的结构表示被用来指导模型适应,确保预测在聚类级别上的一致性,并动态调整决策边界。大量实验表明,SPEGC在两个医疗图像分割基准测试中优于其他最先进的CTTA方法。源代码可在 https://github.com/Jwei-Z/SPEGC-for-MIS 获取。
cs.CV / 28 / 2603.11493
OrthoEraser: Coupled-Neuron Orthogonal Projection for Concept Erasure
OrthoEraser:用于概念消除的耦合神经元正交投影
Abstract
Text-to-image (T2I) models face significant safety risks from adversarial induction, yet current concept erasure methods often cause collateral damage to benign attributes when suppressing selected neurons entirely. This occurs because sensitive and benign semantics exhibit non-orthogonal superposition, sharing activation subspaces where their respective vectors are inherently entangled. To address this issue, we propose OrthoEraser, which leverages sparse autoencoders (SAE) to achieve high-resolution feature disentanglement and subsequently redefines erasure as an analytical orthogonalization projection that preserves the benign manifold's invariance. OrthoEraser first employs SAE to decompose dense activations and segregate sensitive neurons. It then uses coupled neuron detection to identify non-sensitive features vulnerable to intervention. The key novelty lies in an analytical gradient orthogonalization strategy that projects erasure vectors onto the null space of the coupled neurons. This orthogonally decouples the sensitive concepts from the identified critical benign subspace, effectively preserving non-sensitive semantics. Experimental results on safety benchmarks demonstrate that OrthoEraser achieves high erasure precision, effectively removing harmful content while preserving the integrity of the generative manifold, and significantly outperforming SOTA baselines. This paper contains results of unsafe models.
Chinese Translation
文本到图像(T2I)模型面临来自对抗性诱导的重大安全风险,然而当前的概念消除方法在完全抑制选定神经元时,往往会对良性属性造成附带损害。这是因为敏感和良性语义表现出非正交叠加,分享激活子空间,其各自的向量本质上是纠缠在一起的。为了解决这个问题,我们提出了OrthoEraser,它利用稀疏自编码器(SAE)实现高分辨率特征解耦,并随后将消除重新定义为一种分析正交化投影,以保持良性流形的不变性。OrthoEraser首先采用SAE分解密集激活并隔离敏感神经元。然后,它使用耦合神经元检测来识别易受干预的非敏感特征。关键的新颖性在于一种分析梯度正交化策略,该策略将消除向量投影到耦合神经元的零空间。这种正交解耦将敏感概念与识别出的关键良性子空间有效分离,从而有效保留非敏感语义。在安全性实验结果中,OrthoEraser实现了高消除精度,有效去除有害内容,同时保持生成流形的完整性,并显著优于现有的最先进基准。这篇论文包含了不安全模型的结果。
cs.CV / 29 / 2603.11498
ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation
ActiveFreq:将主动学习与频域分析相结合的交互式分割方法
Abstract
Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively. Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.
Chinese Translation
交互式分割在医学图像分析中被广泛应用,以获得精确的像素级标注,通常涉及迭代的用户输入来纠正错误标记的区域。然而,现有的方法往往未能充分利用来自交互输入的用户知识,并实现全面的特征提取。具体而言,这些方法倾向于平等对待所有错误标记的区域,随机选择它们进行细化,而不评估每个区域对分割质量的潜在影响。此外,大多数模型仅依赖空间域特征,忽视了可能增强特征提取和提高性能的频域信息。为了解决这些局限性,我们提出了ActiveFreq,一个新颖的交互式分割框架,集成了主动学习和频域分析,旨在最小化人工干预,同时实现高质量的标注。ActiveFreq引入了AcSelect,一个自主模块,优先选择最具信息量的错误标记区域,确保每次点击都能获得最大的性能提升。此外,我们开发了FreqFormer,一个包含傅里叶变换模块的分割主干,将特征从空间域映射到频域,从而实现更丰富的特征提取。在ISIC-2017和OAI-ZIB数据集上的评估表明,ActiveFreq在减少用户交互的情况下实现了高性能,在ISIC-2017上达到3.74 NoC@90,在OAI-ZIB上达到9.27 NoC@90,分别比之前的最佳结果提高了23.5%和12.8%。在最小输入条件下,例如两个点击,ActiveFreq在ISIC-2017和OAI-ZIB上分别达到了85.29%和75.76%的mIoU分数,突显了其在交互式医学分割中的效率和准确性。
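The spatial-to-frequency mapping performed by FreqFormer's Fourier transform module can be sketched minimally; the actual module architecture is not specified in the abstract, so the representation below is only an illustrative choice.

```python
# Minimal sketch of mapping a spatial feature map into the frequency
# domain with a 2D FFT (an illustrative stand-in for the Fourier module
# described in the abstract).

import numpy as np

def to_frequency_features(feat):
    """feat: (H, W) spatial feature map -> stacked magnitude and phase
    of its centered 2D spectrum."""
    spectrum = np.fft.fftshift(np.fft.fft2(feat))
    return np.stack([np.abs(spectrum), np.angle(spectrum)])

feat = np.arange(16, dtype=float).reshape(4, 4)
freq = to_frequency_features(feat)
print(freq.shape)  # (2, 4, 4)
```

Magnitude and phase expose periodic structure (texture, edges) that is diffuse in the spatial domain, which is the kind of complementary information the paper argues spatial-only backbones miss.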
cs.CV / 30 / 2603.11505
Gen-Fab: A Variation-Aware Generative Model for Predicting Fabrication Variations in Nanophotonic Devices
Gen-Fab:一种考虑变异的生成模型,用于预测纳米光子器件中的制造变异
Abstract
Silicon photonic devices often exhibit fabrication-induced variations such as over-etching, under-etching, and corner rounding, which can significantly alter device performance. These variations are non-uniform and are influenced by feature size and shape. Accurate digital twins are therefore needed to predict the range of possible fabricated outcomes for a given design. In this paper, we introduce Gen-Fab, a conditional generative adversarial network (cGAN) based on Pix2Pix to predict and model uncertainty in photonic fabrication outcomes. The proposed method takes a design layout (in GDS format) as input and produces diverse high-resolution predictions similar to scanning electron microscope (SEM) images of fabricated devices, capturing the range of process variations at the nanometer scale. To enable one-to-many mapping, we inject a latent noise vector at the model bottleneck. We compare Gen-Fab against three baselines: (1) a deterministic U-Net predictor, (2) an inference-time Monte Carlo Dropout U-Net, and (3) an ensemble of varied U-Nets. Evaluations on an out-of-distribution dataset of fabricated photonic test structures demonstrate that Gen-Fab outperforms all baselines in both accuracy and uncertainty modeling. An additional distribution shift analysis further confirms its strong generalization to unseen fabrication geometries. Gen-Fab achieves the highest intersection-over-union (IoU) score of 89.8%, outperforming the deterministic U-Net (85.3%), the MC-Dropout U-Net (83.4%), and the varied U-Net ensemble (85.8%). It also better aligns with the distribution of real fabrication outcomes, achieving lower Kullback-Leibler divergence and Wasserstein distance.
Chinese Translation
硅光子器件常常表现出由制造引起的变异,如过蚀刻、欠蚀刻和角落圆化,这些变异会显著影响器件性能。这些变异是不均匀的,并受到特征尺寸和形状的影响。因此,需要准确的数字双胞胎来预测给定设计的可能制造结果范围。本文介绍了Gen-Fab,一种基于Pix2Pix的条件生成对抗网络(cGAN),用于预测和建模光子制造结果的不确定性。所提出的方法以设计布局(以GDS格式)为输入,生成与制造器件的扫描电子显微镜(SEM)图像相似的多样化高分辨率预测,捕捉纳米尺度的工艺变异范围。为了实现一对多映射,我们在模型瓶颈处注入了潜在噪声向量。我们将Gen-Fab与三个基准进行比较:(1)确定性U-Net预测器,(2)推理时的蒙特卡洛丢弃U-Net,以及(3)多种U-Net的集成。在一个超出分布的数据集上对制造光子测试结构的评估表明,Gen-Fab在准确性和不确定性建模方面均优于所有基准。额外的分布转移分析进一步确认了其对未见制造几何形状的强泛化能力。Gen-Fab实现了最高的交并比(IoU)得分89.8%,超越了确定性U-Net(85.3%)、MC-Dropout U-Net(83.4%)和不同U-Net(85.8%)。它还更好地与真实制造结果的分布对齐,实现了更低的Kullback-Leibler散度和Wasserstein距离。
cs.CV / 31 / 2603.11509
Manifold-Optimal Guidance: A Unified Riemannian Control View of Diffusion Guidance
流形最优引导:扩散引导的统一黎曼控制视角
Abstract
Classifier-Free Guidance (CFG) serves as the de facto control mechanism for conditional diffusion, yet high guidance scales notoriously induce oversaturation, texture artifacts, and structural collapse. We attribute this failure to a geometric mismatch: standard CFG performs Euclidean extrapolation in ambient space, inadvertently driving sampling trajectories off the high-density data manifold. To resolve this, we present Manifold-Optimal Guidance (MOG), a framework that reformulates guidance as a local optimal control problem. MOG yields a closed-form, geometry-aware Riemannian update that corrects off-manifold drift without requiring retraining. Leveraging this perspective, we further introduce Auto-MOG, a dynamic energy-balancing schedule that adaptively calibrates guidance strength, effectively eliminating the need for manual hyperparameter tuning. Extensive validation demonstrates that MOG yields superior fidelity and alignment compared to baselines, with virtually no added computational overhead.
Chinese Translation
无分类器引导(Classifier-Free Guidance, CFG)作为条件扩散的实际控制机制,然而高引导尺度常常导致过饱和、纹理伪影和结构崩溃。我们将这一失败归因于几何不匹配:标准CFG在环境空间中执行欧几里得外推,意外地使采样轨迹偏离高密度数据流形。为了解决这个问题,我们提出了流形最优引导(Manifold-Optimal Guidance, MOG),一个将引导重新表述为局部最优控制问题的框架。MOG产生一个闭式解,具有几何感知的黎曼更新,能够在不需要重新训练的情况下纠正偏离流形的漂移。基于这一视角,我们进一步引入了自动MOG(Auto-MOG),一个动态能量平衡调度,能够自适应地校准引导强度,有效消除手动超参数调优的需求。广泛的验证表明,MOG在保真度和对齐性方面优于基线,几乎没有增加计算开销。
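For reference, standard CFG is a Euclidean extrapolation in ambient space. A hedged sketch of a geometry-aware correction is given below; it is illustrative only, not MOG's actual closed-form Riemannian update, and the off-manifold normal direction is an assumed input.

```python
import numpy as np

def cfg_update(eps_uncond, eps_cond, scale):
    # Standard classifier-free guidance: Euclidean extrapolation in
    # ambient space. Large `scale` is what drives trajectories off the
    # data manifold.
    return eps_uncond + scale * (eps_cond - eps_uncond)

def projected_guidance(eps_uncond, eps_cond, scale, normal):
    # Illustrative geometry-aware variant (not MOG's closed-form update):
    # drop the guidance component along an estimated off-manifold normal
    # direction before extrapolating.
    g = eps_cond - eps_uncond
    n = normal / np.linalg.norm(normal)
    return eps_uncond + scale * (g - np.dot(g, n) * n)

# The corrected step carries no component along the normal direction.
step = projected_guidance(np.zeros(3), np.array([1.0, 1.0, 0.0]),
                          2.0, np.array([0.0, 1.0, 0.0]))
print(step)  # [2. 0. 0.]
```

The contrast between the two functions is the paper's central point: the first extrapolates freely, while the second constrains the update to stay tangent to an (estimated) data manifold.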
cs.CV / 32 / 2603.11520
FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval
FBCIR:平衡组合图像检索中的跨模态关注
Abstract
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that facilitates existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.
Chinese Translation
组合图像检索(CIR)需要多模态模型共同推理视觉内容和文本-图像输入对中呈现的语义修改。尽管当前的CIR模型在常见基准案例上表现出色,但在更具挑战性的场景中,当负样本与查询图像或文本在语义上对齐时,它们的准确性往往会下降。本文将这种下降归因于关注不平衡,即模型过度关注一种模态而忽视另一种模态。为了验证这一说法,我们提出了FBCIR,一种多模态关注解释方法,旨在识别对模型检索决策最关键的视觉和文本输入成分。使用FBCIR,我们报告现有CIR模型中普遍存在关注不平衡,尤其是在困难负样本设置下。基于这些分析,我们进一步提出了一种CIR数据增强工作流程,该工作流程为现有CIR数据集提供了经过精心策划的困难负样本,以促进平衡的跨模态推理。在多个CIR模型上的广泛实验表明,所提出的增强方法在挑战性案例中始终提高了性能,同时保持了它们在标准基准上的能力。我们的解释方法和数据增强工作流程共同为CIR模型的诊断和鲁棒性改进提供了新的视角。
cs.CV / 33 / 2603.11521
EReCu: Pseudo-label Evolution Fusion and Refinement with Multi-Cue Learning for Unsupervised Camouflage Detection
EReCu:基于多线索学习的伪标签演化融合与精炼用于无监督伪装检测
Abstract
Unsupervised Camouflaged Object Detection (UCOD) remains a challenging task due to the high intrinsic similarity between target objects and their surroundings, as well as the reliance on noisy pseudo-labels that hinder fine-grained texture learning. While existing refinement strategies aim to alleviate label noise, they often overlook intrinsic perceptual cues, leading to boundary overflow and structural ambiguity. In contrast, learning without pseudo-label guidance yields coarse features with significant detail loss. To address these issues, we propose a unified UCOD framework that enhances both the reliability of pseudo-labels and the fidelity of features. Our approach introduces the Multi-Cue Native Perception module, which extracts intrinsic visual priors by integrating low-level texture cues with mid-level semantics, enabling precise alignment between masks and native object information. Additionally, Pseudo-Label Evolution Fusion intelligently refines labels through teacher-student interaction and utilizes depthwise separable convolution for efficient semantic denoising. It also incorporates Spectral Tensor Attention Fusion to effectively balance semantic and structural information through compact spectral aggregation across multi-layer attention maps. Finally, Local Pseudo-Label Refinement plays a pivotal role in local detail optimization by leveraging attention diversity to restore fine textures and enhance boundary fidelity. Extensive experiments on multiple UCOD datasets demonstrate that our method achieves state-of-the-art performance, characterized by superior detail perception, robust boundary alignment, and strong generalization under complex camouflage scenarios.
Chinese Translation
无监督伪装物体检测(UCOD)仍然是一项具有挑战性的任务,因为目标物体与其周围环境之间存在高度的内在相似性,并且依赖于嘈杂的伪标签,这阻碍了细粒度纹理学习。尽管现有的精炼策略旨在减轻标签噪声,但它们往往忽视了内在的感知线索,导致边界溢出和结构模糊。相比之下,在没有伪标签指导的情况下进行学习会产生粗糙的特征,且细节损失显著。为了解决这些问题,我们提出了一个统一的UCOD框架,增强了伪标签的可靠性和特征的保真度。我们的方法引入了多线索原生感知模块,通过将低级纹理线索与中级语义相结合,提取内在视觉先验,从而实现掩膜与原生物体信息之间的精确对齐。此外,伪标签演化融合通过教师-学生互动智能地精炼标签,并利用深度可分离卷积进行高效的语义去噪。它还结合了光谱张量注意力融合,通过在多层注意力图中进行紧凑的光谱聚合,有效平衡语义和结构信息。最后,局部伪标签精炼在局部细节优化中发挥关键作用,通过利用注意力多样性恢复细腻纹理并增强边界保真度。在多个UCOD数据集上的广泛实验表明,我们的方法实现了最先进的性能,具有卓越的细节感知、稳健的边界对齐能力以及在复杂伪装场景下的强泛化能力。
cs.CV / 34 / 2603.11525
MDS-VQA: Model-Informed Data Selection for Video Quality Assessment
MDS-VQA:基于模型的信息化数据选择用于视频质量评估
Abstract
Learning-based video quality assessment (VQA) has advanced rapidly, yet progress is increasingly constrained by a disconnect between model design and dataset curation. Model-centric approaches often iterate on fixed benchmarks, while data-centric efforts collect new human labels without systematically targeting the weaknesses of existing VQA models. Here, we describe MDS-VQA, a model-informed data selection mechanism for curating unlabeled videos that are both difficult for the base VQA model and diverse in content. Difficulty is estimated by a failure predictor trained with a ranking objective, and diversity is measured using deep semantic video features, with a greedy procedure balancing the two under a constrained labeling budget. Experiments across multiple VQA datasets and models demonstrate that MDS-VQA identifies diverse, challenging samples that are particularly informative for active fine-tuning. With only a 5% selected subset per target domain, the fine-tuned model improves mean SRCC from 0.651 to 0.722 and achieves the top gMAD rank, indicating strong adaptation and generalization.
Chinese Translation
基于学习的视频质量评估(VQA)技术发展迅速,但进展越来越受到模型设计与数据集策划之间脱节的限制。以模型为中心的方法通常在固定基准上进行迭代,而以数据为中心的努力则收集新的人工标签,却未能系统性地针对现有VQA模型的弱点。在此,我们描述了MDS-VQA,一种基于模型的信息化数据选择机制,用于策划那些对基础VQA模型既具有挑战性又内容多样的未标记视频。通过训练一个以排名为目标的失败预测器来估计困难程度,并使用深度语义视频特征来衡量多样性,采用贪婪的程序在有限的标注预算下平衡这两者。对多个VQA数据集和模型的实验表明,MDS-VQA能够识别出多样且具有挑战性的样本,这些样本对于主动微调尤其具有信息价值。在每个目标领域仅选择5%的子集的情况下,微调后的模型将平均SRCC从0.651提高到0.722,并获得了最高的gMAD排名,表明其具有强大的适应性和泛化能力。
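The greedy difficulty-diversity trade-off under a labeling budget can be sketched as follows; the scoring functions and the trade-off weight `lam` are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of model-informed greedy data selection: repeatedly pick
# the unlabeled video maximizing predicted difficulty plus distance-to-
# selected diversity, until the labeling budget is spent.

import math

def select_videos(candidates, difficulty, features, budget, lam=0.5):
    """candidates: ids; difficulty: id -> failure-predictor score;
    features: id -> feature vector; budget: number of videos to label."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def score(c):
            div = (min(dist(features[c], features[s]) for s in selected)
                   if selected else 0.0)
            return difficulty[c] + lam * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

diff = {"a": 0.9, "b": 0.8, "c": 0.1}
feats = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (5.0, 5.0)}
# "a" wins on difficulty; "c" then beats "b" because it is far from "a".
print(select_videos(["a", "b", "c"], diff, feats, budget=2))  # ['a', 'c']
```

The min-distance diversity term is a standard facility-location-style heuristic; any submodular surrogate would play the same role of preventing the budget from being spent on near-duplicate hard cases.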
cs.CV / 35 / 2603.11531
Mobile-GS: Real-time Gaussian Splatting for Mobile Devices
移动-GS:用于移动设备的实时高斯点云渲染
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications. However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the absence of an explicit rendering order. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS can achieve both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks. Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and compact model size while preserving high visual quality, making it well-suited for mobile applications.
Chinese Translation
三维高斯点云渲染(3D Gaussian Splatting, 3DGS)已成为高质量渲染的强大表示方法,广泛应用于多种场景。然而,其高计算需求和大存储成本对移动设备的部署构成了重大挑战。在本研究中,我们提出了一种针对移动设备优化的实时高斯点云渲染方法,称为移动-GS(Mobile-GS),使得在边缘设备上高效推理高斯点云成为可能。具体而言,我们首先识别出α混合(alpha blending)作为主要的计算瓶颈,因为它依赖于耗时的高斯深度排序过程。为了解决这个问题,我们提出了一种深度感知的顺序无关渲染方案,消除了排序的需求,从而显著加速了渲染过程。尽管这种顺序无关渲染提高了渲染速度,但由于缺乏明确的渲染顺序,它可能在重叠几何体的区域引入透明度伪影。为了解决这一问题,我们提出了一种神经视图依赖增强策略,使得能够更准确地建模视图依赖效应,该策略依赖于观察方向、三维高斯几何体和外观属性。通过这种方式,移动-GS能够实现高质量和实时渲染。此外,为了便于在内存受限的移动平台上部署,我们还引入了一阶球谐蒸馏(first-order spherical harmonics distillation)、一种神经向量量化技术,以及基于贡献的剪枝策略,以减少高斯原语的数量,并在神经网络的帮助下压缩三维高斯表示。大量实验表明,我们提出的移动-GS在保持高视觉质量的同时,实现了实时渲染和紧凑模型大小,使其非常适合移动应用。
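Depth-aware order-independent blending can be illustrated in the spirit of weighted blended order-independent transparency; this is a stand-in for the paper's actual scheme, which may differ. Each splat contributes through a depth-dependent weight, so the result is identical for any draw order and no sorting is required.

```python
# Illustrative order-independent alpha blending (weighted-blended-OIT
# style, not Mobile-GS's exact formulation): per-splat contributions and
# the total transmittance are both order-free accumulations.

def blend_oit(splats, background=0.0):
    """splats: list of (color, alpha, depth); nearer splats (smaller
    depth) get larger weights via a simple monotone falloff."""
    accum_c = 0.0
    accum_w = 1e-9                       # guards the empty-input case
    transmittance = 1.0
    for color, alpha, depth in splats:
        w = alpha / (1.0 + depth)        # depth-aware, order-free weight
        accum_c += w * color
        accum_w += w
        transmittance *= (1.0 - alpha)   # product is also order-free
    avg = accum_c / accum_w
    return avg * (1.0 - transmittance) + background * transmittance

splats = [(1.0, 0.5, 1.0), (0.2, 0.5, 2.0)]
# Commutative: reversing the draw order changes nothing.
assert abs(blend_oit(splats) - blend_oit(splats[::-1])) < 1e-12
```

The price of dropping the sort is exactly the transparency artifact the abstract mentions: overlapping splats are averaged rather than composited front-to-back, which Mobile-GS compensates for with its neural view-dependent enhancement.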
cs.CV / 36 / 2603.11534
Risk-Controllable Multi-View Diffusion for Driving Scenario Generation
可控风险的多视角扩散驱动场景生成
Abstract
Generating safety-critical driving scenarios is crucial for evaluating and improving autonomous driving systems, but long-tail risky situations are rarely observed in real-world data and difficult to specify through manual scenario design. Existing generative approaches typically treat risk as an after-the-fact label and struggle to maintain geometric consistency in multi-view driving scenes. We present RiskMV-DPO, a general and systematic pipeline for physically-informed, risk-controllable multi-view scenario generation. By integrating target risk levels with physically-grounded risk modeling, we autonomously synthesize diverse and high-stakes dynamic trajectories that serve as explicit geometric anchors for a diffusion-based video generator. To ensure spatial-temporal coherence and geometric fidelity, we introduce a geometry-appearance alignment module and a region-aware direct preference optimization (RA-DPO) strategy with motion-aware masking to focus learning on localized dynamic regions. Experiments on the nuScenes dataset show that RiskMV-DPO can freely generate a wide spectrum of diverse long-tail scenarios while maintaining state-of-the-art visual quality, improving 3D detection mAP from 18.17 to 30.50 and reducing FID to 15.70. Our work shifts the role of world models from passive environment prediction to proactive, risk-controllable synthesis, providing a scalable toolchain for the safety-oriented development of embodied intelligence.
Chinese Translation
生成安全关键的驾驶场景对于评估和改善自动驾驶系统至关重要,但在真实世界数据中,长尾风险情况很少被观察到,并且通过手动场景设计很难指定。现有的生成方法通常将风险视为事后标签,并且在多视角驾驶场景中难以保持几何一致性。我们提出了RiskMV-DPO,这是一个通用且系统化的物理信息驱动的可控风险多视角场景生成管道。通过将目标风险水平与基于物理的风险建模相结合,我们自主合成多样化且高风险的动态轨迹,这些轨迹作为扩散视频生成器的明确几何锚点。为了确保时空一致性和几何保真性,我们引入了几何-外观对齐模块和区域感知直接偏好优化(RA-DPO)策略,并采用运动感知掩码来集中学习于局部动态区域。在nuScenes数据集上的实验表明,RiskMV-DPO能够自由生成广泛的多样化长尾场景,同时保持最先进的视觉质量,将3D检测的mAP从18.17提高到30.50,并将FID降低到15.70。我们的工作将世界模型的角色从被动环境预测转变为主动的、可控风险的合成,为以安全为导向的具身智能发展提供了可扩展的工具链。
cs.CV / 37 / 2603.11542
ReHARK: Refined Hybrid Adaptive RBF Kernels for Robust One-Shot Vision-Language Adaptation
ReHARK:用于稳健单样本视觉-语言适应的精炼混合自适应RBF核
Abstract
The adaptation of large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks with extremely limited data -- specifically in the one-shot regime -- is often hindered by a significant "Stability-Plasticity" dilemma. While efficient caching mechanisms have been introduced by training-free methods such as Tip-Adapter, these approaches often function as local Nadaraya-Watson estimators. Such estimators are characterized by inherent boundary bias and a lack of global structural regularization. In this paper, ReHARK (Refined Hybrid Adaptive RBF Kernels) is proposed as a synergistic training-free framework that reinterprets few-shot adaptation through global proximal regularization in a Reproducing Kernel Hilbert Space (RKHS). A multistage refinement pipeline is introduced, consisting of: (1) Hybrid Prior Construction, where zero-shot textual knowledge from CLIP and GPT-3 is fused with visual class prototypes to form a robust semantic-visual anchor; (2) Support Set Augmentation (Bridging), where intermediate samples are generated to smooth the transition between visual and textual modalities; (3) Adaptive Distribution Rectification, where test feature statistics are aligned with the augmented support set to mitigate domain shifts; and (4) Multi-Scale RBF Kernels, where an ensemble of kernels is employed to capture complex feature geometries across diverse scales. Superior stability and accuracy are demonstrated through extensive experiments on 11 diverse benchmarks. A new state-of-the-art for one-shot adaptation is established by ReHARK, which achieves an average accuracy of 65.83%, significantly outperforming existing baselines. Code is available at https://github.com/Jahid12012021/ReHARK.
Chinese Translation
大规模视觉-语言模型(VLMs)如CLIP在极其有限的数据下适应下游任务——特别是在单样本(one-shot)学习的情况下——常常受到显著的“稳定性-可塑性”困境的阻碍。虽然通过无训练的方法如Tip-Adapter引入了高效的缓存机制,但这些方法通常作为局部的Nadaraya-Watson估计器运作。这类估计器的特点是固有的边界偏差和缺乏全局结构正则化。在本文中,提出了ReHARK(精炼混合自适应RBF核)作为一个协同的无训练框架,通过在再生核希尔伯特空间(RKHS)中进行全局近端正则化,重新诠释了少样本适应。引入了一个多阶段的精炼管道,包括:(1)混合先验构建,将来自CLIP和GPT-3的零样本文本知识与视觉类别原型融合,以形成稳健的语义-视觉锚;(2)支持集增强(桥接),生成中间样本以平滑视觉和文本模态之间的过渡;(3)自适应分布校正,将测试特征统计与增强的支持集对齐,以减轻领域转移;(4)多尺度RBF核,采用核的集成以捕捉不同尺度下复杂的特征几何。通过在11个多样化基准上的广泛实验,展示了卓越的稳定性和准确性。ReHARK建立了单样本适应的新的最先进水平,平均准确率达到65.83%,显著优于现有基准。代码可在https://github.com/Jahid12012021/ReHARK获取。
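The Nadaraya-Watson cache-model view that ReHARK starts from (Tip-Adapter-style, training-free) reduces to a kernel-weighted vote over stored support features. The following NumPy sketch of a multi-scale RBF cache classifier is a generic illustration on toy data, not ReHARK's refined pipeline (no hybrid priors, bridging, or distribution rectification):

```python
import numpy as np

def rbf_cache_logits(x, support, labels_onehot, gammas=(0.5, 1.0, 2.0)):
    """Training-free few-shot logits: multi-scale RBF kernel vote over a
    cached support set (a Nadaraya-Watson-style estimator)."""
    d2 = ((support - x) ** 2).sum(axis=1)      # squared distances to cache
    logits = np.zeros(labels_onehot.shape[1])
    for g in gammas:                            # ensemble over kernel scales
        logits += np.exp(-g * d2) @ labels_onehot
    return logits / len(gammas)

# toy example: two well-separated classes, two shots each
support = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.eye(2)[[0, 0, 1, 1]]
pred = np.argmax(rbf_cache_logits(np.array([0.05, 0.1]), support, labels))
```

Averaging kernels at several `gammas` is the multi-scale part; a single gamma recovers the plain cache estimator.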
cs.CV / 38 / 2603.11543
Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
Mango-GS:利用多帧节点引导的4D高斯点云增强动态场景重建中的时空一致性
Abstract
Reconstructing dynamic 3D scenes with photorealistic detail and strong temporal coherence remains a significant challenge. Existing Gaussian splatting approaches for dynamic scene modeling often rely on per-frame optimization, which can overfit to instantaneous states instead of capturing underlying motion dynamics. To address this, we present Mango-GS, a multi-frame, node-guided framework for high-fidelity 4D reconstruction. Mango-GS leverages a temporal Transformer to model motion dependencies within a short window of frames, producing temporally consistent deformations. For efficiency, temporal modeling is confined to a sparse set of control nodes. Each node is represented by a decoupled canonical position and a latent code, providing a stable semantic anchor for motion propagation and preventing correspondence drift under large motion. Our framework is trained end-to-end, enhanced by an input masking strategy and two multi-frame losses to improve robustness. Extensive experiments demonstrate that Mango-GS achieves state-of-the-art reconstruction quality and real-time rendering speed, enabling high-fidelity reconstruction and interactive rendering of dynamic scenes.
Chinese Translation
重建具有照片级真实感细节和强时序一致性的动态3D场景仍然是一个重大挑战。现有的动态场景建模的高斯点云方法通常依赖于逐帧优化,这可能导致对瞬时状态的过拟合,而无法捕捉潜在的运动动态。为了解决这个问题,我们提出了Mango-GS,一种用于高保真4D重建的多帧节点引导框架。Mango-GS利用时间变换器(temporal Transformer)在短时间帧窗口内建模运动依赖性,从而产生时序一致的变形。为了提高效率,时间建模仅限于一组稀疏的控制节点。每个节点由一个解耦的标准位置和一个潜在编码表示,为运动传播提供了稳定的语义锚点,并防止在大运动下的对应漂移。我们的框架采用端到端训练,结合输入掩蔽策略和两个多帧损失来提高鲁棒性。大量实验表明,Mango-GS实现了最先进的重建质量和实时渲染速度,使得动态场景的高保真重建和交互式渲染成为可能。
cs.CV / 39 / 2603.11550
PCA-Enhanced Probabilistic U-Net for Effective Ambiguous Medical Image Segmentation
基于PCA增强的概率U-Net用于有效的模糊医学图像分割
Abstract
Ambiguous Medical Image Segmentation (AMIS) is important for addressing the inherent uncertainty that arises from image ambiguity, noise, and subjective annotations. Existing conditional variational autoencoder (cVAE)-based methods effectively capture uncertainty but face limitations, including redundancy in high-dimensional latent spaces and the limited expressiveness of single posterior networks. To overcome these issues, we introduce a novel PCA-Enhanced Probabilistic U-Net (PEP U-Net). Our method incorporates Principal Component Analysis (PCA) for dimensionality reduction in the posterior network to mitigate redundancy and improve computational efficiency. Additionally, we employ an inverse PCA operation to reconstruct critical information, enhancing the latent space's representational capacity. Compared to conventional generative models, our method preserves the ability to generate diverse segmentation hypotheses while achieving a superior balance between segmentation accuracy and predictive variability, thereby advancing the performance of generative modeling in medical image segmentation.
Chinese Translation
模糊医学图像分割(AMIS)对于应对图像模糊性、噪声和主观标注所带来的固有不确定性挑战具有重要意义。现有的基于条件变分自编码器(cVAE)的方法有效地捕捉了不确定性,但面临高维潜在空间冗余和单一后验网络表达能力有限等局限。为了解决这些问题,我们提出了一种新颖的基于PCA增强的概率U-Net(PEP U-Net)。我们的方法有效地结合了主成分分析(PCA)用于后验网络的降维,以减轻冗余并提高计算效率。此外,我们进一步采用逆PCA操作重构关键信息,增强潜在空间的表征能力。与传统生成模型相比,我们的方法在生成多样化分割假设的同时,能够在分割准确性和预测变异性之间实现更优的平衡,从而推动了医学图像分割中生成建模的性能提升。
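The PCA round-trip at the heart of PEP U-Net's posterior compression (project the latent down, then map back with inverse PCA) can be sketched in a few lines of NumPy. This is a generic PCA demo on synthetic latents; the sizes (32-dim latent, 4 components) are chosen for illustration, not taken from the paper:

```python
import numpy as np

def fit_pca(Z, k):
    """PCA via SVD: returns the mean and the top-k principal directions."""
    mu = Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
    return mu, Vt[:k]                              # (d,), (k, d)

def reduce(z, mu, W):
    return (z - mu) @ W.T                          # d -> k projection

def reconstruct(c, mu, W):
    return c @ W + mu                              # k -> d (inverse PCA)

rng = np.random.default_rng(1)
# latents that really live on a 4-dim subspace of a 32-dim space
Z = rng.standard_normal((256, 4)) @ rng.standard_normal((4, 32))
mu, W = fit_pca(Z, k=4)
z = Z[0]
z_hat = reconstruct(reduce(z, mu, W), mu, W)
```

When the latents genuinely concentrate on a low-dimensional subspace, the inverse projection recovers them almost exactly, which is what makes the reduced posterior cheaper without sacrificing expressiveness.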
cs.CV / 40 / 2603.11554
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
MANSION:用于长时程任务的多层语言到3D场景生成
Abstract
Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.
Chinese Translation
现实世界中的机器人任务通常是长时程的,并且往往跨越多个楼层,这要求丰富的空间推理能力。然而,现有的具身基准大多局限于单层室内环境,无法反映现实任务的复杂性。我们提出了MANSION,这是第一个用于生成建筑规模的多层3D环境的语言驱动框架。MANSION考虑了垂直结构的限制,生成现实的、可导航的整栋建筑结构,拥有多样化的人性化场景,从而支持跨楼层长时程任务的开发和评估。在此框架基础上,我们发布了MansionWorld,一个包含超过1000个多样化建筑的数据集,涵盖医院到办公室等多种类型,并提供了一个任务语义场景编辑代理,能够使用开放词汇命令定制这些环境以满足特定用户需求。基准测试显示,最先进的代理在我们的设置中表现急剧下降,确立了MANSION作为下一代空间推理和规划的重要测试平台。
cs.CV / 41 / 2603.11556
Enhancing Image Aesthetics with Dual-Conditioned Diffusion Models Guided by Multimodal Perception
通过多模态感知引导的双条件扩散模型增强图像美学
Abstract
Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advancements in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetic. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, there is a scarcity of "perfectly-paired" images that have consistent content but distinct aesthetic qualities. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert the ambiguous aesthetic instruction into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.
Chinese Translation
图像美学增强旨在识别图像中的美学缺陷并执行相应的编辑操作,这是一项高度挑战性的任务,需要模型具备创造力和美学感知能力。尽管近期图像编辑模型的进展显著提升了其可控性和灵活性,但在增强图像美学方面仍面临困难。主要挑战有两个:首先,遵循具有美学感知的编辑指令是困难的;其次,缺乏具有一致内容但美学特质不同的“完美配对”图像。在本文中,我们提出了双重监督图像美学增强(Dual-supervised Image Aesthetic Enhancement, DIAE),这是一种基于扩散的生成模型,具备多模态美学感知能力。首先,DIAE结合了多模态美学感知(Multimodal Aesthetic Perception, MAP),通过(i)在多个美学属性上采用详细、标准化的美学指令,将模糊的美学指令转化为明确的指导,以及(ii)利用来自文本-图像对的多模态控制信号,这些信号在相同美学属性内保持一致性。其次,为了缓解“完美配对”图像的缺乏,我们收集了一个名为IIAEData的“非完美配对”数据集,该数据集由具有不同美学特质但语义相同的图像组成。为了更好地利用IIAEData在训练过程中的弱匹配特性,我们还引入了一个双分支监督框架,用于弱监督图像美学增强。实验结果表明,DIAE在基准测试中表现优于其他方法,并获得了更高的图像美学评分和图像内容一致性评分。
cs.CV / 42 / 2603.11557
TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision
TornadoNet:基于序数监督的实时建筑损伤检测
Abstract
We present TornadoNet, a comprehensive benchmark for automated street-level building damage assessment evaluating how modern real-time object detection architectures and ordinal-aware supervision strategies perform under realistic post-disaster conditions. TornadoNet provides the first controlled benchmark demonstrating how architectural design and loss formulation jointly influence multi-level damage detection from street-view imagery, delivering methodological insights and deployable tools for disaster response. Using 3,333 high-resolution geotagged images and 8,890 annotated building instances from the 2021 Midwest tornado outbreak, we systematically compare CNN-based detectors from the YOLO family against transformer-based models (RT-DETR) for multi-level damage detection. Models are trained under standardized protocols using a five-level damage classification framework based on IN-CORE damage states, validated through expert cross-annotation. Baseline experiments reveal complementary architectural strengths. CNN-based YOLO models achieve highest detection accuracy and throughput, with larger variants reaching 46.05% mAP@0.5 at 66-276 FPS on A100 GPUs. Transformer-based RT-DETR models exhibit stronger ordinal consistency, achieving 88.13% Ordinal Top-1 Accuracy and MAOE of 0.65, indicating more reliable severity grading despite lower baseline mAP. To align supervision with the ordered nature of damage severity, we introduce soft ordinal classification targets and evaluate explicit ordinal-distance penalties. RT-DETR trained with calibrated ordinal supervision achieves 44.70% mAP@0.5, a 4.8 percentage-point improvement, with gains in ordinal metrics (91.15% Ordinal Top-1 Accuracy, MAOE = 0.56). These findings establish that ordinal-aware supervision improves damage severity estimation when aligned with detector architecture. Model & Data: https://github.com/crumeike/TornadoNet
Chinese Translation
我们提出了TornadoNet,这是一个全面的基准,用于自动化街景建筑损伤评估,评估现代实时目标检测架构和序数感知监督策略在现实灾后条件下的表现。TornadoNet提供了第一个受控基准,展示了架构设计和损失函数形式如何共同影响来自街景图像的多级损伤检测,提供了方法论见解和可部署的灾害响应工具。我们使用来自2021年中西部龙卷风灾害的3,333张高分辨率地理标记图像和8,890个标注建筑实例,系统地比较了YOLO系列的基于CNN的检测器与基于变换器的模型(RT-DETR)在多级损伤检测中的表现。模型在标准化协议下进行训练,使用基于IN-CORE损伤状态的五级损伤分类框架,并通过专家交叉标注进行验证。基线实验揭示了互补的架构优势。基于CNN的YOLO模型实现了最高的检测准确率和吞吐量,较大变体在A100 GPU上达到46.05%的mAP@0.5和66-276 FPS。基于变换器的RT-DETR模型表现出更强的序数一致性,达到88.13%的序数Top-1准确率和0.65的MAOE,尽管基线mAP较低,但表明其在严重性分级上更可靠。为了使监督与损伤严重性的有序性质对齐,我们引入了软序数分类目标,并评估了显式的序数距离惩罚。经过校准的序数监督训练的RT-DETR模型实现了44.70%的mAP@0.5,提升了4.8个百分点,并在序数指标上有所提高(91.15%的序数Top-1准确率,MAOE = 0.56)。这些发现表明,当序数感知监督与检测器架构对齐时,可以改善损伤严重性估计。模型与数据: https://github.com/crumeike/TornadoNet
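The "soft ordinal classification targets" and the MAOE metric named in the entry above can be made concrete. A common construction (assumed here; the paper may calibrate differently) spreads label mass over neighboring damage levels with a temperature, and MAOE is the mean absolute difference between predicted and true levels:

```python
import numpy as np

def soft_ordinal_targets(y, num_classes, tau=1.0):
    """Distance-aware soft label: mass decays with ordinal distance from y."""
    k = np.arange(num_classes)
    logits = -np.abs(k - y) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()                 # valid probability distribution

def maoe(y_true, y_pred):
    """Mean Absolute Ordinal Error between predicted and true damage levels."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# five IN-CORE-style damage levels, true level 2
t = soft_ordinal_targets(y=2, num_classes=5, tau=1.0)
```

Unlike a one-hot target, predicting level 1 or 3 for a true level 2 is penalized less than predicting level 0 or 4, matching the ordered severity scale.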
cs.CV / 43 / 2603.11563
SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
SVLL:用于物理基础具身任务规划的分阶段视觉-语言学习
Abstract
Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature -- optimizing only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path -- often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
Chinese Translation
具身任务规划要求视觉-语言模型生成既具视觉基础又在时间上因果一致的动作序列。然而,现有的训练范式面临一个关键的权衡:联合端到端训练往往导致过早的时间绑定,而标准的强化学习方法则遭受优化不稳定性。为了解决这一问题,我们提出了分阶段视觉-语言学习(SVLL),这是一个统一的三阶段框架,用于稳健的物理基础具身规划。在前两个阶段,SVLL将空间基础与时间推理解耦,在引入序列动作历史之前建立稳健的视觉依赖。在最后阶段,我们识别出标准直接偏好优化(DPO)的一个关键限制,即其纯相对性质——仅优化胜利和失败轨迹之间的偏好差距,而忽视了对最佳路径的绝对可能性约束,这往往会导致不安全或虚幻的行为。为了解决这个问题,我们进一步引入了偏差-DPO(Bias-DPO),这是一种新颖的对齐目标,通过显式最大化真实动作的可能性并惩罚过于自信的虚幻行为,向专家轨迹注入归纳偏差。通过将策略锚定在专家流形上并减轻因果错位,SVLL在Bias-DPO的支持下,确保严格遵循环境的可供性,并有效抑制物理上不可能的捷径。最后,在交互式AI2-THOR基准和现实世界机器人部署上的广泛实验表明,SVLL在任务成功率上优于最先进的开源模型(如Qwen2.5-VL-7B)和闭源模型(如GPT-4o,Gemini-2.0-flash),同时显著减少了物理约束的违反。
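The Bias-DPO idea, keeping the standard relative preference term while anchoring absolute likelihood on expert actions, can be sketched as a scalar loss. This simplified form drops the reference-policy log-ratios of full DPO for brevity; `beta` and `lam` are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def logsigmoid(x):
    return -np.log1p(np.exp(-x))

def bias_dpo_loss(logp_w, logp_l, beta=0.1, lam=0.5):
    """Sketch of a DPO objective with an absolute-likelihood anchor:
    the usual relative preference term plus a penalty that keeps the
    log-likelihood of the expert (winning) trajectory from collapsing."""
    preference = -logsigmoid(beta * (logp_w - logp_l))  # relative DPO term
    anchor = -lam * logp_w                              # NLL on expert actions
    return preference + anchor
```

With `lam = 0` the loss depends only on the gap `logp_w - logp_l`, so the winning trajectory's absolute likelihood can drift arbitrarily low; the anchor term removes that degeneracy.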
cs.CV / 44 / 2603.11566
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
R4Det:用于高性能3D物体检测的4D雷达-摄像头融合
Abstract
4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.
Chinese Translation
4D雷达-摄像头感知配置在自动驾驶中越来越重要。然而,现有的融合4D雷达和摄像头数据的3D物体检测方法面临若干挑战。首先,它们的绝对深度估计模块不够稳健和准确,导致3D定位不准确。其次,当自车位姿缺失或不准确时,它们的时间融合模块性能会显著下降甚至失效。第三,对于一些小物体,稀疏的雷达点云可能完全无法从其表面反射。在这种情况下,检测必须完全依赖视觉单模态先验。为了解决这些限制,我们提出了R4Det,通过全景深度融合模块增强深度估计质量,实现绝对深度和相对深度之间的相互增强。对于时间融合,我们设计了一个不依赖自车位姿的可变形门控时间融合模块。此外,我们构建了一个实例引导动态细化模块,从2D实例引导中提取语义原型。实验表明,R4Det在TJ4DRadSet和VoD数据集上实现了最先进的3D物体检测结果。
cs.CV / 45 / 2603.11593
WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
WeEdit:一个用于文本中心图像编辑的数据集、基准测试和字形引导框架
Abstract
Instruction-based image editing aims to modify specific content within existing images according to user-provided instructions while preserving non-target regions. Beyond traditional object- and style-centric manipulation, text-centric image editing focuses on modifying, translating, or rearranging textual elements embedded within images. However, existing leading models often struggle to execute complex text editing precisely, frequently producing blurry or hallucinated characters. We attribute these failures primarily to the lack of specialized training paradigms tailored for text-centric editing, as well as the absence of large-scale datasets and standardized benchmarks necessary for a closed-loop training and evaluation system. To address these limitations, we present WeEdit, a systematic solution encompassing a scalable data construction pipeline, two benchmarks, and a tailored two-stage training strategy. Specifically, we propose a novel HTML-based automatic editing pipeline, which generates 330K training pairs covering diverse editing operations and 15 languages, accompanied by standardized bilingual and multilingual benchmarks for comprehensive evaluation. On the algorithmic side, we employ glyph-guided supervised fine-tuning to inject explicit spatial and content priors, followed by a multi-objective reinforcement learning stage to align generation with instruction adherence, text clarity, and background preservation. Extensive experiments demonstrate that WeEdit outperforms previous open-source models by a clear margin across diverse editing operations.
Chinese Translation
基于指令的图像编辑旨在根据用户提供的指令修改现有图像中的特定内容,同时保留非目标区域。与传统的对象和风格中心的操作不同,文本中心的图像编辑专注于修改、翻译或重新排列嵌入图像中的文本元素。然而,现有的领先模型在执行复杂的文本编辑时往往表现不佳,常常产生模糊或虚构的字符。我们将这些失败主要归因于缺乏专门针对文本中心编辑的训练范式,以及缺乏大规模数据集和标准化基准测试,这些都是闭环训练和评估系统所必需的。为了解决这些局限性,我们提出了WeEdit,一个系统化的解决方案,包括一个可扩展的数据构建管道、两个基准测试和一个量身定制的两阶段训练策略。具体而言,我们提出了一种新颖的基于HTML的自动编辑管道,生成了涵盖多种编辑操作和15种语言的330K训练对,并配备了标准化的双语和多语种基准测试,以便进行全面评估。在算法方面,我们采用字形引导的监督微调,以注入明确的空间和内容先验,随后进行多目标强化学习阶段,以使生成与指令遵循、文本清晰度和背景保留保持一致。大量实验表明,WeEdit在多种编辑操作中明显优于之前的开源模型。
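WeEdit's HTML-based data pipeline can be pictured as template substitution: the same markup rendered twice, differing only in the text node, yields a perfectly aligned (source, target, instruction) triple. The sketch below builds the markup strings only; the real pipeline renders them to images and covers many operations across 15 languages, and the template and instruction format here are hypothetical:

```python
TEMPLATE = '<div style="font-family:{font};font-size:{size}px">{text}</div>'

def make_edit_pair(src_text, tgt_text, font="serif", size=32):
    """Generate one (source, target, instruction) training triple by swapping
    the text node inside a shared HTML template, so layout and style are
    identical and only the rendered text differs."""
    src = TEMPLATE.format(font=font, size=size, text=src_text)
    tgt = TEMPLATE.format(font=font, size=size, text=tgt_text)
    instruction = f'Replace the text "{src_text}" with "{tgt_text}"'
    return src, tgt, instruction

src, tgt, instr = make_edit_pair("SALE 50%", "SALE 70%")
```

Because every non-text attribute is shared between the pair, any pixel difference after rendering is attributable to the text edit itself.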
cs.CV / 46 / 2603.11605
LaMoGen: Language to Motion Generation Through LLM-Guided Symbolic Inference
LaMoGen:通过LLM引导的符号推理实现语言到运动的生成
Abstract
Human motion is highly expressive and naturally aligned with language, yet prevailing methods relying heavily on joint text-motion embeddings struggle to synthesize temporally accurate, detailed motions and often lack explainability. To address these limitations, we introduce LabanLite, a motion representation developed by adapting and extending the Labanotation system. Unlike black-box text-motion embeddings, LabanLite encodes each atomic body-part action (e.g., a single left-foot step) as a discrete Laban symbol paired with a textual template. This abstraction decomposes complex motions into interpretable symbol sequences and body-part instructions, establishing a symbolic link between high-level language and low-level motion trajectories. Building on LabanLite, we present LaMoGen, a Text-to-LabanLite-to-Motion Generation framework that enables large language models (LLMs) to compose motion sequences through symbolic reasoning. The LLM interprets motion patterns, relates them to textual descriptions, and recombines symbols into executable plans, producing motions that are both interpretable and linguistically grounded. To support rigorous evaluation, we introduce a Labanotation-based benchmark with structured description-motion pairs and three metrics that jointly measure text-motion alignment across symbolic, temporal, and harmony dimensions. Experiments demonstrate that LaMoGen establishes a new baseline for both interpretability and controllability, outperforming prior methods on our benchmark and two public datasets. These results highlight the advantages of symbolic reasoning and agent-based design for language-driven motion synthesis.
Chinese Translation
人类运动具有高度的表现力,自然与语言相一致,但现有方法过于依赖联合文本-运动嵌入,难以合成时间上准确、细节丰富的运动,并且往往缺乏可解释性。为了解决这些局限性,我们引入了LabanLite,这是一种通过调整和扩展Labanotation系统而开发的运动表示。与黑箱文本-运动嵌入不同,LabanLite将每个原子身体部位动作(例如,单个左脚步伐)编码为与文本模板配对的离散Laban符号。这种抽象将复杂运动分解为可解释的符号序列和身体部位指令,建立了高层语言与低层运动轨迹之间的符号联系。在LabanLite的基础上,我们提出了LaMoGen,一个文本到LabanLite再到运动生成的框架,使大型语言模型(LLMs)能够通过符号推理组合运动序列。LLM解释运动模式,将其与文本描述关联,并重新组合符号为可执行计划,生成既可解释又与语言相结合的运动。为了支持严格的评估,我们引入了一个基于Labanotation的基准,包含结构化的描述-运动对和三个度量标准,联合测量符号、时间和和谐维度上的文本-运动对齐。实验表明,LaMoGen在可解释性和可控性方面建立了新的基准,在我们的基准和两个公共数据集上优于先前的方法。这些结果突显了符号推理和基于代理的设计在语言驱动运动合成中的优势。
cs.CV / 47 / 2603.11606
Articulat3D: Reconstructing Articulated Digital Twins From Monocular Videos with Geometric and Motion Constraints
Articulat3D:基于几何和运动约束从单目视频重建关节数字孪生
Abstract
Building high-fidelity digital twins of articulated objects from visual data remains a central challenge. Existing approaches depend on multi-view captures of the object in discrete, static states, which severely constrains their real-world scalability. In this paper, we introduce Articulat3D, a novel framework that constructs such digital twins from casually captured monocular videos by jointly enforcing explicit 3D geometric and motion constraints. We first propose Motion Prior-Driven Initialization, which leverages 3D point tracks to exploit the low-dimensional structure of articulated motion. By modeling scene dynamics with a compact set of motion bases, we facilitate soft decomposition of the scene into multiple rigidly-moving groups. Building on this initialization, we introduce Geometric and Motion Constraints Refinement, which enforces physically plausible articulation through learnable kinematic primitives parameterized by a joint axis, a pivot point, and per-frame motion scalars, yielding reconstructions that are both geometrically accurate and temporally coherent. Extensive experiments demonstrate that Articulat3D achieves state-of-the-art performance on synthetic benchmarks and real-world casually captured monocular videos, significantly advancing the feasibility of digital twin creation under uncontrolled real-world conditions. Our project page is at https://maxwell-zhao.github.io/Articulat3D.
Chinese Translation
从视觉数据构建高保真度的关节物体数字孪生仍然是一个核心挑战。现有的方法依赖于对物体在离散静态状态下的多视角捕捉,这严重限制了它们在现实世界中的可扩展性。本文介绍了Articulat3D,一个新颖的框架,通过联合施加显式的三维几何和运动约束,从随意捕捉的单目视频中构建这些数字孪生。我们首先提出了运动先验驱动初始化(Motion Prior-Driven Initialization),利用三维点轨迹来利用关节运动的低维结构。通过用一组紧凑的运动基来建模场景动态,我们促进了将场景软分解为多个刚性移动组。在此初始化基础上,我们引入了几何和运动约束细化(Geometric and Motion Constraints Refinement),通过可学习的运动学原语(kinematic primitives)来施加物理上合理的关节运动,这些原语由关节轴、支点和每帧运动标量参数化,从而产生几何上准确且时间上连贯的重建。大量实验表明,Articulat3D在合成基准和现实世界随意捕获的单目视频上实现了最先进的性能,显著推动了在不受控的现实世界条件下创建数字孪生的可行性。我们的项目页面位于 https://maxwell-zhao.github.io/Articulat3D。
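A revolute kinematic primitive of the kind Articulat3D parameterizes (a joint axis, a pivot point, and a per-frame motion scalar) is a Rodrigues rotation about an axis through the pivot. A minimal NumPy sketch, not the paper's learnable implementation:

```python
import numpy as np

def articulate(points, axis, pivot, angle):
    """Rotate part points about a joint (unit axis through pivot) by `angle`,
    i.e. one revolute kinematic primitive: axis + pivot + per-frame scalar."""
    a = axis / np.linalg.norm(axis)
    p = points - pivot
    # Rodrigues' rotation formula: v cosθ + (a×v) sinθ + a (a·v)(1-cosθ)
    rotated = (p * np.cos(angle)
               + np.cross(a, p) * np.sin(angle)
               + a * (p @ a)[:, None] * (1 - np.cos(angle)))
    return rotated + pivot

# a point on the x-axis swung 90° about the z-axis through the origin
pts = np.array([[1.0, 0.0, 0.0]])
out = articulate(pts, axis=np.array([0.0, 0.0, 1.0]),
                 pivot=np.zeros(3), angle=np.pi / 2)
```

In the full system the axis, pivot, and per-frame scalars would be optimized jointly with the reconstruction losses; here they are fixed inputs for illustration.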
cs.CV / 48 / 2603.11607
DyWeight: Dynamic Gradient Weighting for Few-Step Diffusion Sampling
DyWeight:用于少步扩散采样的动态梯度加权
Abstract
Diffusion Models (DMs) have achieved state-of-the-art generative performance across multiple modalities, yet their sampling process remains prohibitively slow due to the need for hundreds of function evaluations. Recent progress in multi-step ODE solvers has greatly improved efficiency by reusing historical gradients, but existing methods rely on handcrafted coefficients that fail to adapt to the non-stationary dynamics of diffusion sampling. To address this limitation, we propose Dynamic Gradient Weighting (DyWeight), a lightweight, learning-based multi-step solver that introduces a streamlined implicit coupling paradigm. By relaxing classical numerical constraints, DyWeight learns unconstrained time-varying parameters that adaptively aggregate historical gradients while intrinsically scaling the effective step size. This implicit time calibration accurately aligns the solver's numerical trajectory with the model's internal denoising dynamics under large integration steps, avoiding complex decoupled parameterizations and optimizations. Extensive experiments on CIFAR-10, FFHQ, AFHQv2, ImageNet64, LSUN-Bedroom, Stable Diffusion and FLUX.1-dev demonstrate that DyWeight achieves superior visual fidelity and stability with significantly fewer function evaluations, establishing a new state-of-the-art among efficient diffusion solvers. Code is available at https://github.com/Westlake-AGI-Lab/DyWeight
Chinese Translation
扩散模型(Diffusion Models, DMs)在多个模态上实现了最先进的生成性能,但其采样过程仍然因需要数百次函数评估而显得极其缓慢。最近在多步常微分方程(ODE)求解器方面的进展通过重用历史梯度大大提高了效率,但现有方法依赖于手工设计的系数,无法适应扩散采样的非平稳动态。为了解决这一限制,我们提出了动态梯度加权(Dynamic Gradient Weighting, DyWeight),这是一种轻量级的基于学习的多步求解器,引入了一种简化的隐式耦合范式。通过放宽经典数值约束,DyWeight 学习无约束的时间变化参数,这些参数能够自适应地聚合历史梯度,同时内在地调整有效步长。这种隐式时间校准能够准确地将求解器的数值轨迹与模型的内部去噪动态对齐,避免了复杂的解耦参数化和优化。在 CIFAR-10、FFHQ、AFHQv2、ImageNet64、LSUN-Bedroom、Stable Diffusion 和 FLUX.1-dev 上的广泛实验表明,DyWeight 在显著减少函数评估次数的情况下实现了卓越的视觉保真度和稳定性,确立了高效扩散求解器中的新最先进水平。代码可在 https://github.com/Westlake-AGI-Lab/DyWeight 获取。
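The multi-step solver family that DyWeight generalizes can be seen in a dozen lines: reuse cached historical gradients with fixed coefficients (here two-step Adams-Bashforth), which are exactly the handcrafted weights DyWeight replaces with learned, time-varying ones. A NumPy sketch on a toy ODE; the diffusion-specific calibration is not modeled:

```python
import numpy as np

def multistep_solve(f, x0, t0, t1, n_steps, weights=(1.5, -0.5)):
    """Fixed-coefficient multi-step ODE solver (two-step Adams-Bashforth
    when weights=(3/2, -1/2)); DyWeight's idea is to *learn* such
    time-varying weights instead of using handcrafted constants."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    hist = [f(t, x)]                        # cache of historical gradients
    for _ in range(n_steps):
        if len(hist) < len(weights):        # bootstrap with an Euler step
            x = x + h * hist[-1]
        else:                               # newest gradient gets weights[0]
            x = x + h * sum(w * g for w, g in zip(weights, reversed(hist)))
        t += h
        hist = (hist + [f(t, x)])[-len(weights):]
    return x

# dx/dt = -x, x(0) = 1  =>  x(1) = e^{-1}
approx = multistep_solve(lambda t, x: -x, 1.0, 0.0, 1.0, n_steps=50)
```

On this ODE the two-step scheme lands far closer to `e^{-1}` than plain Euler at the same step size, since each update reuses an already-computed gradient instead of a new function evaluation.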
cs.CV / 49 / 2603.11616
SemiTooth: a Generalizable Semi-supervised Framework for Multi-Source Tooth Segmentation
SemiTooth:一种可泛化的半监督多源牙齿分割框架
Abstract
With the rapid advancement of artificial intelligence, intelligent dentistry for clinical diagnosis and treatment has become increasingly promising. As the primary clinical dentistry task, tooth structure segmentation for Cone-Beam Computed Tomography (CBCT) has made significant progress in recent years. However, challenges arise from the difficulty of obtaining fully-annotated data and the acquisition variability of multi-source data across different institutions, which cause low-quality utilization, voxel-level inconsistency, and domain-specific disparity in CBCT slices. Thus, the rational and efficient utilization of multi-source and unlabeled data is a pivotal problem. In this paper, we propose SemiTooth, a generalizable semi-supervised framework for multi-source tooth segmentation. Specifically, we first compile MS3Toothset, a Multi-Source Semi-Supervised Tooth DataSet for clinical dental CBCT, which contains data from three sources with different levels of annotation. Then, we design a multi-teacher and multi-student framework, i.e., SemiTooth, which promotes semi-supervised learning for multi-source data. SemiTooth employs distinct student networks that learn from unlabeled data from different sources, each supervised by its respective teacher. Furthermore, a Stricter Weighted-Confidence Constraint is introduced for multiple teachers to improve multi-source accuracy. Extensive experiments are conducted on MS3Toothset to verify the feasibility and superiority of the SemiTooth framework, which achieves SOTA performance in the semi-supervised, multi-source tooth segmentation scenario.
Chinese Translation
随着人工智能的快速发展,智能牙科在临床诊断和治疗中的应用前景愈加广阔。作为主要的临床牙科任务,锥形束计算机断层扫描(Cone-Beam Computed Tomography, CBCT)中的牙齿结构分割近年来取得了显著进展。然而,由于全标注数据获取困难,以及不同机构之间多源数据的获取变异性,导致了CBCT切片的低质量利用、体素级不一致性和领域特定差异。因此,合理高效地利用多源和未标注数据成为一个关键问题。本文提出了SemiTooth,一种可泛化的半监督多源牙齿分割框架。具体而言,我们首先编制了MS3Toothset,即用于临床牙科CBCT的多源半监督牙齿数据集,其中包含来自三个不同来源的数据,具有不同级别的标注。然后,我们设计了一个多教师和多学生框架,即SemiTooth,促进多源数据的半监督学习。SemiTooth采用不同的学生网络,从不同来源的未标注数据中学习,并由各自的教师进行监督。此外,引入了更严格的加权置信约束,以提高多源准确性。我们在MS3Toothset上进行了广泛的实验,以验证SemiTooth框架的可行性和优越性,在半监督和多源牙齿分割场景中实现了最先进的性能。
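Multi-teacher supervision with a weighted-confidence constraint can be sketched as follows: fuse per-teacher probability maps with normalized weights, then keep only voxels whose fused confidence clears a strict threshold. Names and the threshold value are illustrative, not SemiTooth's exact formulation:

```python
import numpy as np

def fused_pseudo_labels(teacher_probs, teacher_weights, tau=0.9):
    """Fuse voxel-wise class probabilities from several teachers with
    per-teacher weights, then keep only voxels whose fused confidence
    clears a strict threshold tau (others are masked out of the loss)."""
    w = np.asarray(teacher_weights, dtype=float)
    w = w / w.sum()
    fused = np.einsum('t,tnc->nc', w, np.asarray(teacher_probs))
    conf = fused.max(axis=1)
    labels = fused.argmax(axis=1)
    mask = conf >= tau                 # stricter tau -> fewer, cleaner labels
    return labels, mask

# two teachers, three voxels, two classes
p1 = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]])
p2 = np.array([[0.99, 0.01], [0.55, 0.45], [0.20, 0.80]])
labels, mask = fused_pseudo_labels([p1, p2], teacher_weights=[1.0, 1.0])
```

Raising `tau` trades pseudo-label coverage for purity, which is the "stricter" part of the constraint.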
cs.CV / 50 / 2603.11617
Noise-aware few-shot learning through bi-directional multi-view prompt alignment
通过双向多视角提示对齐的噪声感知少样本学习
Abstract
Vision-language models offer strong few-shot capability through prompt tuning but remain vulnerable to noisy labels, which can corrupt prompts and degrade cross-modal alignment. Existing approaches struggle because they often lack the ability to model fine-grained semantic cues and to adaptively separate clean from noisy signals. To address these challenges, we propose NA-MVP, a framework for Noise-Aware few-shot learning through bi-directional Multi-View Prompt alignment. NA-MVP is built upon a key conceptual shift: robust prompt learning requires moving from global matching to region-aware alignment that explicitly distinguishes clean cues from noisy ones. To realize this, NA-MVP employs (1) multi-view prompts combined with unbalanced optimal transport to achieve fine-grained patch-to-prompt correspondence while suppressing unreliable regions; (2) a bi-directional prompt design that captures complementary clean-oriented and noise-aware cues, enabling the model to focus on stable semantics; and (3) an alignment-guided selective refinement strategy that uses optimal transport to correct only mislabeled samples while retaining reliable data. Experiments on synthetic and real-world noisy benchmarks demonstrate that NA-MVP consistently outperforms state-of-the-art baselines, confirming its effectiveness in enabling robust few-shot learning under noisy supervision.
Chinese Translation
视觉-语言模型通过提示调优提供强大的少样本能力,但仍然容易受到噪声标签的影响,这可能会破坏提示并降低跨模态对齐的效果。现有的方法往往缺乏建模细粒度语义线索的能力,并且难以自适应地将干净信号与噪声信号分开。为了解决这些挑战,我们提出了NA-MVP,一个通过双向多视角提示对齐进行噪声感知少样本学习的框架。NA-MVP建立在一个关键的概念转变之上:稳健的提示学习需要从全局匹配转向区域感知对齐,明确区分干净线索和噪声线索。为实现这一目标,NA-MVP采用了(1)结合不平衡最优传输的多视角提示,以实现细粒度的补丁与提示对应,同时抑制不可靠区域;(2)双向提示设计,捕捉互补的干净导向和噪声感知线索,使模型能够专注于稳定的语义;(3)一种对齐引导的选择性细化策略,利用最优传输仅纠正错误标记的样本,同时保留可靠数据。在合成和真实世界的噪声基准测试中的实验表明,NA-MVP始终优于最先进的基线,确认其在噪声监督下实现稳健少样本学习的有效性。
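The optimal-transport machinery behind NA-MVP's patch-to-prompt alignment can be illustrated with a plain entropic (Sinkhorn) solver. This balanced version is a simplification: the paper uses unbalanced OT, which relaxes the marginal constraints so unreliable patches can shed mass; the cost matrix below is toy data:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Entropic optimal transport between histograms a (patches) and b
    (prompts) under a cost matrix, via alternating marginal projections.
    Balanced variant only; unbalanced OT would soften the constraints."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]       # transport plan

# 3 patches vs 2 prompts: patches 0,1 resemble prompt 0; patch 2 prompt 1
cost = np.array([[0.1, 1.0],
                 [0.2, 1.1],
                 [0.9, 0.1]])
a = np.full(3, 1 / 3)    # patch mass
b = np.full(2, 1 / 2)    # prompt mass
P = sinkhorn(cost, a, b)
```

The returned plan's rows tell each patch how to distribute its mass over prompts; a row concentrated on one prompt is a confident fine-grained correspondence, and a smaller `eps` sharpens the plan.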
cs.CV / 51 / 2603.11618
Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild
Shape-of-You:用于野外语义对应的融合Gromov-Wasserstein最优传输
Abstract
Semantic correspondence is essential for handling diverse in-the-wild images lacking explicit correspondence annotations. While recent 2D foundation models offer powerful features, adapting them for unsupervised learning via nearest-neighbor pseudo-labels has key limitations: it operates locally, ignoring structural relationships, and consequently its reliance on 2D appearance fails to resolve geometric ambiguities arising from symmetries or repetitive features. In this work, we address this by reformulating pseudo-label generation as a Fused Gromov-Wasserstein (FGW) problem, which jointly optimizes inter-feature similarity and intra-structural consistency. Our framework, Shape-of-You (SoY), leverages a 3D foundation model to define this intra-structure in the geometric space, resolving abovementioned ambiguity. However, since FGW is a computationally prohibitive quadratic problem, we approximate it through anchor-based linearization. The resulting probabilistic transport plan provides a structurally consistent but noisy supervisory signal. Thus, we introduce a soft-target loss dynamically blending guidance from this plan with network predictions to build a learning framework robust to this noise. SoY achieves state-of-the-art performance on SPair-71k and AP-10k datasets, establishing a new benchmark in semantic correspondence without explicit geometric annotations. Code is available at Shape-of-You.
Chinese Translation
语义对应对于处理缺乏明确对应注释的多样化野外图像至关重要。尽管最近的2D基础模型提供了强大的特征,但通过最近邻伪标签将其适应于无监督学习存在关键局限性:它在局部操作,忽视了结构关系,因此其对2D外观的依赖未能解决因对称性或重复特征而产生的几何模糊性。在本研究中,我们通过将伪标签生成重新表述为融合Gromov-Wasserstein(FGW)问题来解决这一问题,该问题联合优化特征间相似性和结构内一致性。我们的框架Shape-of-You(SoY)利用3D基础模型在几何空间中定义这一结构内关系,从而解决上述模糊性。然而,由于FGW是一个计算上开销巨大的二次问题,我们通过基于锚点的线性化对其进行近似。生成的概率传输计划提供了一个结构上一致但噪声较大的监督信号。因此,我们引入了一种软目标损失,动态地将来自该计划的指导与网络预测相结合,以构建一个对噪声具有鲁棒性的学习框架。SoY在SPair-71k和AP-10k数据集上实现了最先进的性能,建立了一个在没有明确几何注释的情况下的语义对应新基准。代码可在Shape-of-You获取。
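The FGW objective above fuses an appearance cost with a structural cost. As a rough intuition only (this is entropic Sinkhorn on a linearly fused cost matrix, not the paper's quadratic FGW or its anchor-based linearization), the sketch below shows how a fused cost yields a transport plan; the matrices, `alpha`, and the regularization strength are all hypothetical toy values.

```python
import math

def sinkhorn(C, reg=0.1, iters=200):
    """Entropic OT with uniform marginals on an n x m cost matrix C."""
    n, m = len(C), len(C[0])
    K = [[math.exp(-c / reg) for c in row] for row in C]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical fused cost: alpha * feature distance + (1 - alpha) * structure distance.
feat_cost = [[0.1, 0.9], [0.9, 0.1]]   # 2D appearance dissimilarity
geo_cost  = [[0.2, 0.8], [0.8, 0.2]]   # 3D structural dissimilarity
alpha = 0.5
fused = [[alpha * feat_cost[i][j] + (1 - alpha) * geo_cost[i][j]
          for j in range(2)] for i in range(2)]

plan = sinkhorn(fused)
# The transport plan concentrates mass on the low-cost (matching) pairs,
# which is what makes it usable as a soft pseudo-label.
print(plan[0][0] > plan[0][1])  # True
```

In the paper's setting the plan would then serve as the soft target that is blended with network predictions.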
cs.CV / 52 / 2603.11625
MedPruner: Training-Free Hierarchical Token Pruning for Efficient 3D Medical Image Understanding in Vision-Language Models
MedPruner:用于视觉-语言模型中高效3D医学图像理解的无训练分层标记剪枝框架
Abstract
While specialized Medical Vision-Language Models (VLMs) have achieved remarkable success in interpreting 2D and 3D medical modalities, their deployment for 3D volumetric data remains constrained by significant computational inefficiencies. Current architectures typically suffer from massive anatomical redundancy due to the direct concatenation of consecutive 2D slices and lack the flexibility to handle heterogeneous information densities across different slices using fixed pruning ratios. To address these challenges, we propose MedPruner, a training-free and model-agnostic hierarchical token pruning framework specifically designed for efficient 3D medical image understanding. MedPruner introduces a two-stage mechanism: an Inter-slice Anchor-based Filtering module to eliminate slice-level temporal redundancy, followed by a Dynamic Information Nucleus Selection strategy that achieves adaptive token-level compression by quantifying cumulative attention weights. Extensive experiments on three 3D medical benchmarks and across three diverse medical VLMs reveal massive token redundancy in existing architectures. Notably, MedPruner enables models such as MedGemma to maintain or even exceed their original performance while retaining fewer than 5% of visual tokens, thereby drastically reducing computational overhead and validating the necessity of dynamic token selection for practical clinical deployment. Our code will be released.
Chinese Translation
尽管专门的医学视觉-语言模型(VLMs)在解释2D和3D医学模态方面取得了显著成功,但其在3D体积数据的应用仍受到显著计算效率低下的限制。目前的架构通常由于直接连接连续的2D切片而导致巨大的解剖冗余,并且缺乏灵活性,无法使用固定的剪枝比率处理不同切片之间异构信息密度。为了解决这些挑战,我们提出了MedPruner,一个无训练且与模型无关的分层标记剪枝框架,专门设计用于高效的3D医学图像理解。MedPruner引入了一个两阶段机制:一个基于切片间锚点的过滤模块,用于消除切片级的时间冗余,随后是一个动态信息核心选择策略,通过量化累积注意力权重实现自适应的标记级压缩。在三个3D医学基准和三个不同的医学VLM上进行的大量实验揭示了现有架构中的巨量标记冗余。值得注意的是,MedPruner使得像MedGemma这样的模型能够在保留不到5%的视觉标记的同时维持或甚至超过其原始性能,从而大幅降低计算开销,并验证了动态标记选择在实际临床部署中的必要性。我们的代码将会发布。
cs.CV / 53 / 2603.11627
Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography
从3D全身正电子发射断层扫描中开发通用分割的基础模型
Abstract
Positron emission tomography (PET) is a key nuclear medicine imaging modality that visualizes radiotracer distributions to quantify in vivo physiological and metabolic processes, playing an irreplaceable role in disease management. Despite its clinical importance, the development of deep learning models for quantitative PET image analysis remains severely limited, driven by both the inherent segmentation challenge from PET's paucity of anatomical contrast and the high costs of data acquisition and annotation. To bridge this gap, we develop generalist foundational models for universal segmentation from 3D whole-body PET imaging. We first build the largest and most comprehensive PET dataset to date, comprising 11041 3D whole-body PET scans with 59831 segmentation masks for model development. Based on this dataset, we present SegAnyPET, an innovative foundational model with general-purpose applicability to diverse segmentation tasks. Built on a 3D architecture with a prompt engineering strategy for mask generation, SegAnyPET enables universal and scalable organ and lesion segmentation, supports efficient human correction with minimal effort, and enables a clinical human-in-the-loop workflow. Extensive evaluations on multi-center, multi-tracer, multi-disease datasets demonstrate that SegAnyPET achieves strong zero-shot performance across a wide range of segmentation tasks, highlighting its potential to advance the clinical applications of molecular imaging.
Chinese Translation
正电子发射断层扫描(PET)是一种关键的核医学成像技术,通过可视化放射性示踪剂的分布来量化体内生理和代谢过程,在疾病管理中发挥着不可替代的作用。尽管其在临床上的重要性,基于深度学习的定量PET图像分析模型的开发仍然受到严重限制,这既源于PET在解剖对比度方面的固有分割挑战,也与数据获取和标注的高成本有关。为了解决这一问题,我们开发了用于3D全身PET成像的通用分割的通用基础模型。我们首先构建了迄今为止最大和最全面的PET数据集,包含11041个3D全身PET扫描和59831个分割掩膜以供模型开发。基于该数据集,我们提出了SegAnyPET,这是一种具有通用适用性的创新基础模型,适用于多种分割任务。SegAnyPET基于3D架构,并采用提示工程策略进行掩膜生成,实现了通用和可扩展的器官和病灶分割,支持高效的人为校正且只需最小努力,并实现了临床人机协作工作流程。在多中心、多示踪剂和多疾病数据集上的广泛评估表明,SegAnyPET在广泛的分割任务中实现了强大的零样本性能,突显了其在推进分子成像临床应用方面的潜力。
cs.CV / 54 / 2603.11633
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
MV-SAM3D:适应性多视角融合用于布局感知的三维生成
Abstract
Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
Chinese Translation
近年来,统一的三维生成模型在从单幅图像生成高质量三维资产方面取得了显著进展。值得注意的是,像SAM3D这样的布局感知方法能够在保留空间排列的同时重建多个物体,为实际场景级三维生成开辟了新的可能性。然而,当前的方法仅限于单视角输入,无法利用互补的多视角观测,而独立估计的物体姿态往往导致物理上不合理的布局,如物体间穿透和漂浮伪影。我们提出了MV-SAM3D,这是一种无训练框架,扩展了布局感知的三维生成,结合了多视角一致性和物理合理性。我们将多视角融合形式化为三维潜在空间中的多扩散过程,并提出了两种自适应加权策略——注意力熵加权和可见性加权——使得融合过程具有置信度感知,确保每个视角根据其局部观测可靠性做出贡献。对于多物体组合,我们引入了物理感知优化,在生成过程中及之后注入碰撞和接触约束,从而产生物理上合理的物体排列。在标准基准和真实世界多物体场景上的实验表明,重建保真度和布局合理性显著提高,且无需任何额外训练。代码可在 https://github.com/devinli123/MV-SAM3D 获取。
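The attention-entropy weighting idea (confident, low-entropy views contribute more to the fused latent) can be sketched as a softmax over negative attention entropies. The latents, attention maps, and temperature below are hypothetical toy values, not the paper's exact scheme, which also folds in visibility weighting.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse(latents, attn_maps, temp=1.0):
    """Confidence-aware fusion: lower attention entropy -> larger weight."""
    ents = [entropy(a) for a in attn_maps]
    logits = [-e / temp for e in ents]
    mx = max(logits)
    ws = [math.exp(l - mx) for l in logits]
    z = sum(ws)
    ws = [w / z for w in ws]
    fused = [sum(w * lat[i] for w, lat in zip(ws, latents))
             for i in range(len(latents[0]))]
    return fused, ws

# Two views of the same latent cell; view 0 has a peaky (confident) attention map.
latents = [[1.0, 2.0], [3.0, 4.0]]
attn = [[0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]
fused, ws = fuse(latents, attn)
print(ws[0] > ws[1])  # True: the confident view dominates the fusion
```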
cs.CV / 55 / 2603.11640
Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans
分词技术使多模态大语言模型能够理解、生成和编辑建筑平面图
Abstract
Architectural floor plan design demands joint reasoning over geometry, semantics, and spatial hierarchy, which remains a major challenge for current AI systems. Although recent diffusion and language models improve visual fidelity, they still struggle with coherent spatial reasoning and controllable generation. We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. We introduce discrete room-instance tokens to construct a unified vocabulary that bridges layouts and symbolic reasoning. With multimodal alignment and instruction tuning, the model synthesizes coherent, controllable layouts from text instructions. Experiments show how the framework achieves superior geometric validity and controllability while remaining efficient and locally deployable.
Chinese Translation
建筑平面设计需要对几何、语义和空间层次进行联合推理,这对当前的人工智能系统仍然是一个重大挑战。尽管最近的扩散模型和语言模型提高了视觉保真度,但它们在连贯的空间推理和可控生成方面仍然存在困难。我们提出了HouseMind,一个将平面图理解、生成和编辑统一于一个框架的多模态大语言模型。我们引入了离散房间实例标记,以构建一个统一的词汇,连接布局与符号推理。通过多模态对齐和指令调优,该模型能够从文本指令合成连贯且可控的布局。实验表明,该框架在保持高效和本地可部署的同时,实现了卓越的几何有效性和可控性。
cs.CV / 56 / 2603.11644
IDRL: An Individual-Aware Multimodal Depression-Related Representation Learning Framework for Depression Diagnosis
IDRL:一种用于抑郁诊断的个体感知多模态抑郁相关表示学习框架
Abstract
Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.
Chinese Translation
抑郁症是一种严重的心理障碍,可靠的识别在早期干预和治疗中发挥着关键作用。多模态抑郁检测旨在通过联合建模来自多种模态的互补信息来提高诊断性能。近年来,许多多模态学习方法已被提出用于抑郁分析;然而,这些方法存在以下局限性:1)模态间不一致性和与抑郁无关的干扰,其中抑郁相关线索在不同模态中可能存在冲突,而大量无关内容则掩盖了关键的抑郁信号;2)个体抑郁表现的多样性,导致模态和线索重要性上的个体差异,妨碍了可靠的融合。为了解决这些问题,我们提出了个体感知的多模态抑郁相关表示学习框架(IDRL),以实现稳健的抑郁诊断。具体而言,IDRL 1)将多模态表示解耦为模态共通的抑郁空间、模态特定的抑郁空间和与抑郁无关的空间,以增强模态对齐,同时抑制无关信息;2)引入个体感知的模态融合模块(IAF),根据解耦的抑郁相关特征的预测重要性动态调整其权重,从而实现对不同个体的自适应跨模态融合。大量实验表明,IDRL在多模态抑郁检测中实现了优越且稳健的性能。
cs.CV / 57 / 2603.11659
FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation
FL-MedSegBench:一个针对医学图像分割的联邦学习综合基准
Abstract
Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (\textit{e.g.}, FedBN), consistently outperform generic approaches; (ii) No single method universally dominates, with performance being dataset-dependent; (iii) Communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) Fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) A method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.
Chinese Translation
联邦学习(FL)提供了一种保护隐私的协作医学图像分析范式,无需共享原始数据。然而,缺乏医学图像分割的标准化基准限制了对FL方法的公平和全面评估。为了解决这一问题,我们提出了FL-MedSegBench,这是第一个针对医学图像分割的联邦学习综合基准。我们的基准涵盖了十种成像模式下的九个分割任务,涵盖了2D和3D格式,并具有真实的临床异质性。我们系统地评估了八种通用FL(gFL)和五种个性化FL(pFL)方法,从多个维度进行比较:分割准确性、公平性、通信效率、收敛行为以及对未见领域的泛化能力。大量实验揭示了几个关键见解:(i)pFL方法,特别是那些具有客户端特定批归一化(例如,FedBN)的方法,始终优于通用方法;(ii)没有单一方法在所有情况下都占据绝对优势,性能依赖于数据集;(iii)通信频率分析表明,基于归一化的个性化方法在减少通信频率时表现出显著的鲁棒性;(iv)公平性评估识别出如Ditto和FedRDN等方法,这些方法能够保护表现不佳的客户端;(v)方法对未见领域的泛化能力与其在参与客户端中的表现密切相关。我们将发布一个开源工具包,以促进可重复研究并加速临床适用的FL解决方案,为实际临床部署提供基于实证的指导。源代码可在https://github.com/meiluzhu/FL-MedSegBench获取。
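FedBN, the benchmark's consistently strong personalization baseline, keeps BatchNorm parameters local to each client and federates everything else. A minimal sketch of that aggregation rule with scalars standing in for weight tensors; the `"bn"` name filter is an assumption about parameter naming conventions.

```python
def fedbn_aggregate(client_params, skip_substr="bn"):
    """FedAvg over all parameters except client-specific BatchNorm entries."""
    names = client_params[0].keys()
    shared = {}
    for name in names:
        if skip_substr in name:
            continue  # BN stats/affine stay local to each client
        vals = [c[name] for c in client_params]
        shared[name] = sum(vals) / len(vals)
    return shared

# Two hypothetical clients with one conv weight and one BN scale (scalars for brevity).
clients = [{"conv.weight": 1.0, "bn.weight": 0.5},
           {"conv.weight": 3.0, "bn.weight": 1.5}]
print(fedbn_aggregate(clients))  # → {'conv.weight': 2.0}
```

Keeping normalization statistics client-local is what lets each site absorb its own intensity distribution, consistent with the benchmark's finding (i).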
cs.CV / 58 / 2603.11664
BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder
BackdoorIDS:针对预训练视觉编码器的零样本后门检测
Abstract
Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
Chinese Translation
自监督和多模态视觉编码器学习强大的视觉表征,广泛应用于下游视觉任务和大型视觉语言模型(LVLMs)。然而,下游用户通常依赖于来源不明的第三方预训练编码器,这使他们面临后门攻击的风险。在本研究中,我们提出了BackdoorIDS,这是一种简单而有效的零样本推理时后门样本检测方法,专为预训练视觉编码器设计。BackdoorIDS的提出基于两个观察:注意力劫持和恢复。在逐步输入掩蔽的过程中,带有后门的图像最初将注意力集中在恶意触发特征上。一旦掩蔽比例超过触发器的鲁棒性阈值,触发器将被停用,注意力迅速转向良性内容。这一转变在图像嵌入中引发了显著变化,而干净图像的嵌入则在掩蔽进程中更为平滑地演变。BackdoorIDS通过沿掩蔽轨迹提取嵌入序列并应用基于密度的聚类方法(如DBSCAN)来实现这一信号。如果输入的嵌入序列形成多个聚类,则该输入被标记为带有后门。大量实验表明,BackdoorIDS在不同攻击类型、数据集和模型家族中始终优于现有防御方法。值得注意的是,它是一种即插即用的方法,无需重新训练,并在推理时完全以零样本方式运行,使其与包括卷积神经网络(CNNs)、视觉变换器(ViTs)、CLIP和LLaVA-1.5在内的广泛编码器架构兼容。
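The detection rule above (flag an input whose masking-trajectory embeddings split into more than one cluster) can be sketched with a simple distance-threshold clustering standing in for DBSCAN. The embedding trajectories and `eps` below are hypothetical 2D toys; the real method clusters high-dimensional encoder embeddings.

```python
def count_clusters(embs, eps=0.5):
    """Single-linkage clustering with distance threshold eps (DBSCAN stand-in)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    labels = [-1] * len(embs)
    k = 0
    for i in range(len(embs)):
        if labels[i] != -1:
            continue
        labels[i] = k
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in range(len(embs)):
                if labels[j] == -1 and dist(embs[cur], embs[j]) <= eps:
                    labels[j] = k
                    stack.append(j)
        k += 1
    return k

# Hypothetical embedding trajectories under progressive masking.
clean    = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.3, 0.1]]   # smooth drift
backdoor = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.1]]   # abrupt jump at trigger deactivation
print(count_clusters(clean), count_clusters(backdoor))  # → 1 2
```

More than one cluster means the embedding jumped when the trigger was masked out, so the input is flagged.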
cs.CV / 59 / 2603.11675
PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On
PROMO:高效高保真虚拟试穿的可提示装备
Abstract
Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.
Chinese Translation
虚拟试穿(VTON)已成为在线零售的核心能力,现实的试穿结果提供了可靠的合身指导,减少了退货,并使消费者和商家都受益。基于扩散的 VTON 方法实现了照片级真实合成,但通常依赖于复杂的架构,如辅助参考网络,并且采样速度较慢,使得保真度与效率之间的权衡成为一个持续的挑战。我们将 VTON 视为一个结构化的图像编辑问题,要求在三个关键要求下进行强条件生成:主体保留、真实纹理转移和无缝和谐。在这一视角下,我们的训练框架是通用的,并可转移到更广泛的图像编辑任务。此外,VTON 产生的配对数据构成了训练通用编辑器的丰富监督资源。我们提出了 PROMO,一个基于流匹配 DiT(Flow Matching DiT)主干的可提示虚拟试穿框架,采用潜在多模态条件连接。通过利用条件效率和自我参考机制,我们的方法显著降低了推理开销。在标准基准测试中,PROMO 在视觉保真度上超越了之前的 VTON 方法和通用图像编辑模型,同时在质量和速度之间提供了竞争性的平衡。这些结果表明,流匹配变换器结合潜在多模态条件和自我参考加速,为高质量虚拟试穿提供了有效且训练高效的解决方案。
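A Flow Matching backbone is trained to regress a velocity field along an interpolation between noise and data. A minimal sketch of the standard rectified-flow training pair (linear interpolant, constant velocity target); real latents, the DiT model, and the multi-modal conditioning are omitted, and the 4-dimensional vectors are hypothetical.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Linear interpolant x_t and its constant velocity target (rectified-flow form)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v  = [b - a for a, b in zip(x0, x1)]
    return xt, v

random.seed(0)
noise  = [random.gauss(0, 1) for _ in range(4)]   # x0 ~ N(0, I)
target = [0.5, -1.0, 2.0, 0.0]                    # x1: a data latent (hypothetical)
t = 0.25
xt, v = flow_matching_pair(noise, target, t)
# A model v_theta(xt, t, conditioning) would be trained to regress v with an L2 loss;
# at inference, integrating the learned field transports noise to a try-on latent.
```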
cs.CV / 60 / 2603.11680
UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution
UCAN:面向轻量级超分辨率中扩展感受野的统一卷积注意力网络
Abstract
Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.
Chinese Translation
混合CNN-Transformer架构在图像超分辨率方面取得了良好的效果,但扩展注意力窗口或卷积核会显著增加计算成本,从而限制了在资源受限设备上的部署。我们提出了UCAN,一种轻量级网络,统一了卷积和注意力,以高效地扩展有效感受野。UCAN结合了基于窗口的空间注意力和刺猬注意力机制,以建模局部纹理和长程依赖,并引入了一种基于蒸馏的大核模块,以在不增加大量计算的情况下保留高频结构。此外,我们采用跨层参数共享以进一步降低复杂性。在Manga109($4\times$)上,UCAN-L以仅48.4G MACs达到了31.63 dB PSNR,超越了最近的轻量级模型。在BSDS100上,UCAN达到了27.79 dB,优于显著更大模型的方法。大量实验表明,UCAN在准确性、效率和可扩展性之间实现了优越的权衡,使其非常适合实际高分辨率图像恢复。
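The numbers reported above are PSNR in dB, computed from the mean squared error against the ground-truth image. A toy sketch of the standard formula on a hypothetical 4-pixel signal; real SR benchmarks typically also crop borders and evaluate on the Y channel, which is omitted here.

```python
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy 4-pixel example with small reconstruction errors.
print(round(psnr([100, 120, 140, 160], [101, 119, 142, 158]), 2))  # → 44.15
```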
cs.CV / 61 / 2603.11695
PolyCrysDiff: Controllable Generation of Three-Dimensional Computable Polycrystalline Material Structures
PolyCrysDiff:可控生成三维可计算多晶材料结构
Abstract
The three-dimensional (3D) microstructures of polycrystalline materials exert a critical influence on their mechanical and physical properties. Realistic, controllable construction of these microstructures is a key step toward elucidating structure-property relationships, yet remains a formidable challenge. Herein, we propose PolyCrysDiff, a framework based on conditional latent diffusion that enables the end-to-end generation of computable 3D polycrystalline microstructures. Comprehensive qualitative and quantitative evaluations demonstrate that PolyCrysDiff faithfully reproduces target grain morphologies, orientation distributions, and 3D spatial correlations, while achieving an $R^2$ over 0.972 on grain attributes (e.g., size and sphericity) control, thereby outperforming mainstream approaches such as Markov random field (MRF)- and convolutional neural network (CNN)-based methods. The computability and physical validity of the generated microstructures are verified through a series of crystal plasticity finite element method (CPFEM) simulations. Leveraging PolyCrysDiff's controllable generative capability, we systematically elucidate how grain-level microstructural characteristics affect the mechanical properties of polycrystalline materials. This development is expected to pave a key step toward accelerated, data-driven optimization and design of polycrystalline materials.
Chinese Translation
多晶材料的三维(3D)微观结构对其机械和物理性质具有重要影响。现实且可控地构建这些微观结构是阐明结构-性质关系的关键步骤,但仍然是一个巨大的挑战。在此,我们提出了PolyCrysDiff,一个基于条件潜在扩散的框架,能够实现可计算的3D多晶微观结构的端到端生成。全面的定性和定量评估表明,PolyCrysDiff忠实地再现了目标晶粒形态、取向分布和3D空间相关性,同时在晶粒属性(如尺寸和球形度)控制上实现了超过0.972的$R^2$,因此优于主流方法,如马尔可夫随机场(MRF)和卷积神经网络(CNN)方法。通过一系列晶体塑性有限元法(CPFEM)模拟验证了生成微观结构的可计算性和物理有效性。利用PolyCrysDiff的可控生成能力,我们系统地阐明了晶粒级微观结构特征如何影响多晶材料的机械性质。这一发展预计将为加速、数据驱动的多晶材料优化和设计铺平关键步骤。
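The figure quoted for grain-attribute control ($R^2 > 0.972$) is the ordinary coefficient of determination, $R^2 = 1 - \mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$. A quick sketch on hypothetical target vs. generated grain sizes:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

# Hypothetical target vs. generated grain sizes (arbitrary units).
target    = [10.0, 20.0, 30.0, 40.0]
generated = [11.0, 19.0, 31.0, 39.0]
print(round(r_squared(target, generated), 3))  # → 0.992
```

Values near 1 mean the generated attributes track the conditioning targets almost exactly.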
cs.CV / 62 / 2603.11698
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
OSCBench:文本到视频生成中的对象状态变化基准测试
Abstract
Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
Chinese Translation
文本到视频(T2V)生成模型在生成视觉质量高且时间一致性强的视频方面取得了快速进展。然而,现有的基准测试主要集中在感知质量、文本与视频的对齐或物理合理性上,导致一个关键的动作理解方面——文本提示中明确指定的对象状态变化(OSC)——在很大程度上未被探索。OSC指的是由动作引起的对象状态的转变,例如剥土豆皮或切柠檬。在本文中,我们介绍了OSCBench,一个专门设计用于评估T2V模型中OSC性能的基准。OSCBench基于指导性烹饪数据构建,系统地将动作与对象的交互组织成常规、新颖和组合场景,以探测模型在分布内性能和泛化能力。我们使用人类用户研究和基于多模态大型语言模型(MLLM)的自动评估,对六个代表性的开源和专有T2V模型进行了评估。我们的结果表明,尽管在语义和场景对齐方面表现强劲,当前的T2V模型在准确和时间一致的对象状态变化方面仍然存在持续的困难,尤其是在新颖和组合设置中。这些发现将OSC定位为文本到视频生成中的一个关键瓶颈,并确立了OSCBench作为推进状态感知视频生成模型的诊断基准。
cs.CV / 63 / 2603.11717
COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection
COTONET:基于YOLO11的用于棉铃生长阶段检测的定制棉花检测算法
Abstract
Cotton harvesting is a critical phase where cotton capsules are physically manipulated and can lead to fibre degradation. To maintain the highest quality, harvesting methods must emulate delicate manual grasping to preserve cotton's intrinsic properties. Automating this process requires systems capable of recognising cotton capsules across various phenological stages. To address this challenge, we propose COTONET, an enhanced custom YOLO11 model tailored with attention mechanisms to improve the detection of difficult instances. The architecture incorporates gradients in non-learnable operations to enhance shape and feature extraction. Key architectural modifications include: the replacement of convolutional blocks with Squeeze-and-Excitation blocks, a redesigned backbone integrating attention mechanisms, and the replacement of standard upsampling operations with Content Aware Reassembly of Features (CARAFE). Additionally, we integrate Simple Attention Modules (SimAM) for primary feature aggregation and Parallel Hybrid Attention Mechanisms (PHAM) for channel-wise, spatial-wise and coordinate-wise attention in the downward neck path. This configuration offers increased flexibility and robustness for interpreting the complexity of cotton crop growth. COTONET aligns with small-to-medium YOLO models utilizing 7.6M parameters and 27.8 GFLOPS, making it suitable for low-resource edge computing and mobile robotics. COTONET outperforms the standard YOLO baselines, achieving a mAP50 of 81.1% and a mAP50-95 of 60.6%.
Chinese Translation
棉花收获是一个关键阶段,在此阶段棉花荚被物理操作,可能导致纤维降解。为了保持最高质量,收获方法必须模拟细致的手工抓取,以保护棉花的内在特性。自动化这一过程需要能够识别不同生长阶段棉花荚的系统。为了解决这一挑战,我们提出了COTONET,一种增强的定制YOLO11模型,采用注意力机制以改善对困难实例的检测。该架构在不可学习的操作中融入梯度,以增强形状和特征提取。关键的架构修改包括:用Squeeze-and-Excitation块替换卷积块,重新设计的主干网络集成注意力机制,以及用内容感知特征重组(Content Aware Reassembly of Features, CARAFE)替代标准上采样操作。此外,我们集成了简单注意力模块(Simple Attention Modules, SimAM)用于主要特征聚合,以及并行混合注意力机制(Parallel Hybrid Attention Mechanisms, PHAM)用于在向下颈部路径中的通道、空间和坐标注意力。这种配置为解释棉花作物生长的复杂性提供了更大的灵活性和鲁棒性。COTONET与小到中型YOLO模型相一致,使用7.6M参数和27.8 GFLOPS,使其适合低资源边缘计算和移动机器人。COTONET在标准YOLO基线中表现优越,达到mAP50为81.1%和mAP50-95为60.6%。
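SimAM, one of the modules integrated above, is parameter-free: each neuron's attention weight comes from a closed-form energy followed by a sigmoid gate. The sketch follows the published SimAM pseudo-code on a flattened single-channel map; `lam` is the usual small regularizer, and the input values are hypothetical.

```python
import math

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention over a flattened channel map.

    Per the paper's pseudo-code: inverse energy per neuron, sigmoid gating.
    """
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / (len(x) - 1)  # unbiased variance
    e_inv = [(v - mu) ** 2 / (4 * (var + lam)) + 0.5 for v in x]
    return [v / (1.0 + math.exp(-e)) for v, e in zip(x, e_inv)]

# Activations far from the channel mean get the highest energy,
# hence the largest attention weights.
out = simam([0.0, 0.1, 0.0, 2.0])
print(out[3] > out[1])  # True: the outlier neuron is emphasized
```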
cs.CV / 64 / 2603.11725
Cross-Resolution Attention Network for High-Resolution PM2.5 Prediction
用于高分辨率PM2.5预测的跨分辨率注意力网络
Abstract
Vision Transformers have achieved remarkable success in spatio-temporal prediction, but their scalability remains limited for ultra-high-resolution, continent-scale domains required in real-world environmental monitoring. A single European air-quality map at 1 km resolution comprises 29 million pixels, far beyond the limits of naive self-attention. We introduce CRAN-PM, a dual-branch Vision Transformer that leverages cross-resolution attention to efficiently fuse global meteorological data (25 km) with local high-resolution PM2.5 at the current time (1 km). Instead of including physically driven factors like temperature and topography as input, we further introduce elevation-aware self-attention and wind-guided cross-attention to force the network to learn physically consistent feature representations for PM2.5 forecasting. CRAN-PM is fully trainable and memory-efficient, generating the complete 29-million-pixel European map in 1.8 seconds on a single GPU. Evaluated on daily PM2.5 forecasting throughout Europe in 2022 (362 days, 2,971 European Environment Agency (EEA) stations), it reduces RMSE by 4.7% at T+1 and 10.7% at T+3 compared to the best single-scale baseline, while reducing bias in complex terrain by 36%.
Chinese Translation
视觉变换器在时空预测中取得了显著成功,但其可扩展性在实际环境监测所需的超高分辨率、洲际范围的领域中仍然受到限制。单个1公里分辨率的欧洲空气质量地图包含2900万个像素,远超简单自注意力的处理能力。我们提出了CRAN-PM,这是一种双分支视觉变换器,利用跨分辨率注意力有效融合全球气象数据(25公里)与当前时刻的局部高分辨率PM2.5(1公里)。我们没有直接将温度和地形等物理驱动因素作为输入,而是进一步引入了考虑海拔的自注意力和风引导的跨注意力,迫使网络学习物理一致的特征表示以进行PM2.5预测。CRAN-PM完全可训练且内存高效,在单个GPU上生成完整的2900万个像素的欧洲地图仅需1.8秒。在2022年对整个欧洲的每日PM2.5预测评估中(362天,2971个欧洲环境署(EEA)站点),与最佳单尺度基线相比,其T+1的均方根误差(RMSE)降低了4.7%,T+3降低了10.7%,同时在复杂地形中偏差降低了36%。
cs.CV / 65 / 2603.11734
VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On
VTEdit-Bench:虚拟试穿中多参考图像编辑模型的综合基准测试
Abstract
As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.
Chinese Translation
随着虚拟试穿(VTON)的不断发展,越来越多的现实场景出现,超出了现有专用 VTON 模型的能力。同时,通用多参考图像编辑模型快速进展,并在视觉编辑中表现出强大的泛化能力,这为更灵活的 VTON 系统提供了有希望的途径。然而,尽管它们具有强大的能力,通用编辑器在 VTON 中的优缺点仍然未得到充分探索,原因在于缺乏系统的评估基准。为了解决这一问题,我们提出了 VTEdit-Bench,这是一个综合基准,旨在评估通用多参考图像编辑模型在各种现实 VTON 场景中的表现。VTEdit-Bench 包含 24,220 对测试图像,涵盖五个具有逐渐增加复杂性的代表性 VTON 任务,使得对鲁棒性和泛化能力的系统分析成为可能。我们进一步提出了 VTEdit-QA,这是一种基于参考意识的 VLM 评估器,从模型一致性、服装一致性和整体图像质量三个关键方面评估 VTON 性能。通过这一框架,我们系统地评估了八个通用编辑模型,并将其与七个专用 VTON 模型进行了比较。结果表明,顶级通用编辑器在常规任务中具有竞争力,并在更困难的场景中表现出更稳定的泛化能力,但在复杂的参考配置,特别是多服装条件下仍面临挑战。
cs.CV / 66 / 2603.11746
SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
SoulX-LiveAct:朝着小时级实时人类动画的邻域强制与ConvKV记忆
Abstract
Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal inference. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preventing drifting throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. Numerically, LiveAct enables hour-scale real-time human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
Chinese Translation
自回归(AR)扩散模型通过将扩散建模与因果推断相结合,为视频合成等序列生成任务提供了一个有前景的框架。尽管它们支持流式生成,但现有的AR扩散方法在扩展效率方面存在困难。在本文中,我们识别出小时级实时人类动画中的两个关键挑战。首先,大多数强制策略传播样本级表示时存在不匹配的扩散状态,导致学习信号不一致和收敛不稳定。其次,历史表示无限增长且缺乏结构,阻碍了缓存状态的有效重用,严重限制了推理效率。为了解决这些挑战,我们提出了邻域强制(Neighbor Forcing),这是一种扩散步一致的AR公式,在相同噪声条件下传播时间上相邻的帧作为潜在邻居。该设计在整个AR链中防止漂移的同时,提供了与分布对齐且稳定的学习信号。在此基础上,我们引入了一种结构化的ConvKV记忆机制,将因果注意力中的键和值压缩为固定长度的表示,从而实现恒定内存推理,并在不依赖短期运动帧记忆的情况下实现真正无限的视频生成。大量实验表明,与现有的AR扩散方法相比,我们的方法显著提高了训练收敛性、小时级生成质量和推理效率。从数值上看,LiveAct实现了小时级实时人类动画,并支持在仅使用两块NVIDIA H100或H200 GPU的情况下以20 FPS进行实时流式推理。定量结果表明,我们的方法在口型同步精度、人类动画质量和情感表现力方面达到了最先进的性能,并且推理成本最低。
cs.CV / 67 / 2603.11755
Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints
通过考虑遮挡的稀疏3D手关节进行可控自我中心视频生成
Abstract
Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, as well as preventing cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structures. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
Chinese Translation
运动可控的视频生成对于虚拟现实和具身人工智能中的自我中心应用至关重要。然而,现有方法往往难以实现3D一致的细粒度手部关节动作。通过采用2D轨迹或隐式姿态,它们将3D几何信息压缩为空间模糊的信号,或过度依赖以人为中心的先验知识。在严重的自我中心遮挡下,这导致了运动不一致和虚幻伪影,并且阻碍了对机器人手的跨具身泛化。为了解决这些局限性,我们提出了一种新颖的框架,该框架从单个参考帧生成自我中心视频,利用稀疏的3D手关节作为与具身无关的控制信号,具有清晰的语义和几何结构。我们引入了一个高效的控制模块,该模块在充分保留3D信息的同时解决了遮挡模糊性。具体而言,它通过惩罚来自隐藏关节的不可靠视觉信号,从源参考帧中提取考虑遮挡的特征,并采用基于3D的加权机制在运动传播过程中稳健地处理动态遮挡的目标关节。同时,该模块直接将3D几何嵌入注入潜在空间,以严格执行结构一致性。为了促进稳健的训练和评估,我们开发了一个自动注释管道,生成超过一百万个高质量的自我中心视频片段,并配有精确的手部轨迹。此外,我们注册了类人运动学和相机数据,以构建跨具身基准。大量实验表明,我们的方法显著优于最先进的基线,生成具有现实交互的高保真自我中心视频,并展现出对机器人手的卓越跨具身泛化能力。
cs.CV / 68 / 2603.11783
HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification
HELM:基于图学习的层次化和显式标签建模用于多标签图像分类
Abstract
Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (\textit{Hierarchical and Explicit Label Modeling}), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
Chinese Translation
层次化多标签分类(HMLC)在遥感中对于建模复杂的标签依赖关系至关重要。然而,现有方法在实例属于多个分支的多路径层次结构中表现不佳,并且很少利用未标记的数据。我们提出了HELM(层次化和显式标签建模,Hierarchical and Explicit Label Modeling),一个克服这些限制的新框架。HELM:(i)在视觉变换器中使用特定于层次的类标记,以捕捉细微的标签交互;(ii)采用图卷积网络显式编码层次结构并生成层次感知的嵌入;(iii)集成自监督分支,以有效利用未标记的图像。我们在四个遥感图像(RSI)数据集(UCM、AID、DFC-15、MLRSNet)上进行了全面评估。HELM实现了最先进的性能,在监督和半监督设置中始终优于强基线,尤其在低标签场景中表现出色。
cs.CV / 69 / 2603.11793
Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder
在 CLIP 的视觉编码器中定位注意力头级别的人口统计偏差
Abstract
Standard fairness audits of foundation models quantify that a model is biased, but not where inside the network the bias resides. We propose a mechanistic fairness audit that combines projected residual-stream decomposition, zero-shot Concept Activation Vectors, and bias-augmented TextSpan analysis to locate demographic bias at the level of individual attention heads in vision transformers. As a feasibility case study, we apply this pipeline to the CLIP ViT-L-14 encoder on 42 profession classes of the FACET benchmark, auditing both gender and age bias. For gender, the pipeline identifies four terminal-layer heads whose ablation reduces global bias (Cramer's V: 0.381 -> 0.362) while marginally improving accuracy (+0.42%); a layer-matched random control confirms that this effect is specific to the identified heads. A single head in the final layer contributes to the majority of the reduction in the most stereotyped classes, and class-level analysis shows that corrected predictions shift toward the correct occupation. For age, the same pipeline identifies candidate heads, but ablation produces weaker and less consistent effects, suggesting that age bias is encoded more diffusely than gender bias in this model. These results provide preliminary evidence that head-level bias localisation is feasible for discriminative vision encoders and that the degree of localisability may vary across protected attributes. keywords: Bias . CLIP . Mechanistic Interpretability . Vision Transformer . Fairness
Chinese Translation
标准的基础模型公平性审计量化了模型的偏差,但并未指出偏差在网络中的具体位置。我们提出了一种机械性公平性审计方法,结合了投影残差流分解、零样本概念激活向量和偏差增强的文本片段分析,以在视觉变换器的单个注意力头级别定位人口统计偏差。作为可行性案例研究,我们将该流程应用于 CLIP ViT-L-14 编码器,针对 FACET 基准的 42 个职业类别进行审计,关注性别和年龄偏差。在性别方面,该流程识别出四个终层头,其消融实验减少了全局偏差(Cramer's V: 0.381 -> 0.362),同时略微提高了准确性(+0.42%);层匹配的随机对照实验确认这一效果特定于识别出的头。在最后一层中,一个头对减少最刻板类别的偏差贡献最大,类别级分析显示,修正后的预测向正确职业转移。对于年龄,该流程同样识别出候选头,但消融实验产生的效果较弱且不一致,表明在该模型中年龄偏差的编码比性别偏差更为分散。这些结果提供了初步证据,表明头级偏差定位对于区分性视觉编码器是可行的,并且可定位性程度可能在不同的受保护属性之间有所不同。关键词:偏差 . CLIP . 机械解释性 . 视觉变换器 . 公平性
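The global-bias figures quoted in this abstract use Cramér's V, which is a direct function of the class-by-attribute contingency table. A minimal stdlib implementation (assuming no empty rows or columns) for reference:

```python
import math

def cramers_v(table):
    """Cramér's V association measure for an r x c contingency table.

    table: list of rows of non-negative counts; assumes every row and
    column has a non-zero total.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))
```

A value of 0 indicates independence between predicted class and the protected attribute, and 1 indicates perfect association, so the reported drop from 0.381 to 0.362 is a modest reduction in association.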
cs.CV / 70 / 2603.11795
Intrinsic Concept Extraction Based on Compositional Interpretability
基于可组合可解释性的内在概念提取
Abstract
Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.
Chinese Translation
无监督概念提取旨在从单幅图像中提取概念;然而,现有方法在提取可组合的内在概念方面存在局限。为了解决这一问题,本文提出了一项新任务,称为可组合和可解释的内在概念提取(Compositional and Interpretable Intrinsic Concept Extraction, CI-ICE)。CI-ICE任务旨在利用基于扩散的文本到图像模型,从单幅图像中提取可组合的对象级和属性级概念,以便通过这些概念的组合重建原始概念。为实现这一目标,我们提出了一种名为HyperExpress的方法,该方法通过两个核心方面来解决CI-ICE任务。具体而言,首先,我们提出了一种概念学习方法,该方法利用双曲空间固有的层次建模能力,实现准确的概念解耦,同时保留概念之间的层次结构和关系依赖;其次,我们引入了一种基于概念的优化方法,该方法将概念嵌入空间映射,以维持复杂的概念间关系,同时确保概念的可组合性。我们的方法在从单幅图像中提取可组合可解释的内在概念方面表现出色。
cs.CV / 71 / 2603.11804
OSM-based Domain Adaptation for Remote Sensing VLMs
基于开放街图的遥感视觉语言模型领域适应
Abstract
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
Chinese Translation
适应于遥感的视觉语言模型(VLMs)在很大程度上依赖于特定领域的图像-文本监督,然而,卫星和航空影像的高质量注释仍然稀缺且生产成本高昂。现有的伪标签生成管道通过从大型前沿模型中提取知识来填补这一空白,但对大型教师模型的依赖是昂贵的,限制了可扩展性,并将可实现的性能限制在教师模型的上限。我们提出了OSMDA:一个自包含的领域适应框架,消除了这种依赖。我们的关键见解是,一个强大的基础VLM可以作为其自身的注释引擎:通过将航空图像与渲染的开放街图(OpenStreetMap, OSM)瓦片配对,我们利用模型的光学字符识别和图表理解能力生成由OSM丰富的辅助元数据增强的标题。然后,模型在仅包含卫星影像的结果语料库上进行微调,产生了OSMDA-VLM,一个无需手动标注且不依赖更强外部模型的领域适应VLM。我们进行了全面的评估,涵盖了10个基准测试,涉及图像-文本到文本的任务,并与9个竞争基线进行比较。当与真实数据等比例混合时,我们的方法实现了最先进的结果,同时训练成本显著低于依赖教师的替代方案。这些结果表明,考虑到强大的基础模型,与众包地理数据的对齐是实现遥感领域适应的可行且可扩展的路径。数据集和模型权重将公开发布。
cs.CV / 72 / 2603.11810
CEI-3D: Collaborative Explicit-Implicit 3D Reconstruction for Realistic and Fine-Grained Object Editing
CEI-3D:用于现实和细粒度对象编辑的协作显式-隐式三维重建
Abstract
Existing 3D editing methods often produce unrealistic and unrefined results due to the deeply integrated nature of their reconstruction networks. To address the challenge, this paper introduces CEI-3D, an editing-oriented reconstruction pipeline designed to facilitate realistic and fine-grained editing. Specifically, we propose a collaborative explicit-implicit reconstruction approach, which represents the target object using an implicit SDF network and a differentially sampled, locally controllable set of handler points. The implicit network provides a smooth and continuous geometry prior, while the explicit handler points offer localized control, enabling mutual guidance between the global 3D structure and user-specified local editing regions. To independently control each attribute of the handler points, we design a physical properties disentangling module to decouple the color of the handler points into separate physical properties. We also propose a dual-diffuse-albedo network in this module to process the edited and non-edited regions through separate branches, thereby preventing undesired interference from editing operations. Building on the reconstructed collaborative explicit-implicit representation with disentangled properties, we introduce a spatial-aware editing module that enables part-wise adjustment of relevant handler points. This module employs a cross-view propagation-based 3D segmentation strategy, which helps users to edit the specified physical attributes of a target part efficiently. Extensive experiments on both real and synthetic datasets demonstrate that our approach achieves more realistic and fine-grained editing results than the state-of-the-art (SOTA) methods while requiring less editing time. Our code is available on https://github.com/shiyue001/CEI-3D.
Chinese Translation
现有的三维编辑方法由于其重建网络的深度集成特性,往往产生不真实和不精细的结果。为了解决这一挑战,本文提出了CEI-3D,一种面向编辑的重建管道,旨在促进现实和细粒度的编辑。具体而言,我们提出了一种协作显式-隐式重建方法,该方法使用隐式SDF网络和一组差异采样的、局部可控的处理点来表示目标对象。隐式网络提供了平滑和连续的几何先验,而显式处理点则提供了局部控制,使得全局三维结构与用户指定的局部编辑区域之间能够相互引导。为了独立控制每个处理点的属性,我们设计了一个物理属性解耦模块,将处理点的颜色解耦为不同的物理属性。我们还在该模块中提出了一个双漫反射-反照率网络,通过独立的分支处理编辑和未编辑区域,从而防止编辑操作带来的不必要干扰。在具有解耦属性的重建协作显式-隐式表示的基础上,我们引入了一个空间感知编辑模块,使得相关处理点能够进行逐部分调整。该模块采用基于交叉视图传播的三维分割策略,帮助用户高效地编辑目标部分的指定物理属性。在真实和合成数据集上的大量实验表明,我们的方法在编辑结果的现实性和细粒度方面优于最新的(SOTA)方法,同时所需的编辑时间更少。我们的代码可在 https://github.com/shiyue001/CEI-3D 获取。
cs.CV / 73 / 2603.11827
Multimodal classification of Radiation-Induced Contrast Enhancements and tumor recurrence using deep learning
基于深度学习的辐射诱导对比增强与肿瘤复发的多模态分类
Abstract
The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on clinically sparsely available diffusion MRI or do not consider radiation maps, which are gaining increasing interest in the tumor board for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal MRI data with radiotherapy dose distributions for automated lesion classification using conventional T1-weighted MRI data. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. During extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.
Chinese Translation
在接受治疗的胶质母细胞瘤患者中,肿瘤复发与辐射诱导的对比增强之间的区分仍然是一个主要的临床挑战。现有的方法依赖于临床上稀缺的扩散MRI,或者未考虑辐射图,这在肿瘤委员会中日益受到关注。我们提出了RICE-NET,这是一种多模态3D深度学习模型,整合了纵向MRI数据与放疗剂量分布,以使用常规T1加权MRI数据进行自动化病灶分类。在92名患者的队列中,该模型在独立测试集上达到了0.92的F1分数。在广泛的消融实验中,我们量化了每个时间点和模态的贡献,并显示可靠的分类在很大程度上依赖于辐射图。基于遮挡的可解释性分析进一步确认了模型对临床相关区域的关注。这些发现突显了多模态深度学习在提高诊断准确性和支持神经肿瘤学临床决策中的潜力。
cs.CV / 74 / 2603.11831
Towards High-Fidelity CAD Generation via LLM-Driven Program Generation and Text-Based B-Rep Primitive Grounding
通过大语言模型驱动的程序生成和基于文本的边界表示原语定位实现高保真CAD生成
Abstract
The field of Computer-Aided Design (CAD) generation has made significant progress in recent years. Existing methods typically fall into two separate categories: parametric CAD modeling and direct boundary representation (B-Rep) synthesis. In modern feature-based CAD systems, parametric modeling and B-Rep are inherently intertwined, as advanced parametric operations (e.g., fillet and chamfer) require explicit selection of B-Rep geometric primitives, and the B-Rep itself is derived from parametric operations. Consequently, this paradigm gap remains a critical factor limiting AI-driven CAD modeling for complex industrial product design. This paper presents FutureCAD, a novel text-to-CAD framework that leverages large language models (LLMs) and a B-Rep grounding transformer (BRepGround) for high-fidelity CAD generation. Our method generates executable CadQuery scripts, and introduces a text-based query mechanism that enables the LLM to specify geometric selections via natural language, which BRepGround then grounds to the target primitives. To train our framework, we construct a new dataset comprising real-world CAD models. For the LLM, we apply supervised fine-tuning (SFT) to establish fundamental CAD generation capabilities, followed by reinforcement learning (RL) to improve generalization. Experiments show that FutureCAD achieves state-of-the-art CAD generation performance.
Chinese Translation
计算机辅助设计(CAD)生成领域近年来取得了显著进展。现有方法通常分为两类:参数化CAD建模和直接边界表示(B-Rep)合成。在现代基于特征的CAD系统中,参数化建模与B-Rep本质上是交织在一起的,因为高级参数化操作(例如圆角和倒角)需要明确选择B-Rep几何原语,而B-Rep本身是由参数化操作导出的。因此,这一范式差距仍然是限制AI驱动的复杂工业产品设计CAD建模的关键因素。本文提出了FutureCAD,这是一种新颖的文本到CAD框架,利用大语言模型(LLMs)和边界表示定位变换器(BRepGround)实现高保真CAD生成。我们的方法生成可执行的CadQuery脚本,并引入了一种基于文本的查询机制,使LLM能够通过自然语言指定几何选择,BRepGround随后将其定位到目标原语。为了训练我们的框架,我们构建了一个包含真实世界CAD模型的新数据集。对于LLM,我们应用监督微调(SFT)建立基本的CAD生成能力,随后通过强化学习(RL)提高泛化能力。实验表明,FutureCAD在CAD生成性能上达到了最先进水平。
cs.CV / 75 / 2603.11836
A Decade of Generative Adversarial Networks for Porous Material Reconstruction
生成对抗网络在多孔材料重建中的十年发展
Abstract
Digital reconstruction of porous materials has become increasingly critical for applications ranging from geological reservoir characterization to tissue engineering and electrochemical device design. While traditional methods such as micro-computed tomography and statistical reconstruction approaches have established foundations in this field, the emergence of deep learning techniques, particularly Generative Adversarial Networks (GANs), has revolutionized porous media reconstruction capabilities. This review systematically analyzes 96 peer-reviewed articles published from 2017 to early 2026, examining the evolution and applications of GAN-based approaches for porous material image reconstruction. We categorize GAN architectures into six distinct classes, namely Vanilla GANs, Multi-Scale GANs, Conditional GANs, Attention-Enhanced GANs, Style-based GANs, and Hybrid Architecture GANs. Our analysis reveals substantial progress including improvements in porosity accuracy (within 1% of original samples), permeability prediction (up to 79% reduction in mean relative errors), and achievable reconstruction volumes (from initial $64^3$ to current $2{,}200^3$ voxels). Despite these advances, persistent challenges remain in computational efficiency, memory constraints for large-scale reconstruction, and maintaining structural continuity in 2D-to-3D transformations. This systematic analysis provides a comprehensive framework for selecting appropriate GAN architectures based on specific application requirements.
Chinese Translation
多孔材料的数字重建在从地质储层表征到组织工程和电化学设备设计等应用中变得越来越重要。虽然传统方法如微计算机断层扫描和统计重建方法在这一领域奠定了基础,但深度学习技术的出现,特别是生成对抗网络(GANs),彻底改变了多孔介质的重建能力。本文系统分析了2017年至2026年初发表的96篇同行评审文章,考察了基于GAN的方法在多孔材料图像重建中的演变和应用。我们将GAN架构分为六个不同的类别,即基础GAN、多尺度GAN、条件GAN、增强注意力的GAN、基于风格的GAN和混合架构GAN。我们的分析显示了显著的进展,包括孔隙率准确性提升(与原始样本相差不超过1%)、渗透率预测(平均相对误差减少高达79%)和可实现的重建体积(从最初的$64^3$增加到当前的$2{,}200^3$体素)。尽管取得了这些进展,但在计算效率、大规模重建的内存限制以及在2D到3D转换中保持结构连续性方面仍然存在持续的挑战。这一系统分析为根据特定应用需求选择合适的GAN架构提供了全面的框架。
cs.CV / 76 / 2603.11846
ZeroSense: How Vision Matters in Long Context Compression
ZeroSense:视觉在长上下文压缩中的重要性
Abstract
Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.
Chinese Translation
近期的视觉-文本压缩(VTC)方法,如DeepSeek-OCR,通过利用文本到图像的渲染,在长上下文建模任务中报告了令人印象深刻的高标记压缩比。然而,现有的评估协议严重依赖于下游任务的性能。这些评估指标未能准确测量文本的保留,因为多模态大语言模型(MLLMs)具有强大的内在语言先验。在本研究中,我们引入了一种新的评估框架,以解耦MLLMs的能力,从而真实评估VTC质量。在该框架内,我们进一步引入了ZeroSense基准,以确保测试样本之间的低语义相关性。通过消除上下文依赖性,我们的基准保证评估结果仅反映VTC质量,而不受下游模型语义推理能力的影响。在多个数据集上的广泛实验表明,VTC质量与下游任务准确性显著偏离,突显了我们解耦评估框架的必要性。
cs.CV / 77 / 2603.11866
Derain-Agent: A Plug-and-Play Agent Framework for Rainy Image Restoration
Derain-Agent:一种即插即用的雨天图像恢复代理框架
Abstract
While deep learning has advanced single-image deraining, existing models suffer from a fundamental limitation: they employ a static inference paradigm that fails to adapt to the complex, coupled degradations (e.g., noise artifacts, blur, and color deviation) of real-world rain. Consequently, restored images often exhibit residual artifacts and inconsistent perceptual quality. In this work, we present Derain-Agent, a plug-and-play refinement framework that transitions deraining from static processing to dynamic, agent-based restoration. Derain-Agent equips a base deraining model with two core capabilities: 1) a Planning Network that intelligently schedules an optimal sequence of restoration tools for each instance, and 2) a Strength Modulation mechanism that applies these tools with spatially adaptive intensity. This design enables precise, region-specific correction of residual errors without the prohibitive cost of iterative search. Our method demonstrates strong generalization, consistently boosting the performance of state-of-the-art deraining models on both synthetic and real-world benchmarks.
Chinese Translation
尽管深度学习在单图像去雨方面取得了进展,但现有模型存在一个根本性限制:它们采用静态推理范式,无法适应真实世界雨水的复杂耦合退化(例如,噪声伪影、模糊和颜色偏差)。因此,恢复的图像往往表现出残余伪影和不一致的感知质量。在本研究中,我们提出了Derain-Agent,一种将去雨过程从静态处理转变为动态代理基础恢复的即插即用精细化框架。Derain-Agent为基础去雨模型提供了两个核心能力:1)一个规划网络(Planning Network),智能地为每个实例调度最佳的恢复工具序列;2)一个强度调制机制(Strength Modulation),以空间自适应的强度应用这些工具。该设计能够在不需要高昂的迭代搜索成本的情况下,精确地针对特定区域纠正残余错误。我们的方法展示了强大的泛化能力,持续提升了最先进的去雨模型在合成和真实世界基准上的性能。
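The plan-then-modulate loop this abstract describes (a planner chooses a tool sequence per image, and each tool is applied with spatially varying strength) can be caricatured as follows. The tool set, planner output format, and linear per-pixel blending rule are illustrative assumptions, not the paper's components.

```python
# Hypothetical sketch of agent-style restoration: a planner has produced
# an ordered list of tool names, and each tool is blended into the image
# with a per-pixel strength map (0 = leave pixel, 1 = fully apply tool).
def apply_plan(image, plan, tools, strength_map):
    """image: 2D list of pixel values; plan: list of tool names;
    tools: dict name -> per-pixel callable; strength_map: 2D list in [0, 1]."""
    out = [row[:] for row in image]  # copy so the input is untouched
    for name in plan:
        tool = tools[name]
        for i, row in enumerate(out):
            for j, px in enumerate(row):
                s = strength_map[i][j]
                # spatially adaptive blend between current and tool output
                out[i][j] = (1 - s) * px + s * tool(px)
    return out
```

With strength 0 a region is left untouched, which is how region-specific correction avoids disturbing already-clean areas.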
cs.CV / 78 / 2603.11888
Single-View Rolling-Shutter SfM
单视图滚动快门运动恢复结构(SfM)
Abstract
Rolling-shutter (RS) cameras are ubiquitous, but RS SfM (structure-from-motion) has not been fully solved yet. This work suggests an approach to remedy this: We characterize RS single-view geometry of observed world points or lines. Exploiting this geometry, we describe which motion and scene parameters can be recovered from a single RS image and systematically derive minimal reconstruction problems. We evaluate several representative cases with proof-of-concept solvers, highlighting both feasibility and practical limitations.
Chinese Translation
滚动快门(RS)相机无处不在,但滚动快门运动恢复结构(SfM)尚未完全解决。本文提出了一种解决方案:我们对观察到的世界点或线的滚动快门单视图几何特性进行表征。利用这一几何特性,我们描述了可以从单幅滚动快门图像中恢复的运动和场景参数,并系统地推导出最小重建问题。我们通过概念验证求解器评估了几个具有代表性的案例,突出了可行性与实际限制。
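The single-view RS geometry rests on the fact that each image row is exposed at a slightly different time, so a moving camera has a distinct pose per row. A minimal constant-velocity sketch of that row-time model (all parameter names and values are illustrative, not from the paper):

```python
# Hypothetical sketch: in a rolling-shutter image, row r is exposed at
# time t0 + r * line_delay, so a constant-velocity camera occupies a
# different position for every row of the same image.
def row_capture_time(row, t0, line_delay):
    """Exposure time of a given image row."""
    return t0 + row * line_delay

def camera_position(row, p0, velocity, t0=0.0, line_delay=1e-5):
    """Camera position at the moment the given row was exposed,
    assuming constant linear velocity (p0, velocity are 3-vectors)."""
    t = row_capture_time(row, t0, line_delay)
    return [p + v * t for p, v in zip(p0, velocity)]
```

It is this per-row pose variation that lets a single RS image constrain motion parameters at all, which a global-shutter image cannot.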
cs.CV / 79 / 2603.11911
InSpatio-WorldFM: An Open-Source Real-Time Generative Frame Model
InSpatio-WorldFM:一个开源的实时生成帧模型
Abstract
We present InSpatio-WorldFM, an open-source real-time frame model for spatial intelligence. Unlike video-based world models that rely on sequential frame generation and incur substantial latency due to window-level processing, InSpatio-WorldFM adopts a frame-based paradigm that generates each frame independently, enabling low-latency real-time spatial inference. By enforcing multi-view spatial consistency through explicit 3D anchors and implicit spatial memory, the model preserves global scene geometry while maintaining fine-grained visual details across viewpoint changes. We further introduce a progressive three-stage training pipeline that transforms a pretrained image diffusion model into a controllable frame model and finally into a real-time generator through few-step distillation. Experimental results show that InSpatio-WorldFM achieves strong multi-view consistency while supporting interactive exploration on consumer-grade GPUs, providing an efficient alternative to traditional video-based world models for real-time world simulation.
Chinese Translation
我们提出了InSpatio-WorldFM,一个用于空间智能的开源实时帧模型。与依赖于顺序帧生成并因窗口级处理而导致显著延迟的基于视频的世界模型不同,InSpatio-WorldFM采用基于帧的范式,独立生成每一帧,从而实现低延迟的实时空间推理。通过显式的3D锚点和隐式空间记忆强制执行多视角空间一致性,该模型在保持全局场景几何形状的同时,能够在视角变化中维持细致的视觉细节。我们进一步介绍了一个渐进的三阶段训练流程,将预训练的图像扩散模型转化为可控的帧模型,最终通过少步蒸馏转变为实时生成器。实验结果表明,InSpatio-WorldFM在支持消费级GPU上的交互式探索的同时,达到了强大的多视角一致性,为实时世界模拟提供了相较传统基于视频的世界模型的高效替代方案。
cs.CV / 80 / 2603.11917
PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation
PicoSAM3:实时传感器内感兴趣区域分割
Abstract
Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3 M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82 ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level.
Chinese Translation
实时的设备内分割对于延迟敏感和注重隐私的应用(如智能眼镜和物联网设备)至关重要。我们介绍了PicoSAM3,这是一种轻量级的可提示视觉分割模型,针对边缘和传感器内执行进行了优化,包括在Sony IMX500视觉传感器上的部署。PicoSAM3具有130万参数,结合了密集卷积神经网络架构、感兴趣区域提示编码、有效通道注意力(Efficient Channel Attention)以及来自SAM2和SAM3的知识蒸馏。在COCO和LVIS数据集上,PicoSAM3分别达到了65.45%和64.01%的平均交并比(mIoU),在相似或更低复杂度下超越了现有的基于SAM的和面向边缘的基准。经过INT8量化的模型在IMX500上以11.82毫秒的延迟实现实时传感器内推理,保持了精度且几乎没有降级,完全符合其内存和操作约束。消融研究表明,从大型SAM模型进行蒸馏可使mIoU提升高达+14.5%,并证明高质量、空间灵活的可提示分割在传感器级别直接实现是可行的。
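Knowledge distillation of the kind reported for PicoSAM3 (a small student mimicking SAM2/SAM3 teacher outputs alongside ground-truth supervision) is commonly implemented as a weighted sum of a teacher-matching term and a task term. A minimal per-pixel logit sketch under that assumption; the blending weight and function names are illustrative, not taken from the paper:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def distill_loss(student_logits, teacher_logits, gt, alpha=0.5):
    """Blend a teacher-matching MSE with a ground-truth BCE.

    student_logits / teacher_logits: flat lists of mask logits;
    gt: flat list of 0/1 labels; alpha: weight on the teacher term.
    """
    n = len(student_logits)
    # soft-target term: match the teacher's logits
    mse = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n
    # task term: binary cross-entropy against the ground-truth mask
    eps = 1e-7
    bce = -sum(
        y * math.log(_sigmoid(s) + eps) + (1 - y) * math.log(1.0 - _sigmoid(s) + eps)
        for s, y in zip(student_logits, gt)
    ) / n
    return alpha * mse + (1.0 - alpha) * bce
```

The teacher term carries information the hard labels lack (e.g. boundary uncertainty), which is one plausible reason distillation can outperform purely supervised training of the same small model.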
cs.CV / 81 / 2603.11952
Preliminary analysis of RGB-NIR Image Registration techniques for off-road forestry environments
越野林业环境下RGB-NIR图像配准技术的初步分析
Abstract
RGB-NIR image registration plays an important role in sensor fusion, image enhancement, and off-road autonomy. In this work, we evaluate both classical and Deep Learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under six different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large-scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Although this is only a preliminary evaluation, our study indicates that further refinement is needed to achieve robust, multi-scale registration for off-road forestry applications.
Chinese Translation
RGB-NIR图像配准在传感器融合、图像增强和越野自主导航中发挥着重要作用。在本研究中,我们评估了经典的和基于深度学习(Deep Learning, DL)的图像配准技术,以评估其在越野林业应用中的适用性。NeMAR在6种不同配置下进行训练,显示出部分成功,但其生成对抗网络(GAN)损失的不稳定性表明在保持几何一致性方面存在挑战。MURF在越野林业数据上的测试显示出在共享信息提取过程中具有良好的大规模特征对齐能力,但在密集植被中的细节处理上存在困难。尽管这仅仅是初步评估,但我们的研究需要进一步改进,以实现越野森林应用的稳健多尺度配准。
cs.CV / 82 / 2603.11969
AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies
AstroSplat:基于物理的高斯点云渲染与小天体重建
Abstract
Image-based surface reconstruction and characterization are crucial for missions to small celestial bodies (e.g., asteroids), as it informs mission planning, navigation, and scientific analysis. Recent advances in Gaussian splatting enable high-fidelity neural scene representations but typically rely on a spherical harmonic intensity parameterization that is strictly appearance-based and does not explicitly model material properties or light-surface interactions. We introduce AstroSplat, a physics-based Gaussian splatting framework that integrates planetary reflectance models to improve the autonomous reconstruction and photometric characterization of small-body surfaces from in-situ imagery. The proposed framework is validated on real imagery taken by NASA's Dawn mission, where we demonstrate superior rendering performance and surface reconstruction accuracy compared to the typical spherical harmonic parameterization.
Chinese Translation
基于图像的表面重建和特征表征对于小天体(例如小行星)任务至关重要,因为它为任务规划、导航和科学分析提供了重要信息。最近在高斯点云技术上的进展使得高保真神经场景表示成为可能,但通常依赖于严格基于外观的球谐强度参数化,并未明确建模材料属性或光-表面交互。我们提出了AstroSplat,一个基于物理的高斯点云框架,集成了行星反射模型,以改善从原位图像中对小天体表面的自主重建和光度特征表征。所提出的框架在NASA的“黎明”任务拍摄的真实图像上进行了验证,我们展示了与典型的球谐参数化相比,具有更优的渲染性能和表面重建精度。
cs.CV / 83 / 2603.11971
Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling
通过双向交叉注意力和时间建模的多模态情感识别
Abstract
Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
Chinese Translation
在自然场景视频数据中,情感识别仍然是一个具有挑战性的问题,因为面部外观、头部姿态、光照、背景噪声以及人类情感的动态特性存在较大变化。仅依赖单一模态,如面部表情或语音,往往不足以捕捉这些复杂的情感线索。为了解决这一问题,我们提出了一种多模态情感识别框架,旨在第十届情感行为分析(ABAW)挑战中的表情(EXPR)识别任务。我们的方法利用大规模预训练模型,具体而言,使用 CLIP 进行视觉编码和 Wav2Vec 2.0 进行音频表示学习,作为冻结的主干网络。为了建模面部表情序列中的时间依赖性,我们在固定长度的视频窗口上采用时间卷积网络(TCN)。此外,我们引入了一个双向交叉注意力融合模块,其中视觉和音频特征对称交互,以增强跨模态的上下文化并捕捉互补的情感信息。然后使用轻量级分类头进行最终情感预测。我们进一步结合基于 CLIP 文本特征的文本引导对比目标,以促进语义对齐的视觉表示。在 ABAW 第十届 EXPR 基准上的实验结果表明,所提出的框架提供了强大的多模态基线,并在单模态建模上实现了性能提升。这些结果证明了结合时间视觉建模、音频表示学习和跨模态融合在无约束的真实世界环境中进行稳健情感识别的有效性。
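The bi-directional cross-attention fusion described above lets each modality query the other symmetrically. A single-head, list-based sketch of that idea; shapes are toy-sized and learned projections are omitted, so this is a simplified assumption rather than the paper's module:

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention (single head, lists of float vectors)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)                      # stabilized softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fuse(visual, audio):
    """Symmetric cross-attention: each modality queries the other."""
    v2a = attend(visual, audio, audio)   # visual tokens attend to audio
    a2v = attend(audio, visual, visual)  # audio tokens attend to visual
    return v2a, a2v
```

Running attention in both directions gives each stream a context vector summarizing the other modality, which is the "complementary emotional information" the abstract refers to.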
cs.CV / 84 / 2603.11975
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
HomeSafe-Bench:评估家庭场景中具身智能体的危险行为检测的视觉-语言模型
Abstract
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce \textbf{HomeSafe-Bench}, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose \textbf{Hierarchical Dual-Brain Guard for Household Safety (HD-Guard)}, a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
Chinese Translation
具身智能体的快速发展加速了家庭机器人在现实环境中的部署。然而,与结构化的工业环境不同,家庭空间引入了不可预测的安全风险,其中系统限制(如感知延迟和缺乏常识知识)可能导致危险错误。目前的安全评估通常局限于静态图像、文本或一般性危害,未能充分基准化这些特定背景下的动态危险行为检测。为填补这一空白,我们引入了HomeSafe-Bench,这是一个旨在评估视觉-语言模型(VLMs)在家庭场景中危险行为检测的挑战性基准。HomeSafe-Bench通过结合物理模拟与先进视频生成的混合管道构建,涵盖了六个功能领域的438个多样化案例,并提供了细粒度的多维注释。除了基准测试,我们还提出了家庭安全分层双脑守护系统(HD-Guard),这是一种用于实时安全监控的分层流式架构。HD-Guard协调一个轻量级的FastBrain进行持续的高频筛查,同时使用一个异步的大规模SlowBrain进行深度多模态推理,有效平衡了推理效率与检测准确性。评估结果表明,HD-Guard在延迟与性能之间实现了优越的权衡,而我们的分析识别了当前基于VLM的安全检测中的关键瓶颈。
cs.CV / 85 / 2603.11984
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation
Ada3Drift:用于一步3D视觉运动机器人操控的自适应训练时漂移
Abstract
Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs.\ real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.
Chinese Translation
基于扩散的视觉运动策略通过迭代去噪有效捕捉多模态动作分布,但其高推理延迟限制了实时机器人控制。最近的流匹配和基于一致性的方法实现了单步生成,但牺牲了保留不同动作模式的能力,将多模态行为压缩为平均的、通常在物理上不可行的轨迹。我们观察到,机器人领域中计算预算的不对称性(离线训练与实时推理)自然促使我们通过将迭代精炼从推理时间转移到训练时间来恢复这种多模态的保真度。基于这一见解,我们提出了Ada3Drift,它学习一个训练时漂移场,将预测的动作吸引到专家演示模式,同时将其排斥于其他生成样本,从而实现从3D点云观测中高保真度的单步生成(1 NFE)。为了处理少样本机器人领域,Ada3Drift进一步引入了一个sigmoid调度的损失过渡,从粗分布学习到模式锐化精炼,以及多尺度场聚合,捕捉在不同空间粒度下的动作模式。在三个仿真基准(Adroit、Meta-World和RoboTwin)以及真实世界的机器人操控任务上的实验表明,Ada3Drift在实现最先进的性能的同时,所需的函数评估次数比基于扩散的方法少10倍。
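The training-time drifting field described above can be caricatured with a toy attraction-repulsion rule: predictions are pulled toward kernel-weighted expert demonstration modes and pushed away from other generated samples. This is a minimal sketch of the general idea, not Ada3Drift's learned field; the kernel bandwidth `sigma`, repulsion weight `repel`, and the 2-D action space are all illustrative assumptions.

```python
import numpy as np

def drift_field(preds, experts, others, sigma=1.0, repel=0.5):
    """Toy attraction-repulsion drift: pull each predicted action toward
    kernel-weighted expert modes and push it away from other generated
    samples. A generic sketch, not Ada3Drift's learned field."""
    def kernel_dir(x, refs):
        diffs = refs - x                                   # (M, D) directions
        w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * sigma ** 2))
        return (w / (w.sum() + 1e-8)) @ diffs              # weighted mean direction
    drifts = [kernel_dir(x, experts) - repel * kernel_dir(x, others)
              for x in preds]
    return np.asarray(drifts)

experts = np.array([[0.0, 0.0], [4.0, 4.0]])   # two demonstration modes
others = np.array([[2.0, 2.0]])                # an averaged, infeasible sample
preds = np.array([[0.5, 0.5], [3.5, 3.5]])
d = drift_field(preds, experts, others)
```

One small refinement step `preds + 0.1 * d` moves each prediction toward its nearest demonstration mode and away from the mode-averaged sample, which is the behavior the training-time iteration relies on.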
cs.CV / 86 / 2603.12008
CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation
CrossEarth-SAR:一个以SAR为中心的十亿规模地理空间基础模型,用于领域可泛化的语义分割
Abstract
Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmarks, and datasets will be publicly available.
Chinese Translation
合成孔径雷达(SAR)能够实现全球范围内的全天候地球观测。然而,由于成像机制的多样性,传感器和区域之间的领域转移严重阻碍了其语义泛化。为了解决这个问题,我们提出了CrossEarth-SAR,这是第一个基于新颖的物理引导稀疏专家混合(MoE)架构构建的十亿规模SAR视觉基础模型,结合了物理描述符,专门设计用于跨领域的语义分割。为了促进大规模的预训练,我们开发了CrossEarth-SAR-200K,这是一个统一了公共和私有SAR影像的弱监督和全监督数据集。我们还引入了一个基准套件,包含8个不同领域差距的22个子基准,建立了SAR影像领域泛化语义分割的第一个统一标准。大量实验表明,CrossEarth-SAR在20个基准上达到了最先进的结果,在某些基准下的多领域迁移中,mIoU超过了之前方法10%以上。所有代码、基准和数据集将公开提供。
cs.CV / 87 / 2603.12013
Pano360: Perspective to Panoramic Vision with Geometric Consistency
Pano360:具有几何一致性的透视到全景视觉
Abstract
Prior panorama stitching approaches heavily rely on pairwise feature correspondences and are unable to leverage geometric consistency across multiple views. This leads to severe distortion and misalignment, especially in challenging scenes with weak textures, large parallax, and repetitive patterns. Given that multi-view geometric correspondences can be directly constructed in 3D space, making them more accurate and globally consistent, we extend the 2D alignment task to the 3D photogrammetric space. We adopt a novel transformer-based architecture to achieve 3D awareness and aggregate global information across all views. It directly utilizes camera poses to guide image warping for global alignment in 3D space and employs a multi-feature joint optimization strategy to compute the seams. Additionally, to establish an evaluation benchmark and train our network, we constructed a large-scale dataset of real-world scenes. Extensive experiments show that our method significantly outperforms existing alternatives in alignment accuracy and perceptual quality.
Chinese Translation
以往的全景拼接方法严重依赖成对特征对应关系,无法利用多个视角之间的几何一致性。这导致在具有弱纹理、大视差和重复模式的挑战性场景中出现严重的失真和错位。鉴于多视角几何对应关系可以直接在三维空间中构建,使其更加准确且具有全局一致性,我们将二维对齐任务扩展到三维摄影测量空间。我们采用了一种新颖的基于变换器的架构,以实现三维感知并聚合所有视角的全局信息。该方法直接利用相机姿态指导图像变形,以实现三维空间中的全局对齐,并采用多特征联合优化策略来计算接缝。此外,为了建立评估基准并训练我们的网络,我们构建了一个大规模的真实场景数据集。大量实验表明,我们的方法在对齐精度和感知质量上显著优于现有的替代方案。
cs.CV / 88 / 2603.12016
Nyxus: A Next Generation Image Feature Extraction Library for the Big Data and AI Era
Nyxus:面向大数据和人工智能时代的下一代图像特征提取库
Abstract
Modern imaging instruments can produce terabytes to petabytes of data for a single experiment. The biggest barrier to processing big image datasets has been computational, where image analysis algorithms often lack the efficiency needed to process such large datasets or make tradeoffs in robustness and accuracy. Deep learning algorithms have vastly improved the accuracy of the first step in an analysis workflow (region segmentation), but the expansion of domain specific feature extraction libraries across scientific disciplines has made it difficult to compare the performance and accuracy of extracted features. To address these needs, we developed a novel feature extraction library called Nyxus. Nyxus is designed from the ground up for scalable out-of-core feature extraction for 2D and 3D image data and rigorously tested against established standards. The comprehensive feature set of Nyxus covers multiple biomedical domains including radiomics and cellular analysis, and is designed for computational scalability across CPUs and GPUs. Nyxus has been packaged to be accessible to users of various skill sets and needs: as a Python package for code developers, a command line tool, as a Napari plugin for low to no-code users or users that want to visualize results, and as an Open Container Initiative (OCI) compliant container that can be used in cloud or super-computing workflows aimed at processing large data sets. Further, Nyxus enables a new methodological approach to feature extraction allowing for programmatic tuning of many features sets for optimal computational efficiency or coverage for use in novel machine learning and deep learning applications.
Chinese Translation
现代成像仪器能够为单个实验生成数太字节到数拍字节的数据。处理大规模图像数据集的最大障碍在于计算能力,图像分析算法往往缺乏处理如此大数据集所需的效率,或者在鲁棒性和准确性之间做出权衡。深度学习算法显著提高了分析工作流程中第一步(区域分割)的准确性,但特定领域特征提取库在各科学学科的扩展使得比较提取特征的性能和准确性变得困难。为了解决这些需求,我们开发了一种新颖的特征提取库,称为Nyxus。Nyxus从零开始设计,旨在实现可扩展的外存特征提取,适用于2D和3D图像数据,并经过严格测试以符合既定标准。Nyxus的全面特征集涵盖多个生物医学领域,包括放射组学和细胞分析,并设计为在CPU和GPU上具有计算可扩展性。Nyxus已被打包为适合不同技能水平和需求的用户使用:作为供代码开发者使用的Python包、命令行工具、供低代码、无代码或希望可视化结果的用户使用的Napari插件,或作为符合开放容器倡议(OCI)的容器,可用于云计算或超级计算工作流程,旨在处理大数据集。此外,Nyxus还支持一种新的特征提取方法,允许对多个特征集进行程序化调优,以实现最佳计算效率或特征覆盖范围,适用于新颖的机器学习和深度学习应用。
cs.CV / 89 / 2603.12036
Single Pixel Image Classification using an Ultrafast Digital Light Projector
使用超快速数字光投影仪的单像素图像分类
Abstract
Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, must collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates by combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmark task of MNIST digit classification. We compare the classification performance of two machine learning models: an extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as a binary classifier, we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.
Chinese Translation
模式识别和图像分类是机器视觉中的重要任务。例如,自动驾驶车辆需要能够收集变化环境中包含的复杂信息并实时进行分类。在此,我们通过实验展示了结合单像素成像(Single Pixel Imaging, SPI)技术与低复杂度机器学习模型的多千赫兹帧率图像分类。使用集成于CMOS上的微型LED(microLED-on-CMOS)数字光投影仪进行SPI,使得亚毫秒图像编码的超快速模式生成成为可能。我们研究了我们的实验系统在广泛接受的基准任务MNIST数字分类中的分类准确性。我们比较了两种机器学习模型的分类性能:极限学习机(Extreme Learning Machine, ELM)和反向传播训练的深度神经网络。两种模型的复杂性都保持在较低水平,因此添加到推理时间的开销与图像生成时间相当。至关重要的是,我们的单像素图像分类方法基于信息的时空变换,完全绕过了图像重建的需求。通过探索基于SPI的ELM作为二分类器的性能,我们展示了其在超快速成像场景中高效异常检测的潜力。
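The reconstruction-free pipeline above has a compact numerical analogue: project random binary patterns, record one inner product per pattern as the single-pixel signal, and feed those measurements to an ELM (fixed random hidden layer plus a one-shot least-squares readout). Everything below — scene size, pattern count, and the two synthetic classes — is an invented stand-in for the experimental setup, not the authors' data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, hidden = 256, 64, 128            # pixels, SPI patterns, ELM neurons
patterns = rng.integers(0, 2, size=(M, n)).astype(float)

def measure(images):
    # Single-pixel measurements: one inner product per projected pattern;
    # the image itself is never reconstructed.
    return images @ patterns.T

def make_class(bright_left, N):
    # Synthetic stand-in scenes: low-level noise plus a bright half.
    imgs = 0.1 * rng.random((N, 16, 16))
    if bright_left:
        imgs[:, :, :8] += 1.0
    else:
        imgs[:, :, 8:] += 1.0
    return imgs.reshape(N, n)

Xtr = np.vstack([measure(make_class(True, 50)), measure(make_class(False, 50))])
ytr = np.array([0] * 50 + [1] * 50)
Xte = np.vstack([measure(make_class(True, 20)), measure(make_class(False, 20))])
yte = np.array([0] * 20 + [1] * 20)

# Extreme learning machine: fixed random hidden layer, output weights
# solved in one shot by least squares (no backpropagation).
mu, sd = Xtr.mean(0), Xtr.std(0) + 1e-8
W, b = rng.normal(size=(M, hidden)), rng.normal(size=hidden)
H = np.tanh(((Xtr - mu) / sd) @ W + b)
beta = np.linalg.lstsq(H, np.eye(2)[ytr], rcond=None)[0]
pred = np.argmax(np.tanh(((Xte - mu) / sd) @ W + b) @ beta, axis=1)
acc = (pred == yte).mean()
```

The classifier operates directly on the M-dimensional bucket signal, so its cost scales with the number of patterns rather than the number of pixels — the property that makes the multi-kHz rates plausible.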
cs.CV / 90 / 2603.12055
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
通过语义几何保持实现的视觉-语言模型持续学习
Abstract
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
Chinese Translation
预训练视觉-语言模型(VLMs)的持续学习容易遭遇灾难性遗忘,然而当前的方法在适应新任务时并未明确保持从预训练和之前阶段继承的跨模态语义几何,导致新任务的监督引发几何失真。我们观察到,最明显的漂移往往集中在旧-新语义接口附近的脆弱邻域,在这些区域,共享的视觉模式容易被新的文本语义重新解释。为了解决这一问题,我们在无样本约束下提出了持续学习的语义几何保持(SeGP-CL)。SeGP-CL首先通过构建一组紧凑的对抗锚点,利用双目标投影梯度下降(DPGD)探测易漂移区域,该方法将选定的新任务种子引导至旧类别语义,同时在原始视觉空间中保持忠实。在训练过程中,我们通过锚点引导的跨模态几何蒸馏(ACGD)保持跨模态结构,并通过轻量级文本语义-几何正则化(TSGR)在任务间稳定文本参考框架。训练后,我们估计锚点引起的原始空间漂移,以转移旧的视觉原型,并通过融合跨模态和视觉线索进行双路径推理。在五个持续学习基准上的大量实验表明,SeGP-CL始终提高了稳定性和前向迁移,达到了最先进的性能,同时更好地保持了VLMs的语义几何。
cs.CV / 91 / 2603.12057
Coarse-Guided Visual Generation via Weighted h-Transform Sampling
通过加权 h-变换采样进行粗略引导的视觉生成
Abstract
Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or are difficult to balance between guidance and synthetic quality. To address these challenges, we propose a novel guided method by using the h-transform, a tool that can constrain stochastic processes (e.g., sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding to the original differential equation with a drift function, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
Chinese Translation
粗略引导的视觉生成是从降级或低保真粗略参考中合成精细视觉样本的过程,对于各种现实世界应用至关重要。尽管基于训练的方法有效,但由于配对数据收集的限制,它们在训练成本高和泛化能力受限方面固有不足。因此,最近的无训练方法提出利用预训练的扩散模型,并在采样过程中引入引导。然而,这些无训练方法要么需要知道前向(从精细到粗略)变换算子,例如双三次下采样,要么难以在引导与合成质量之间取得平衡。为了解决这些挑战,我们提出了一种新颖的引导方法,使用 h-变换,这是一种可以在期望条件下约束随机过程(例如采样过程)的工具。具体而言,我们通过在每个采样时间步的原始微分方程中添加漂移函数来修改转移概率,从而大致引导生成过程朝向理想的精细样本。为了解决不可避免的近似误差,我们引入了一种噪声水平感知的调度策略,随着误差的增加逐渐降低该项的权重,从而确保引导遵循和高质量合成。针对多种图像和视频生成任务的广泛实验表明了我们方法的有效性和泛化能力。
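The sampling rule described above — add a guidance drift to the original differential equation and de-weight it where the h-transform approximation is unreliable — can be sketched with a toy 1-D Euler sampler. The base drift f(x) = -x, the linear schedule, and all constants here are invented for illustration; the paper's drift comes from a learned h-transform, not a closed form.

```python
import numpy as np

def sample(x0, target, steps=100, guide=20.0):
    """Euler sampler for a toy 1-D ODE: dx/dt = f(x) + w(t) * (target - x).
    f(x) = -x stands in for the pretrained model's unguided drift, and the
    linear schedule w(t) down-weights the guidance term at high noise
    levels, where the h-transform approximation would be least reliable.
    The dynamics and the schedule shape are illustrative assumptions."""
    x, dt = float(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt                     # t runs from noisy (0) to clean (1)
        w = guide * t                  # noise-level-aware weight
        x += dt * (-x + w * (target - x))
    return x

x_guided = sample(x0=0.0, target=2.0)           # steered toward the target
x_free = sample(x0=0.0, target=2.0, guide=0.0)  # unguided baseline
```

With the drift term active, the trajectory settles near the coarse-conditioned target; with it removed, the sampler follows the unguided dynamics — the trade-off between guidance adherence and fidelity that the noise-level-aware schedule is meant to balance.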
cs.CV / 92 / 2603.12063
NBAvatar: Neural Billboards Avatars with Realistic Hand-Face Interaction
NBAvatar:具有真实手脸交互的神经广告牌头像
Abstract
We present NBAvatar - a method for realistic rendering of head avatars handling non-rigid deformations caused by hand-face interaction. We introduce a novel representation for animated avatars by combining the training of oriented planar primitives with neural rendering. Such a combination of explicit and implicit representations enables NBAvatar to handle temporally and pose-consistent geometry, along with fine-grained appearance details provided by the neural rendering technique. In our experiments, we demonstrate that NBAvatar implicitly learns color transformations caused by face-hand interactions and surpasses existing approaches in terms of novel-view and novel-pose rendering quality. Specifically, NBAvatar achieves up to 30% LPIPS reduction under high-resolution megapixel rendering compared to Gaussian-based avatar methods, while also improving PSNR and SSIM, and achieves higher structural similarity compared to the state-of-the-art hand-face interaction method InteractAvatar.
Chinese Translation
我们提出了NBAvatar——一种处理手脸交互引起的非刚性变形的头部头像的真实渲染方法。我们通过结合定向平面基元的训练与神经渲染,提出了一种新的动画头像表示方法。这种显式与隐式表示的结合使得NBAvatar能够处理时间和姿态一致的几何形状,以及神经渲染技术提供的细致外观细节。在我们的实验中,我们展示了NBAvatar隐式学习由面部与手部交互引起的颜色变换,并在新视角和新姿态渲染质量方面超越了现有方法。具体而言,与基于高斯的头像方法相比,NBAvatar在高分辨率百万像素渲染下实现了高达30%的LPIPS降低,同时提高了PSNR和SSIM,并且在结构相似性方面优于最先进的手脸交互方法InteractAvatar。
cs.CV / 93 / 2603.12064
Dense Dynamic Scene Reconstruction and Camera Pose Estimation from Multi-View Videos
基于多视角视频的密集动态场景重建与相机位姿估计
Abstract
We address the challenging problem of dense dynamic scene reconstruction and camera pose estimation from multiple freely moving cameras -- a setting that arises naturally when multiple observers capture a shared event. Prior approaches either handle only single-camera input or require rigidly mounted, pre-calibrated camera rigs, limiting their practical applicability. We propose a two-stage optimization framework that decouples the task into robust camera tracking and dense depth refinement. In the first stage, we extend single-camera visual SLAM to the multi-camera setting by constructing a spatiotemporal connection graph that exploits both intra-camera temporal continuity and inter-camera spatial overlap, enabling consistent scale and robust tracking. To ensure robustness under limited overlap, we introduce a wide-baseline initialization strategy using feed-forward reconstruction models. In the second stage, we refine depth and camera poses by optimizing dense inter- and intra-camera consistency using wide-baseline optical flow. Additionally, we introduce MultiCamRobolab, a new real-world dataset with ground-truth poses from a motion capture system. Finally, we demonstrate that our method significantly outperforms state-of-the-art feed-forward models on both synthetic and real-world benchmarks, while requiring less memory.
Chinese Translation
我们解决了从多个自由移动的相机中进行密集动态场景重建和相机位姿估计的挑战性问题——这一设置在多个观察者捕捉共享事件时自然出现。以往的方法要么仅处理单相机输入,要么需要刚性安装的、预先标定的相机设备,这限制了它们的实际应用性。我们提出了一种两阶段优化框架,将任务解耦为鲁棒的相机跟踪和密集深度优化。在第一阶段,我们通过构建一个时空连接图,将单相机视觉SLAM扩展到多相机设置,该图利用了相机内部的时间连续性和相机之间的空间重叠,从而实现一致的尺度和鲁棒的跟踪。为了在重叠有限的情况下确保鲁棒性,我们引入了一种使用前馈重建模型的宽基线初始化策略。在第二阶段,我们通过优化密集的相机间和相机内一致性,利用宽基线光流来细化深度和相机位姿。此外,我们引入了MultiCamRobolab,一个新的真实世界数据集,其中包含来自运动捕捉系统的真实位姿。最后,我们展示了我们的方法在合成和真实世界基准测试中显著优于最先进的前馈模型,同时所需内存更少。
cs.CV / 94 / 2603.12067
Beyond Convolution: A Taxonomy of Structured Operators for Learning-Based Image Processing
超越卷积:基于学习的图像处理结构化算子的分类
Abstract
The convolution operator is the fundamental building block of modern convolutional neural networks (CNNs), owing to its simplicity, translational equivariance, and efficient implementation. However, its structure as a fixed, linear, locally-averaging operator limits its ability to capture structured signal properties such as low-rank decompositions, adaptive basis representations, and non-uniform spatial dependencies. This paper presents a systematic taxonomy of operators that extend or replace the standard convolution in learning-based image processing pipelines. We organise the landscape of alternative operators into five families: (i) decomposition-based operators, which separate structural and noise components through singular value or tensor decompositions; (ii) adaptive weighted operators, which modulate kernel contributions as a function of spatial position or signal content; (iii) basis-adaptive operators, which optimise the analysis bases together with the network weights; (iv) integral and kernel operators, which generalise the convolution to position-dependent and non-linear kernels; and (v) attention-based operators, which relax the locality assumption entirely. For each family, we provide a formal definition, a discussion of its structural properties with respect to the convolution, and a critical analysis of the tasks for which the operator is most appropriate. We further provide a comparative analysis of all families across relevant dimensions -- linearity, locality, equivariance, computational cost, and suitability for image-to-image and image-to-label tasks -- and outline the open challenges and future directions of this research area.
Chinese Translation
卷积算子是现代卷积神经网络(CNNs)的基本构建块,因其简单性、平移等变性和高效实现而受到广泛应用。然而,作为一种固定的、线性的、局部平均的算子,其结构限制了其捕捉结构化信号特性的能力,例如低秩分解、自适应基表示和非均匀空间依赖性。本文提出了一种系统的算子分类,旨在扩展或替代学习型图像处理流程中的标准卷积。我们将替代算子的领域组织为五个类别:(i)基于分解的算子,通过奇异值或张量分解分离结构和噪声成分;(ii)自适应加权算子,根据空间位置或信号内容调节核的贡献;(iii)基于基的自适应算子,优化分析基与网络权重;(iv)积分和核算子,将卷积推广到位置依赖和非线性核;(v)基于注意力的算子,完全放宽局部性假设。对于每个类别,我们提供了正式定义,讨论其相对于卷积的结构特性,并对该算子最适合的任务进行批判性分析。我们还提供了所有类别在相关维度(线性、局部性、等变性、计算成本以及适用于图像到图像和图像到标签任务的适用性)上的比较分析,并概述了该研究领域的开放挑战和未来方向。
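The contrast between the fixed convolution and family (ii) adaptive weighted operators can be made concrete in 1-D: the same spatial kernel, once applied with fixed weights and once modulated per position by an intensity-similarity range weight. The bilateral-style modulation below is one invented instance of content-dependent weighting, chosen only because its edge-preserving effect is easy to verify.

```python
import numpy as np

def conv1d(x, k):
    # Standard fixed-kernel (correlation-style) convolution, edge padding.
    pad = len(k) // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([xp[i:i + len(k)] @ k for i in range(len(x))])

def adaptive_conv1d(x, k, sigma=0.1):
    # Family (ii): the spatial kernel k is modulated per position by a
    # range weight on intensity similarity (a bilateral-style choice,
    # one concrete instance of content-dependent modulation).
    pad = len(k) // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.empty_like(x)
    for i in range(len(x)):
        win = xp[i:i + len(k)]
        w = k * np.exp(-((win - x[i]) ** 2) / (2 * sigma ** 2))
        out[i] = (w @ win) / w.sum()
    return out

step = np.array([0.0] * 8 + [1.0] * 8)   # an ideal edge
box = np.ones(5) / 5
blurred = conv1d(step, box)              # the fixed kernel smears the edge
kept = adaptive_conv1d(step, box)        # adaptive weights preserve it
```

The fixed operator averages across the discontinuity (0.4 at the last dark pixel), while the adaptive one stays near 0 and 1 on either side — the non-uniform spatial dependency that the fixed linear operator cannot express.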
cs.CV / 95 / 2603.12071
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
LoV3D:通过区域体积评估在纵向三维脑MRI中基础认知预后推理
Abstract
Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risks of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
Chinese Translation
纵向脑MRI对于表征神经疾病(如阿尔茨海默病)的进展至关重要。然而,当前的深度学习工具将这一过程碎片化:分类器将扫描结果简化为标签,体积管道产生未解释的测量结果,而视觉语言模型(VLMs)可能生成流畅但潜在的幻觉结论。我们提出了LoV3D,一个用于训练三维视觉语言模型的管道,该管道读取纵向T1加权脑MRI,生成区域级解剖评估,与先前扫描进行纵向比较,最终输出三类诊断(认知正常、轻度认知障碍或痴呆)以及综合诊断摘要。该分步管道通过强制标签一致性、纵向连贯性和生物学合理性来巩固最终诊断,从而降低幻觉的风险。训练过程中引入了一个临床加权的验证器,该验证器根据来自标准化体积指标的规范参考自动评分候选输出,推动直接偏好优化,而无需任何人工注释。在一个被保留的ADNI测试集(479个扫描,258个受试者)上,LoV3D实现了93.7%的三类诊断准确率(比无基础模型提高34.8%),97.2%的两类诊断准确率(比现有技术提高4%),以及82.6%的区域级解剖分类准确率(比VLM基线提高33.1%)。零样本迁移在MIRIAD上达到了95.4%(100%的痴呆召回率),在AIBL上达到了82.9%的三类准确率,确认了在不同站点、扫描仪和人群中的高泛化能力。代码可在 https://github.com/Anonymous-TEVC/LoV-3D 获取。
cs.CV / 96 / 2603.12078
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
Node-RF:基于神经常微分方程的广义连续时空场景动态学习
Abstract
Predicting scene dynamics from visual observations is challenging. Existing methods capture dynamics only within observed boundaries, failing to extrapolate far beyond the training sequence. Node-RF (Neural ODE-based NeRF) overcomes this limitation by integrating Neural Ordinary Differential Equations (NODEs) with dynamic Neural Radiance Fields (NeRFs), enabling a continuous-time, spatiotemporal representation that generalizes beyond observed trajectories at constant memory cost. From visual input, Node-RF learns an implicit scene state that evolves over time via an ODE solver, propagating feature embeddings via differential calculus. A NeRF-based renderer interprets the calculated embeddings to synthesize arbitrary views for long-range extrapolation. Training on multiple motion sequences with shared dynamics allows for generalization to unseen conditions. Our experiments demonstrate that Node-RF can characterize abstract system behavior without an explicit model to identify critical points for future predictions.
Chinese Translation
从视觉观察中预测场景动态是一个具有挑战性的任务。现有方法仅在观察到的边界内捕捉动态,无法在训练序列之外进行外推。Node-RF(基于神经常微分方程的NeRF)通过将神经常微分方程(NODEs)与动态神经辐射场(NeRFs)结合,克服了这一限制,使得在恒定内存成本下能够实现超越观察轨迹的连续时间时空表示。Node-RF从视觉输入中学习一个隐式场景状态,该状态通过ODE求解器随时间演变,通过微分计算传播特征嵌入。基于NeRF的渲染器解释计算出的嵌入,以合成任意视角进行长距离外推。在多个共享动态的运动序列上进行训练,使其能够推广到未见条件。我们的实验表明,Node-RF能够表征抽象系统行为,而无需明确模型来识别未来预测的关键点。
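The core mechanism — a latent state advanced by an ODE solver, so the representation can be queried at any time, inside or far beyond the observed window, at constant memory — can be sketched with a hand-written vector field standing in for the learned dynamics network and a plain Euler solver standing in for the integrator.

```python
import numpy as np

def f(z):
    # Stand-in for the learned dynamics network: circular motion in a
    # 2-D latent space (Node-RF learns its field from visual input).
    return np.array([-z[1], z[0]])

def rollout(z0, t_end, dt=1e-3):
    # Euler ODE solver: advance the latent state to any target time.
    # Memory stays constant no matter how far t_end lies beyond the
    # observed trajectory.
    z, t = np.asarray(z0, dtype=float), 0.0
    while t < t_end:
        z = z + dt * f(z)
        t += dt
    return z

z0 = [1.0, 0.0]
z_seen = rollout(z0, np.pi / 2)    # within the training horizon
z_far = rollout(z0, 2 * np.pi)     # extrapolation: a full period later
```

Because the state is propagated by the shared field rather than memorized frame by frame, the same rollout code covers interpolation and long-range extrapolation; in the full system a NeRF renderer would decode each state into views.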
cs.CV / 97 / 2603.12083
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis
面向摄影摄像机的通用计算像差校正:全面基准分析
Abstract
Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.
Chinese Translation
当前流行的计算像差校正(CAC)方法通常针对特定光学系统进行调整,导致其泛化能力差,并且在新镜头上需要耗时的重新训练。开发能够在不同摄影镜头之间泛化的CAC范式为解决这些挑战提供了有希望的方案。然而,由于缺乏涵盖足够广泛光学像差的全面基准,消费者摄影领域内实现这种跨镜头的通用性仍处于早期阶段。此外,现有CAC方法受哪些具体因素影响以及这些因素如何影响其性能仍不明确。在本文中,我们展示了涉及24种图像恢复和CAC算法的全面实验和评估,利用我们新提出的UniCAC,这是通过自动光学设计构建的大规模摄影摄像机基准。我们引入了光学退化评估器(ODE)作为一个新颖框架,以客观评估CAC任务的难度,提供光学像差的可信量化,并实现可靠评估。通过我们的比较分析,我们识别出三个关键因素——先验的利用、网络架构和训练策略——对CAC性能的影响最为显著,并进一步探讨了它们各自的影响。我们相信我们的基准、数据集和观察为相关领域提供了基础性见解,并为未来的研究奠定了基础。基准、代码和Zemax文件将可在 https://github.com/XiaolongQian/UniCAC 获取。
cs.CV / 98 / 2603.12108
EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation
EvoTok:通过残差潜在演化实现视觉理解与生成的统一图像标记器
Abstract
The development of unified multimodal large language models (MLLMs) is fundamentally challenged by the granularity gap between visual understanding and generation: understanding requires high-level semantic abstractions, while image generation demands fine-grained pixel-level representations. Existing approaches usually enforce both forms of supervision on the same set of representations or decouple them into separate feature spaces, leading to interference and inconsistency, respectively. In this work, we propose EvoTok, a unified image tokenizer that reconciles these requirements through a residual evolution process within a shared latent space. Instead of maintaining separate token spaces for pixels and semantics, EvoTok encodes an image into a cascaded sequence of residual tokens via residual vector quantization. This residual sequence forms an evolution trajectory where earlier stages capture low-level details and deeper stages progressively transition toward high-level semantic representations. Despite being trained on a relatively modest dataset of 13M images, far smaller than the billion-scale datasets used by many previous unified tokenizers, EvoTok achieves a strong reconstruction quality of 0.43 rFID on ImageNet-1K at 256x256 resolution. When integrated with a large language model, EvoTok shows promising performance across 7 out of 9 visual understanding benchmarks, and remarkable results on image generation benchmarks such as GenEval and GenAI-Bench. These results demonstrate that modeling visual representations as an evolving trajectory provides an effective and principled solution for unifying visual understanding and generation.
Chinese Translation
统一多模态大型语言模型(MLLMs)的发展面临着视觉理解与生成之间的粒度差距的根本挑战:理解需要高层次的语义抽象,而图像生成则要求细粒度的像素级表示。现有方法通常在同一组表示上施加这两种监督,或在不同的特征空间中解耦这两种监督,分别导致干扰和不一致。在本研究中,我们提出了EvoTok,一种统一的图像标记器,通过共享潜在空间中的残差演化过程来调和这些要求。EvoTok并不为像素和语义维护独立的标记空间,而是通过残差向量量化将图像编码为一系列级联的残差标记。这个残差序列形成了一条演化轨迹,其中早期阶段捕捉低层次细节,而更深的阶段逐步过渡到高层次的语义表示。尽管仅在相对较小的1300万图像数据集上进行训练,远小于许多之前统一标记器使用的十亿级数据集,EvoTok在256x256分辨率下在ImageNet-1K上实现了0.43的强重建质量。当与大型语言模型集成时,EvoTok在9个视觉理解基准中的7个上表现出良好的性能,并在图像生成基准如GenEval和GenAI-Bench上取得了显著成果。这些结果表明,将视觉表示建模为演化轨迹提供了统一视觉理解与生成的有效且有原则的解决方案。
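The residual vector quantization underlying the cascaded token sequence fits in a few lines: each stage quantizes the residual left over by the previous stages, so early codes are coarse and later codes refine them. The tiny hand-made 2-D codebooks below are illustrative only — EvoTok's codebooks are learned, high-dimensional, and far larger.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stages, producing a cascaded code sequence."""
    residual, codes, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:                       # cb: (K, D) codebook
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return codes, recon

x = np.array([3.2, -1.1])
codebooks = [                                   # tiny hand-made codebooks
    np.array([[3.0, -1.0], [-3.0, 1.0], [0.0, 0.0]]),    # coarse stage
    np.array([[0.25, 0.0], [0.0, -0.1], [-0.25, 0.1]]),  # fine stage
]
codes, recon = rvq_encode(x, codebooks)
```

The coarse stage captures most of the vector and the fine stage quantizes what is left, so reconstruction error shrinks down the cascade — the structural property that lets the token sequence trade off pixel detail against deeper, more abstract stages.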
cs.CV / 99 / 2603.12126
Hoi3DGen: Generating High-Quality Human-Object-Interactions in 3D
Hoi3DGen:在三维中生成高质量的人类-物体交互
Abstract
Modeling and generating 3D human-object interactions from text is crucial for applications in AR, XR, and gaming. Existing approaches often rely on score distillation from text-to-image models, but their results suffer from the Janus problem and do not follow text prompts faithfully due to the scarcity of high-quality interaction data. We introduce Hoi3DGen, a framework that generates high-quality textured meshes of human-object interaction that follow the input interaction descriptions precisely. We first curate realistic and high-quality interaction data leveraging multimodal large language models, and then create a full text-to-3D pipeline, which achieves orders-of-magnitude improvements in interaction fidelity. Our method surpasses baselines by 4-15x in text consistency and 3-7x in 3D model quality, exhibiting strong generalization to diverse categories and interaction types, while maintaining high-quality 3D generation.
Chinese Translation
从文本建模和生成三维人类-物体交互对于增强现实(AR)、扩展现实(XR)和游戏等应用至关重要。现有方法通常依赖于从文本到图像模型的评分蒸馏,但由于高质量交互数据的稀缺,其结果受到雅努斯问题的影响,未能忠实地遵循文本提示。我们提出了Hoi3DGen,一个生成高质量人类-物体交互纹理网格的框架,能够精确遵循输入的交互描述。我们首先利用多模态大型语言模型策划真实且高质量的交互数据,然后创建一个完整的文本到三维的管道,实现了交互保真度的数量级提升。我们的方法在文本一致性上超越基线4-15倍,在三维模型质量上超越3-7倍,展现出对多样类别和交互类型的强泛化能力,同时保持高质量的三维生成。
cs.CV / 100 / 2603.12138
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
HATS:面向GUI代理的难度感知轨迹合成
Abstract
Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
Chinese Translation
由大型视觉-语言模型(VLMs)驱动的图形用户界面(GUI)代理在自动化数字任务方面展现出显著潜力,这突显了高质量轨迹数据在支持有效代理训练中的必要性。然而,现有的轨迹合成流程往往产生无法超越简单交互的代理。我们认为这一局限性源于对语义模糊动作的忽视,这些动作的含义依赖于上下文、序列或视觉的模糊性。这类动作对于现实世界的鲁棒性至关重要,但在当前数据集中代表性不足且处理不当,导致任务指令与执行之间的语义不一致。为了解决这些问题,我们提出了HATS,一个旨在减轻语义模糊影响的难度感知轨迹合成框架。我们将难度定义为与动作相关的语义模糊程度,并开发了两个互补模块:(1)基于难度的探索,指导数据收集朝向模糊但信息丰富的交互;(2)基于对齐的优化,迭代验证和修复指令与执行之间的对齐。这两个模块在闭环中运作:探索为优化提供具有挑战性的轨迹,而优化反馈则更新难度信号以指导未来的探索。大量实验表明,使用HATS训练的代理在基准GUI环境中始终优于最先进的基线。
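One plausible reading of the closed loop above is hardness-proportional sampling with a feedback update: exploration proposes interactions in proportion to a softmax over per-action ambiguity scores, and refinement feedback lowers the scores of interactions the agent now handles. The scores, temperature, and update rule below are invented for illustration, not HATS's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def explore(hardness, n=1000, temperature=1.0):
    # Hardness-driven exploration: sample interactions in proportion to a
    # softmax over their semantic-ambiguity (hardness) scores.
    h = np.asarray(hardness, float) / temperature
    p = np.exp(h - h.max())
    p /= p.sum()
    return rng.choice(len(p), size=n, p=p), p

hardness = np.array([0.1, 2.0, 0.5])   # action 1 is the most ambiguous
picks, p = explore(hardness)

# Closed loop: refinement feedback (e.g., a hypothetical repaired-alignment
# success rate) lowers the hardness of actions the agent now handles well,
# steering the next round of exploration elsewhere.
success = np.array([0.9, 0.2, 0.6])
hardness_next = hardness * (1.0 - success)
```

Data collection thus concentrates on ambiguous yet informative interactions while the hardness signal keeps moving as refinement repairs instruction-execution alignment.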
cs.CV / 101 / 2603.12144
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
O3N:全向开放词汇占用预测
Abstract
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
Chinese Translation
通过全向感知理解和重建三维世界是自主智能体和具身智能发展中的一个必然趋势。然而,现有的三维占用预测方法受到有限视角输入和预定义训练分布的限制,难以应用于需要全面且安全感知开放世界探索场景的具身智能体。为此,我们提出了O3N,这是第一个纯视觉的端到端全向开放词汇占用预测框架。O3N通过极螺旋Mamba(Polar-spiral Mamba, PsM)模块将全向体素嵌入极螺旋拓扑,能够在360°范围内实现连续的空间表示和长距离上下文建模。占用成本聚合(Occupancy Cost Aggregation, OCA)模块引入了一种统一几何和语义监督的原则机制,确保重建的几何形状与基础语义结构之间的一致性。此外,自然模态对齐(Natural Modality Alignment, NMA)建立了一条无梯度对齐路径,协调视觉特征、体素嵌入和文本语义,形成一致的“像素-体素-文本”表示三元组。在多个模型上的广泛实验表明,我们的方法不仅在QuadOcc和Human360Occ基准测试中实现了最先进的性能,还展现了显著的跨场景泛化能力和语义可扩展性,为通用三维世界建模铺平了道路。源代码将公开发布在https://github.com/MengfeiD/O3N。
cs.CV / 102 / 2603.12146
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
FlashMotion:基于轨迹引导的少步可控视频生成
Abstract
Recent advances in trajectory-controllable video generation have achieved remarkable progress. Previous methods mainly use adapter-based architectures for precise motion control along predefined trajectories. However, all these methods rely on a multi-step denoising process, leading to substantial time redundancy and computational overhead. While existing video distillation methods successfully distill multi-step generators into few-step, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. We first train a trajectory adapter on a multi-step video generator for precise trajectory control. Then, we distill the generator into a few-step version to accelerate video generation. Finally, we finetune the adapter using a hybrid strategy that combines diffusion and adversarial objectives, aligning it with the few-step generator to produce high-quality, trajectory-accurate videos. For evaluation, we introduce FlashBench, a benchmark for long-sequence trajectory-controllable video generation that measures both video quality and trajectory accuracy across varying numbers of foreground objects. Experiments on two adapter architectures show that FlashMotion surpasses existing video distillation methods and previous multi-step models in both visual quality and trajectory consistency.
Chinese Translation
近年来,轨迹可控视频生成取得了显著进展。以往的方法主要使用基于适配器的架构来沿预定义轨迹进行精确的运动控制。然而,所有这些方法都依赖于多步骤去噪过程,导致了显著的时间冗余和计算开销。尽管现有的视频蒸馏方法成功地将多步骤生成器蒸馏为少步骤生成器,但直接将这些方法应用于轨迹可控视频生成会导致视频质量和轨迹准确性明显下降。为了解决这一问题,我们提出了FlashMotion,一种新颖的训练框架,旨在实现少步骤的轨迹可控视频生成。我们首先在多步骤视频生成器上训练一个轨迹适配器,以实现精确的轨迹控制。然后,我们将生成器蒸馏为少步骤版本,以加速视频生成。最后,我们使用一种结合扩散和对抗目标的混合策略对适配器进行微调,使其与少步骤生成器对齐,从而生成高质量、轨迹准确的视频。为了评估,我们引入了FlashBench,这是一个用于长序列轨迹可控视频生成的基准,测量在不同数量前景物体下的视频质量和轨迹准确性。对两种适配器架构的实验表明,FlashMotion在视觉质量和轨迹一致性方面超越了现有的视频蒸馏方法和以往的多步骤模型。
cs.CV / 103 / 2603.12147
EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
EgoIntent:一个以自我为中心的步级基准,用于理解什么、为什么以及接下来做什么
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
Chinese Translation
多模态大型语言模型(MLLMs)在多种任务中展示了卓越的视频推理能力。然而,它们在以自我为中心的视频中对人类意图进行细粒度理解的能力在很大程度上仍未被探索。现有基准主要集中在情节级意图推理上,忽视了更细粒度的步级意图理解。然而,智能助手、机器人模仿学习和增强现实指导等应用不仅需要理解一个人在每一步在做什么,还需要理解为什么以及接下来会做什么,以便提供及时且上下文相关的支持。为此,我们引入了EgoIntent,这是一个针对以自我为中心视频的步级意图理解基准。它包含3,014个步骤,涵盖15种多样的室内和户外日常生活场景,并在三个互补维度上评估模型:局部意图(What)、全局意图(Why)和下一步计划(Next)。重要的是,每个片段在查询步骤的关键结果(例如,接触或抓取)发生之前立即被截断,并且不包含后续步骤的帧,从而防止未来帧泄漏,使对预期性步骤理解和下一步规划的评估更加干净。我们评估了15个MLLMs,包括最先进的闭源和开源模型。即使是表现最佳的模型,在三个意图维度上的平均得分也仅为33.31,这凸显了以自我为中心视频中的步级意图理解仍然是一个极具挑战性的问题,亟需进一步研究。
cs.CV / 104 / 2603.12149
Linking Perception, Confidence and Accuracy in MLLMs
多模态大型语言模型中感知、置信度与准确性的关联
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g., Planner, Critic, Voter) to schedule these modules and provide external verification. Our integrated framework establishes new state-of-the-art results with consistent 8.8% gains across four benchmarks. More ablation studies demonstrate the effectiveness of each module and scaling superiority.
Chinese Translation
近期多模态大型语言模型(MLLMs)的进展主要集中在增强视觉感知以提高准确性。然而,一个关键问题仍未得到探索:模型是否知道自己不知道?通过一项探测实验,我们揭示了MLLMs中存在严重的置信度失准问题。为了解决这一问题,我们提出了置信度驱动的强化学习(Confidence-Driven Reinforcement Learning, CDRL),该方法利用原始图像与加噪图像对以及一种新颖的基于置信度的奖励来增强感知敏感性并稳健地校准模型的置信度。除了训练收益外,经过校准的置信度还能够作为一种“免费午餐”实现更有效的测试时扩展。我们进一步提出了置信度感知测试时扩展(Confidence-Aware Test-Time Scaling, CA-TTS),该方法在置信度信号的引导下动态协调自我一致性、自我反思和视觉自检模块。一个专家模型承担多种角色(例如,规划者、评论者、投票者),以调度这些模块并提供外部验证。我们的集成框架在四个基准测试中实现了新的最先进结果,一致提升了8.8%。更多的消融研究展示了每个模块的有效性和扩展优势。
cs.CV / 105 / 2603.12155
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
GlyphBanana:通过自主工作流程推进精确文本渲染
Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
Chinese Translation
尽管生成模型的最新进展推动了文本渲染的显著进步,但准确生成复杂文本和数学公式仍然是一个巨大的挑战。这一困难主要源于当前模型在遇到分布外提示时指令遵循能力有限。为了解决这一问题,我们提出了GlyphBanana,并配套设计了一个专门用于渲染复杂字符和公式的基准。GlyphBanana采用了一种自主工作流程,集成辅助工具,将字形模板注入潜在空间和注意力图,从而促进生成图像的迭代优化。值得注意的是,我们的无训练方法可以无缝应用于各种文本到图像(Text-to-Image, T2I)模型,达到比现有基线更高的精度。大量实验证明了我们所提工作流程的有效性。相关代码已公开,地址为 https://github.com/yuriYanZeXuan/GlyphBanana。
cs.CV / 106 / 2603.12166
LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
LatentGeo:用于多模态几何推理的潜在空间可学习辅助构造
Abstract
Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
Chinese Translation
尽管多模态推理最近取得了进展,但表示辅助几何构造仍然是多模态大型语言模型(MLLMs)面临的一个基本挑战。这些构造在原始图示中缺失,必须在定理适用之前引入。现有方法主要依赖于显式构造范式,包括基于文本的几何规范、推理过程中的视觉符号交错以及工具增强的几何执行。然而,这些方法要么无法忠实地表示复杂的空间关系,要么在离散符号和连续几何结构之间产生表示不匹配,或者依赖于外部能力,从而阻碍了端到端优化。为了解决这些局限性,我们提出了LatentGeo,一个学习连续潜在视觉表示的框架,以在不进行像素级渲染或外部执行的情况下内化辅助几何构造。我们设计了一个三阶段的课程,逐步通过辅助视觉监督对这些潜在表示进行对齐和内化,随后采用LaGDPO,一种潜在感知的强化学习过程,在策略优化期间稳定潜在表示,同时提高最终任务的正确性。为了系统地评估以构造为中心的表示质量,我们引入了GeoAux,一个针对视觉依赖几何问题的新基准,并在GeoAux和MathVerse上进行了实验。结果表明,LatentGeo在几何推理任务上取得了显著的进展,特别是在需要辅助构造的任务上。大量分析和消融研究进一步验证了我们框架中每个组件的有效性。
cs.CV / 107 / 2603.12176
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
BehaviorVLM:统一的无微调行为理解与视觉-语言推理
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
Chinese Translation
理解自由移动动物的行为是神经科学的核心,其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而,这两项任务仍然严重依赖人工标注或不稳定的无监督流程,限制了可扩展性和可重复性。我们提出了BehaviorVLM,一个用于姿态估计和行为理解的统一视觉-语言框架,它通过详细、明确且可验证的推理步骤引导预训练的视觉-语言模型(VLMs),无需特定任务的微调,且仅需极少的人工标注。在姿态估计方面,我们利用以量子点为基准的行为数据,提出了一个集成时间、空间和跨视角推理的多阶段流程。这一设计大大减少了人工标注的工作量,通过几何检查(如重投影误差)暴露低置信度标签,并生成可供后续过滤、修正或用于微调下游姿态模型的标签。在行为理解方面,我们提出了一个流程,集成了用于过分割行为发现的深度嵌入聚类、基于VLM的逐段视频字幕生成,以及基于LLM的推理以合并并语义标注行为片段。该行为流程可直接基于视觉信息运行,无需关键点即可分割行为。这些组件共同实现了对多动物行为的可扩展、可解释和轻标注分析。
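The reprojection-error check mentioned above is a standard multi-view consistency test: project a candidate 3D keypoint into each camera and compare against the 2D labels; a large mean pixel error flags a low-confidence pseudo-label. The sketch below shows that generic check with toy pinhole cameras, not the paper's exact pipeline.

```python
import numpy as np

def reprojection_error(P_list, uv_list, X):
    """Mean pixel error of 3D point X (len-3) reprojected into each camera.
    P_list: 3x4 projection matrices; uv_list: observed (u, v) keypoints.
    A large value flags a low-confidence pseudo-label (generic sketch)."""
    Xh = np.append(X, 1.0)
    errs = []
    for P, uv in zip(P_list, uv_list):
        proj = P @ Xh
        proj = proj[:2] / proj[2]             # perspective divide
        errs.append(np.linalg.norm(proj - uv))
    return float(np.mean(errs))

# Two toy cameras; a geometrically consistent point yields near-zero error.
P0 = np.hstack([np.eye(3), np.zeros((3, 1))])              # reference camera
P1 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])  # shifted baseline
X = np.array([0.5, 0.2, 2.0])
uv0 = (P0 @ np.append(X, 1))[:2] / 2.0
uv1 = (P1 @ np.append(X, 1))[:2] / 2.0
print(reprojection_error([P0, P1], [uv0, uv1], X))  # ~0.0
```

Perturbing one observed keypoint immediately raises the error, which is what lets the pipeline expose labels to filter or correct.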
cs.CV / 108 / 2603.12208
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
ForensicZip:在取证视觉语言模型中,更多的标记更好但并非必要
Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10\% token retention, ForensicZip achieves $2.97\times$ speedup and over 90\% FLOPs reduction while maintaining state-of-the-art detection performance.
Chinese Translation
多模态大型语言模型(MLLMs)通过为伪造检测生成文本依据,实现可解释的多媒体取证。然而,处理密集的视觉序列会产生高计算成本,特别是对于高分辨率图像和视频。视觉标记修剪是一种实用的加速策略,但现有方法主要是语义驱动的,保留显著对象而丢弃背景区域,而操控痕迹(如高频异常和时间抖动)往往恰恰存在于这些区域。为了解决这个问题,我们提出了ForensicZip,这是一个无训练的框架,从伪造驱动的角度重新定义标记压缩。ForensicZip将时间标记演变建模为带有松弛虚拟节点的出生-死亡最优运输问题,量化指示瞬态生成伪影的物理不连续性。取证评分进一步将基于运输的新颖性与高频先验结合,以在大比例压缩下将取证证据与语义内容分离。在深度伪造和AIGC基准上的实验表明,在10%的标记保留率下,ForensicZip实现了$2.97\times$的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
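The birth-death transport idea can be approximated with a plain assignment problem: match each current-frame token to a previous-frame token, but also offer a slack "dummy" destination at fixed cost. Tokens that are cheaper to route to the dummy than to any real match have no temporal antecedent (a "birth"), which is exactly the physical discontinuity the scoring targets. This is a simplified assignment-based stand-in, not the paper's full OT formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def transport_novelty(prev, curr, slack=1.0):
    """Score each current token by its optimal matching cost to previous
    tokens, with n slack 'dummy' columns absorbing unmatched tokens.
    Higher score = more novel / more likely a transient artifact."""
    n, m = curr.shape[0], prev.shape[0]
    dist = np.linalg.norm(curr[:, None, :] - prev[None, :, :], axis=-1)
    cost = np.hstack([dist, np.full((n, n), slack)])  # append dummy columns
    rows, cols = linear_sum_assignment(cost)
    return np.where(cols < m, cost[rows, cols], slack)

prev = np.zeros((3, 4))
curr = np.vstack([np.zeros((3, 4)), np.full((1, 4), 10.0)])  # one outlier token
print(transport_novelty(prev, curr))  # outlier is routed to slack: score 1.0
```

The slack cost acts as the birth/death price: raising it makes the scorer more tolerant of large frame-to-frame feature jumps.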
cs.CV / 109 / 2603.12215
RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images
RDNet:面向光学遥感图像的区域比例感知动态自适应显著目标检测网络
Abstract
Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the SwinTransformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
Chinese Translation
在遥感图像中,显著目标检测(SOD)面临着由于目标尺寸的巨大变化、自注意力机制的计算成本以及基于卷积神经网络(CNN)的特征提取器在捕捉全局上下文和长距离依赖方面的局限性而带来的重大挑战。现有依赖固定卷积核的方法通常难以适应多样化的目标尺度,导致细节丢失或无关特征的聚合。为了解决这些问题,本研究旨在增强对尺度变化的鲁棒性并实现精确的目标定位。我们提出了区域比例感知动态自适应显著目标检测网络(RDNet),该网络用SwinTransformer替代了CNN主干以进行全局上下文建模,并引入了三个关键模块:(1)动态自适应细节感知(DAD)模块,通过目标区域比例引导应用不同的卷积核;(2)频率匹配上下文增强(FCE)模块,通过小波交互和注意力丰富上下文信息;(3)区域比例感知定位(RPL)模块,采用交叉注意力突出语义细节,并集成比例引导(PG)块以辅助DAD模块。通过结合这些模块,RDNet在尺度变化下实现了鲁棒性和准确的定位,与最先进的方法相比,提供了更优越的检测性能。
cs.CV / 110 / 2603.12217
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
基于验证者引导的伪标签的真实场景点跟踪
Abstract
Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
Chinese Translation
长期点跟踪模型通常在大型合成数据集上进行训练。由于真实视频特性不同且缺乏密集的真实标注,这些模型在真实视频上的性能会下降。在未标注视频上进行自训练已被探索作为一种实用的解决方案,但伪标签的质量在很大程度上依赖于教师模型的可靠性,而这种可靠性在不同的帧和场景中会有所变化。在本文中,我们解决了真实场景微调的问题,并引入了验证者(verifier),这是一个学习评估跟踪器预测可靠性并指导伪标签生成的元模型。给定来自多个预训练跟踪器的候选轨迹,验证者逐帧评估这些轨迹并选择最可信的预测,从而生成高质量的伪标签轨迹。在微调中应用时,验证者引导的伪标签显著提高了监督的质量,并实现了对未标注视频的数据高效适应。在四个真实场景基准上的广泛实验表明,我们的方法在所需数据量少于先前自训练方法的情况下,达到了最先进的结果。项目页面:https://kuis-ai.github.io/track_on_r
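The per-frame selection step can be reduced to a gather: for every frame, keep the trajectory point from whichever tracker the verifier scores highest. The sketch below assumes the verifier's per-frame reliability scores are already computed (here they are toy values); the learned verifier itself is the paper's contribution and is not reimplemented.

```python
import numpy as np

def fuse_pseudo_labels(trajectories, verifier_scores):
    """Per frame, keep the point from the most-trusted tracker.
    trajectories: (K, T, 2) candidate tracks from K pretrained trackers;
    verifier_scores: (K, T) per-frame reliability scores (stand-in values).
    Returns a (T, 2) fused pseudo-label track."""
    best = np.argmax(verifier_scores, axis=0)        # (T,) winning tracker
    return trajectories[best, np.arange(len(best))]  # gather per frame

tracks = np.stack([np.zeros((5, 2)), np.ones((5, 2))])   # two toy trackers
scores = np.array([[0.9, 0.9, 0.1, 0.1, 0.9],
                   [0.2, 0.2, 0.8, 0.8, 0.2]])
print(fuse_pseudo_labels(tracks, scores))
# frames 2-3 come from tracker 1 (ones), the rest from tracker 0 (zeros)
```

Because selection is per frame rather than per trajectory, a tracker that drifts mid-sequence only contaminates the frames where the verifier actually trusts it.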
cs.CV / 111 / 2603.12221
A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
用于面部情感表达识别的两阶段双模态模型
Abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
Chinese Translation
本文针对第十届真实场景下的情感行为分析(ABAW)研讨会与竞赛中的表情(EXPR)识别挑战,该挑战要求对来自无约束视频的八种面部情感表达进行帧级分类。由于面部定位不准确、大幅度的姿态和尺度变化、运动模糊、时间不稳定性以及相邻帧之间的其他干扰因素,这一任务具有挑战性。我们提出了一种两阶段双模态(音频-视觉)模型来应对这些困难。第一阶段专注于使用预训练的基于DINOv2的编码器进行稳健的视觉特征提取。具体而言,DINOv2 ViT-L/14被用作主干,采用了一种考虑填充的增强(PadAug)策略用于图像填充和原始视频的数据预处理,并引入了一种专家混合(MoE)训练头以增强分类器的多样性。第二阶段解决模态融合和时间一致性问题。对于视觉模态,从原始视频中以多种尺度重新裁剪面部,并将提取的视觉特征进行平均,以形成稳健的帧级表示。同时,从短音频窗口中提取帧对齐的Wav2Vec 2.0音频特征,以提供互补的声学线索。这些双模态特征通过一个轻量级的门控融合模块进行整合,随后进行推理时的时间平滑。在ABAW数据集上的实验表明了所提方法的有效性。该两阶段模型在官方验证集上获得了0.5368的Macro-F1分数,并在5折交叉验证下达到了0.5122 +/- 0.0277,优于官方基准。
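Inference-time temporal smoothing for frame-level classification is commonly a moving average over the per-frame logits before the argmax, which suppresses single-frame flickers between adjacent frames. The sketch below shows that generic form; the window size is an illustrative choice, and the paper's exact smoothing scheme may differ.

```python
import numpy as np

def smooth_logits(logits, window=5):
    """Moving-average smoothing of per-frame class logits, then argmax.
    logits: (T, C) array; returns (T,) smoothed frame-level predictions.
    Edge padding keeps the output length equal to the input length."""
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(logits, ((pad, pad), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, c], kernel, mode="valid")
         for c in range(logits.shape[1])],
        axis=1,
    )
    return smoothed.argmax(axis=1)

# A single-frame flicker to class 1 is outvoted by its neighbors.
logits = np.tile([2.0, 0.0], (9, 1))
logits[4] = [0.0, 2.1]
print(smooth_logits(logits))  # all zeros: the flicker is smoothed away
```

The same trick applies to any frame-level classifier output and costs only a 1D convolution per class.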
cs.CV / 112 / 2603.12222
HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers
HiAP:一种用于视觉变换器的多粒度随机自剪枝框架
Abstract
Vision Transformers require significant computational resources and memory bandwidth, severely limiting their deployment on edge devices. While recent structured pruning methods successfully reduce theoretical FLOPs, they typically operate at a single structural granularity and rely on complex, multi-stage pipelines with post-hoc thresholding to satisfy sparsity budgets. In this paper, we propose Hierarchical Auto-Pruning (HiAP), a continuous relaxation framework that discovers optimal sub-networks in a single end-to-end training phase without requiring manual importance heuristics or predefined per-layer sparsity targets. HiAP introduces stochastic Gumbel-Sigmoid gates at multiple granularities: macro-gates to prune entire attention heads and FFN blocks, and micro-gates to selectively prune intra-head dimensions and FFN neurons. By optimizing both levels simultaneously, HiAP addresses both the memory-bound overhead of loading large matrices and the compute-bound mathematical operations. HiAP naturally converges to stable sub-networks using a loss function that incorporates both structural feasibility penalties and analytical FLOPs. Extensive experiments on ImageNet demonstrate that HiAP organically discovers highly efficient architectures, and achieves a competitive accuracy-efficiency Pareto frontier for models like DeiT-Small, matching the performance of sophisticated multi-stage methods while significantly simplifying the deployment pipeline.
Chinese Translation
视觉变换器需要大量的计算资源和内存带宽,严重限制了它们在边缘设备上的部署。尽管最近的结构化剪枝方法成功地减少了理论浮点运算量(FLOPs),但它们通常在单一结构粒度下操作,并依赖复杂的多阶段管道和事后阈值设定来满足稀疏预算。本文提出了分层自剪枝(HiAP),这是一种连续松弛框架,可以在单个端到端训练阶段发现最佳子网络,而无需手动重要性启发式或预定义的每层稀疏目标。HiAP在多个粒度上引入了随机Gumbel-Sigmoid门:宏门用于剪枝整个注意力头和前馈网络(FFN)块,微门用于选择性剪枝头内部维度和FFN神经元。通过同时优化这两个层次,HiAP解决了加载大矩阵的内存限制开销和计算密集型数学运算的问题。HiAP通过一个同时考虑结构可行性惩罚和分析FLOPs的损失函数,自然收敛到稳定的子网络。在ImageNet上的大量实验表明,HiAP有机地发现了高效的架构,并为像DeiT-Small这样的模型实现了具有竞争力的准确性-效率帕累托前沿,匹配了复杂多阶段方法的性能,同时显著简化了部署管道。
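A Gumbel-Sigmoid gate relaxes a binary keep/prune decision: logistic noise is added to a learnable gate logit and squashed by a temperature-controlled sigmoid, with a hard threshold used in the forward pass (straight-through style). The sketch below shows that generic mechanism in NumPy under those assumptions; HiAP's exact parameterization, macro/micro gate wiring, and loss terms are not reproduced.

```python
import numpy as np

def gumbel_sigmoid_gate(logits, tau=1.0, rng=None):
    """Stochastic relaxed binary gates. Returns (soft, hard):
    soft is the noisy sigmoid relaxation (differentiable surrogate),
    hard is the thresholded keep/prune decision used in the forward pass.
    Lower tau pushes soft gates toward 0/1."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(logits))
    noise = np.log(u) - np.log1p(-u)          # Logistic(0, 1) sample
    soft = 1.0 / (1.0 + np.exp(-(np.asarray(logits) + noise) / tau))
    hard = (soft > 0.5).astype(float)         # straight-through hard gate
    return soft, hard

# One gate per attention head (macro) or per neuron (micro).
soft, hard = gumbel_sigmoid_gate(np.array([4.0, -4.0, 0.0]), tau=0.5, rng=0)
print(soft, hard)
```

In a full pipeline the expected gate openness would feed an analytical-FLOPs penalty, so sparsity emerges from the loss rather than from post-hoc thresholding.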
cs.CV / 113 / 2603.12238
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
SceneAssistant:一种用于开放词汇3D场景生成的视觉反馈代理
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant
Chinese Translation
基于自然语言的文本到3D场景生成在数字内容创作中极具价值。然而,现有方法在很大程度上受限于特定领域或依赖于预定义的空间关系,限制了其在无约束、开放词汇3D场景合成中的能力。本文介绍了SceneAssistant,一种面向开放词汇3D场景生成的视觉反馈驱动代理。我们的框架利用现代3D物体生成模型以及视觉-语言模型(Vision-Language Models, VLMs)的空间推理和规划能力。为了实现开放词汇场景构建,我们为VLM提供了一套全面的原子操作(例如,缩放、旋转、聚焦)。在每个交互步骤中,VLM接收渲染的视觉反馈并据此采取行动,迭代地优化场景,以实现更连贯的空间排列和更好地与输入文本对齐。实验结果表明,我们的方法能够生成多样化、开放词汇和高质量的3D场景。定性分析和定量人类评估均表明我们的方法优于现有方法。此外,我们的方法允许用户根据自然语言命令指导代理编辑现有场景。我们的代码可在 https://github.com/ROUJINN/SceneAssistant 获取。
cs.CV / 114 / 2603.12240
BiGain: Unified Token Compression for Joint Generation and Classification
BiGain:用于联合生成和分类的统一令牌压缩
Abstract
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
Chinese Translation
扩散模型的加速方法(例如,令牌合并或下采样)通常在减少计算的情况下优化合成质量,但往往忽视了区分能力。我们重新审视令牌压缩,提出了BiGain,这是一种无训练、即插即用的框架,能够在加速的扩散模型中保持生成质量的同时改善分类性能。我们的关键见解是频率分离:将特征空间信号映射为频率感知表示,能够将细节与全局语义解耦,从而实现尊重生成保真度和区分效用的压缩。BiGain通过两个频率感知操作体现了这一原则:(1)拉普拉斯门控令牌合并,鼓励在光谱平滑的令牌之间进行合并,同时抑制高对比度令牌的合并,从而保留边缘和纹理;(2)插值-外推键值下采样,通过在最近邻和平均池化之间进行可控的插值外推来下采样键/值,同时保持查询不变,从而保持注意力精度。在基于DiT和U-Net的骨干网络以及ImageNet-1K、ImageNet-100、Oxford-IIIT Pets和COCO-2017上,我们的操作一致改善了基于扩散的分类的速度-准确性权衡,同时在可比加速下保持或提高生成质量。例如,在ImageNet-1K上,使用Stable Diffusion 2.0进行70%的令牌合并,BiGain将分类准确率提高了7.15%,同时FID改善了0.34(1.85%)。我们的分析表明,平衡的光谱保留,保留高频细节和低/中频语义,是扩散模型中令牌压缩的可靠设计原则。据我们所知,BiGain是第一个在加速扩散下联合研究和推进生成与分类的框架,支持更低成本的部署。
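The "controllable interextrapolation between nearest and average pooling" for keys/values can be read as a blend out = avg + alpha * (nearest - avg): alpha=0 gives pure average pooling, alpha=1 pure nearest, and alpha>1 extrapolates beyond nearest. The blend form below is our reading of that description, not BiGain's verified implementation.

```python
import numpy as np

def interextrapolate_pool(kv, stride=2, alpha=0.5):
    """Downsample key/value tokens by blending nearest-neighbor and
    average pooling: out = avg + alpha * (nearest - avg).
    kv: (T, C) token sequence; T must be divisible by stride.
    Queries would be left untouched, preserving attention precision."""
    T, C = kv.shape
    groups = kv.reshape(T // stride, stride, C)
    avg = groups.mean(axis=1)
    nearest = groups[:, 0]                  # first token of each group
    return avg + alpha * (nearest - avg)

kv = np.arange(8, dtype=float).reshape(4, 2)
print(interextrapolate_pool(kv, stride=2, alpha=0.0))  # pure average pooling
print(interextrapolate_pool(kv, stride=2, alpha=1.0))  # pure nearest pooling
```

Tuning alpha trades the low-frequency stability of averaging against the high-frequency retention of nearest sampling, matching the paper's balanced-spectral-retention rule.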
cs.CV / 115 / 2603.12245
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
一个模型,多种预算:用于扩散变换器的弹性潜在接口
Abstract
Diffusion transformers (DiTs) achieve high generative quality but lock FLOPs to image resolution, limiting principled latency-quality trade-offs, and allocate computation uniformly across input spatial tokens, wasting resource allocation to unimportant regions. We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks can operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By training with random dropping of tail latents, ELIT learns to produce importance-ordered representations with earlier latents capturing global structure while later ones contain information to refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, ELIT delivers an average gain of $35.3\%$ and $39.6\%$ in FID and FDD scores. Project page: https://snap-research.github.io/elit/
Chinese Translation
扩散变换器(DiTs)实现了高生成质量,但将浮点运算量(FLOPs)锁定在图像分辨率上,限制了原则性的延迟-质量权衡,并且在输入空间标记之间均匀分配计算,将资源浪费在不重要的区域上。我们提出了弹性潜在接口变换器(Elastic Latent Interface Transformer, ELIT),这是一种兼容DiT的可插拔机制,能够将输入图像大小与计算解耦。我们的方法插入了一个潜在接口,即一个可学习的可变长度标记序列,标准变换器模块可以在其上操作。轻量级的读(Read)和写(Write)交叉注意力层在空间标记和潜在标记之间传递信息,并优先处理重要的输入区域。通过在训练中随机丢弃尾部潜在标记,ELIT学会生成按重要性排序的表示:靠前的潜在标记捕捉全局结构,而靠后的则包含精炼细节的信息。在推理时,潜在标记的数量可以动态调整以匹配计算约束。ELIT的设计刻意保持简约,仅增加了两个交叉注意力层,同时保持修正流目标和DiT堆栈不变。在不同的数据集和架构(DiT、U-ViT、HDiT、MM-DiT)上,ELIT均带来了一致的性能提升。在ImageNet-1K 512px上,ELIT在FID和FDD分数上分别平均提升了$35.3\%$和$39.6\%$。项目页面:https://snap-research.github.io/elit/
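The importance ordering comes from a simple training trick: at each step, keep only a random-length prefix of the latent sequence, so earlier latents are forced to carry global structure while later ones can only refine detail. The sketch below shows that truncation step in isolation, under the assumption that truncation is a uniform prefix keep; the Read/Write cross-attention layers around it are not reproduced.

```python
import numpy as np

def drop_tail_latents(latents, rng=None):
    """Training-time truncation for importance-ordered latents: sample a
    keep-length k and return only the first k latents. At inference, k
    would instead be set directly from the compute budget."""
    rng = np.random.default_rng(rng)
    k = int(rng.integers(1, latents.shape[0] + 1))  # keep first k latents
    return latents[:k]

latents = np.random.rand(16, 64)          # (num_latents, dim)
kept = drop_tail_latents(latents, rng=0)
print(kept.shape)                         # (k, 64) for some 1 <= k <= 16
```

Because the model must produce a usable image from any prefix, the latent count becomes a free inference-time knob for the latency-quality trade-off.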
cs.CV / 116 / 2603.12247
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
信任你的评论者:用于忠实图像编辑和生成的稳健奖励建模与强化学习
Abstract
Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at https://firm-reward.github.io.
Chinese Translation
强化学习(RL)已成为增强图像编辑和文本到图像(T2I)生成的有前景的范式。然而,目前在RL中充当评论者的奖励模型常常出现幻觉并给出带噪声的评分,从本质上误导了优化过程。本文提出了FIRM(忠实图像奖励建模),这是一个旨在开发稳健奖励模型的综合框架,为忠实的图像生成和编辑提供准确可靠的指导。首先,我们设计了定制的数据策划管道,以构建高质量的评分数据集。具体而言,我们从执行和一致性两方面评估编辑,而生成主要通过指令遵循进行评估。利用这些管道,我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集,并训练了准确反映这些标准的专门奖励模型(FIRM-Edit-8B和FIRM-Gen-8B)。其次,我们引入了FIRM-Bench,这是一个专门为编辑和生成评论者设计的综合基准。评估表明,我们的模型在与人类判断的一致性方面优于现有指标。此外,为了将这些评论者无缝集成到RL管道中,我们制定了一种新颖的“基础+奖励”(Base-and-Bonus)策略,以平衡相互竞争的目标:用于编辑的一致性调制执行(CME)和用于生成的质量调制对齐(QMA)。在这一框架的支持下,我们的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验表明,FIRM减轻了幻觉,在保真度和指令遵循方面确立了超越现有通用模型的新标准。我们的所有数据集、模型和代码已公开发布在https://firm-reward.github.io。
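One plausible shape for a Base-and-Bonus reward with Consistency-Modulated Execution: consistency with the source image supplies a base reward, and the instruction-execution score is only paid out as a bonus in proportion to consistency, so an edit that follows the instruction but destroys the source cannot score highly. The weights and the exact form below are our assumptions for illustration, not the paper's formula.

```python
def cme_reward(execution, consistency, base_w=0.5):
    """Hypothetical Base-and-Bonus reward (inputs in [0, 1]):
    base  = base_w * consistency            -- preserve the source image
    bonus = (1 - base_w) * execution * consistency
    The bonus is gated by consistency, so execution alone cannot win."""
    base = base_w * consistency
    bonus = (1.0 - base_w) * execution * consistency
    return base + bonus

print(cme_reward(execution=1.0, consistency=0.2))  # instruction followed, source wrecked: 0.2
print(cme_reward(execution=1.0, consistency=1.0))  # followed and faithful: 1.0
```

The multiplicative gating is what makes the two objectives "competing": neither score can be maximized by sacrificing the other.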
cs.CV / 117 / 2603.12250
DVD: Deterministic Video Depth Estimation with Generative Priors
DVD:具有生成先验的确定性视频深度估计
Abstract
Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.
Chinese Translation
现有的视频深度估计面临一个基本的权衡:生成模型遭受随机几何幻觉和尺度漂移,而判别模型则需要大量标注数据集以解决语义模糊。为了打破这一僵局,我们提出了DVD,这是第一个将预训练视频扩散模型确定性地转化为单次深度回归器的框架。具体而言,DVD具有三个核心设计:(i)将扩散时间步重新利用为结构锚点,以平衡全局稳定性与高频细节;(ii)潜在流形校正(Latent Manifold Rectification, LMR)以减轻回归引起的过度平滑,强制施加微分约束以恢复清晰的边界和连贯的运动;(iii)全局仿射一致性,这是一种固有属性,限制了窗口间的发散,使得无需复杂的时间对齐即可实现无缝的长视频推断。大量实验表明,DVD在各基准测试中实现了最先进的零样本性能。此外,DVD成功地利用比领先基线少163倍的任务特定数据,解锁了视频基础模型中隐含的深刻几何先验。值得注意的是,我们全面发布了我们的管道,提供了用于最先进视频深度估计的完整训练套件,以惠及开源社区。
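DVD's global affine coherence property bounds inter-window divergence, so per-window depth predictions can be stitched into a long video without complex temporal alignment. As an illustration only (not the paper's actual procedure), a minimal scale-and-shift alignment between overlapping depth windows might look like this; `fit_affine` and `stitch_windows` are hypothetical names:

```python
import numpy as np

def fit_affine(pred, ref):
    # least-squares scale s and shift t so that s * pred + t ≈ ref
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return s, t

def stitch_windows(windows, overlap):
    # align each depth window to the previous one on their shared overlap
    aligned = [np.asarray(windows[0], dtype=float)]
    for w in windows[1:]:
        w = np.asarray(w, dtype=float)
        s, t = fit_affine(w[:overlap], aligned[-1][-overlap:])
        aligned.append(s * w + t)
    return aligned
```

Because each window is affine-consistent with its predecessor, a known scale/shift distortion on a later window is removed exactly by the overlap fit.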
cs.CV / 118 / 2603.12252
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
EndoCoT:在扩散模型中扩展内生链式思维推理
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) the MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这一范式存在两个关键限制:(i)MLLMs文本编码器的推理深度不足。单步编码无法激活链式思维过程,而这一过程对于MLLMs提供复杂任务的准确指导至关重要。(ii)在解码过程中指导保持不变。在解码过程中不变的指导阻止了DiT逐步将复杂指令分解为可操作的去噪步骤,即使有正确的MLLM编码。为此,我们提出了内生链式思维(EndoCoT),这是一个新颖的框架,首先通过迭代思维指导模块迭代地细化潜在思维状态,从而激活MLLM的推理潜力,然后将这些状态与DiT的去噪过程连接起来。其次,应用终端思维基础模块,以确保推理轨迹在文本监督中保持基础,通过将最终状态与真实答案对齐。通过这两个组件,MLLM文本编码器提供经过深思熟虑的指导,使DiT能够逐步执行并最终以逐步的方式解决复杂任务。在多个基准(如迷宫、旅行商问题、可视化空间问题和数独)上的广泛评估中,平均准确率达到92.1%,比最强基线高出8.3个百分点。
cs.CV / 119 / 2603.12254
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
注意力之前的注意:通过自回归注视实现高效且可扩展的视频理解
Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
Chinese Translation
多模态大型语言模型(MLLMs)在通用视频理解方面取得了进展,但在处理长时间、高分辨率视频时却面临困难——它们在视觉变换器(ViTs)或大型语言模型(LLMs)中对每个像素的处理是平等的,尽管存在显著的时空冗余。我们提出了AutoGaze,一个轻量级模块,在通过ViT或MLLM处理之前去除冗余补丁。AutoGaze通过下一个标记预测和强化学习进行训练,自回归地选择一组最小的多尺度补丁,这些补丁能够在用户指定的误差阈值内重构视频,消除冗余同时保留信息。实证结果表明,AutoGaze将视觉标记减少了4倍至100倍,并使ViTs和MLLMs的速度提高了最多19倍,从而使MLLMs能够扩展到1K帧4K分辨率的视频,并在视频基准测试中取得了优异的结果(例如,在VideoMME上达到67.0%)。此外,我们还介绍了HLVid:第一个高分辨率、长格式视频问答基准,包含5分钟的4K分辨率视频,其中使用AutoGaze扩展的MLLM在基线之上提高了10.1%,并超越了之前最佳的MLLM 4.5%。项目页面:https://autogaze.github.io/
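AutoGaze's stated objective, selecting a minimal patch set that reconstructs the video within a user-specified error threshold, can be mimicked by a toy greedy loop over flattened patches. The real module is a learned autoregressive selector; this sketch only illustrates the stopping criterion, and `select_patches` is a hypothetical name:

```python
import numpy as np

def select_patches(patches, max_err):
    # patches: (num_patches, patch_dim). Greedily keep patches until the
    # mean squared reconstruction error (dropped patches are replaced by
    # the global mean patch) falls below the user-specified threshold.
    mean_patch = patches.mean(axis=0)
    recon = np.tile(mean_patch, (len(patches), 1))
    kept = []
    while ((recon - patches) ** 2).mean() > max_err:
        residual = ((recon - patches) ** 2).mean(axis=1)
        i = int(residual.argmax())        # most poorly reconstructed patch
        kept.append(i)
        recon[i] = patches[i]             # keeping a patch reproduces it exactly
    return kept, recon
```

With a loose threshold only the one outlier patch is kept; tightening `max_err` keeps progressively more, mirroring the quality-cost dial the abstract describes.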
cs.CV / 120 / 2603.12255
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Spatial-TTT:基于视觉的空间智能流媒体与测试时训练
Abstract
Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
Chinese Translation
人类通过一系列视觉观察来感知和理解现实世界的空间。因此,从潜在无限的视频流中流式维护和更新空间证据的能力对于空间智能至关重要。核心挑战不仅在于更长的上下文窗口,而在于如何选择、组织和保留空间信息。本文提出了Spatial-TTT,旨在通过测试时训练(TTT)实现基于视觉的空间智能流媒体,该方法调整一部分参数(快速权重)以捕捉和组织长时间场景视频中的空间证据。具体而言,我们设计了一种混合架构,并采用与滑动窗口注意力并行的大块更新,以实现高效的空间视频处理。为了进一步促进空间意识,我们引入了一种应用于TTT层的空间预测机制,结合3D时空卷积,鼓励模型捕捉帧之间的几何对应关系和时间连续性。除了架构设计外,我们构建了一个具有密集3D空间描述的数据集,指导模型更新其快速权重,以结构化的方式记忆和组织全球3D空间信号。大量实验表明,Spatial-TTT提高了长时间空间理解,并在视频空间基准测试中达到了最先进的性能。项目页面:https://liuff19.github.io/Spatial-TTT。
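The test-time-training idea in Spatial-TTT, adapting a small set of fast weights on the incoming stream with a self-supervised spatial-predictive loss, can be sketched in miniature. The assumed loss here (linear next-frame-feature prediction) merely stands in for the paper's 3D spatiotemporal-convolution mechanism, and `ttt_update` is a hypothetical name:

```python
import numpy as np

def ttt_update(fast_w, chunk, lr=0.1):
    # one large-chunk test-time update: nudge the fast weights W to predict
    # the next frame feature from the current one, i.e. gradient descent on
    # the self-supervised loss mean ||x_t @ W - x_{t+1}||^2
    x, y = chunk[:-1], chunk[1:]
    grad = 2.0 * x.T @ (x @ fast_w - y) / len(x)
    return fast_w - lr * grad
```

Each chunk of the stream nudges the fast weights, so the self-supervised loss on that chunk decreases while the slow weights stay frozen.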
cs.CV / 121 / 2603.12257
DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning
DreamVideo-Omni:基于潜在身份强化学习的全动控多主体视频定制
Abstract
While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.
Chinese Translation
尽管大规模扩散模型已经彻底改变了视频合成,但在多主体身份和多粒度运动的精确控制方面仍然面临重大挑战。最近的尝试虽然试图填补这一空白,但往往受到运动粒度有限、控制模糊和身份退化的影响,导致在身份保持和运动控制方面的表现不尽如人意。在本研究中,我们提出了DreamVideo-Omni,一个统一框架,通过渐进的两阶段训练范式,实现和谐的多主体定制与全动控。在第一阶段,我们整合了全面的控制信号进行联合训练,包括主体外观、全局运动、局部动态和相机运动。为了确保稳健和精确的可控性,我们引入了一种条件感知的3D旋转位置嵌入,以协调异构输入,并采用分层运动注入策略来增强全局运动指导。此外,为了解决多主体模糊问题,我们引入了组和角色嵌入,将运动信号明确锚定到特定身份,有效地将复杂场景解缠成独立的可控实例。在第二阶段,为了减轻身份退化,我们设计了一种潜在身份奖励反馈学习范式,通过在预训练的视频扩散骨干上训练潜在身份奖励模型。这在潜在空间中提供了运动感知的身份奖励,优先考虑与人类偏好一致的身份保持。在我们精心策划的大规模数据集和全面的DreamOmni基准的支持下,DreamVideo-Omni在生成高质量视频和精确可控性方面表现出色。
cs.CV / 122 / 2603.12262
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
视频流思维:视频大型语言模型可以同时观看和思考
Abstract
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking-while-watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.
Chinese Translation
在线视频大型语言模型(VideoLLMs)在支持响应式实时交互中发挥着关键作用。现有方法专注于流媒体感知,缺乏同步的逻辑推理流。然而,直接应用测试时扩展方法会导致不可接受的响应延迟。为了解决这一权衡,我们提出了视频流思维(Video Streaming Thinking, VST),一种用于流媒体视频理解的新范式。它支持在观看时思考的机制,在流媒体播放过程中激活对输入视频片段的推理。这一设计在保持实时响应性的同时,通过在视频播放过程中摊销大型语言模型(LLM)推理延迟,改善了及时理解和连贯认知。此外,我们引入了一个全面的后训练管道,整合了VST-SFT,该方法结构性地将离线VideoLLM适配为因果流媒体推理,以及VST-RL,通过在多轮视频交互环境中的自我探索提供端到端的改进。此外,我们设计了一个自动化训练数据合成管道,利用视频知识图谱生成高质量的流媒体问答对,并通过基于实体-关系的流媒体思维链(Chain-of-Thought)来强化多证据推理和对视频流的持续关注。广泛的评估表明,VST-7B在在线基准测试中表现出色,例如在StreamingBench上达到79.5%和在OVO-Bench上达到59.3%。同时,VST在离线长篇或推理基准测试中仍具竞争力。与Video-R1相比,VST的响应速度快15.7倍,并在VideoHolmes上实现了+5.4%的提升,展示了在多样化视频理解任务中的更高效率和强泛化能力。代码、数据和模型将发布在https://github.com/1ranGuan/VST。
cs.CV / 123 / 2603.12264
GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
GRADE:图像编辑中基于学科的推理基准测试
Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.
Chinese Translation
统一的多模态模型旨在实现联合理解、推理和生成,但当前的图像编辑基准主要局限于自然图像和浅层常识推理,无法在结构化的领域特定约束下充分评估这一能力。在本研究中,我们引入了GRADE,这是第一个评估图像编辑中基于学科的知识和推理的基准。GRADE包含520个经过精心挑选的样本,涵盖10个学术领域,从自然科学到社会科学。为了支持严格的评估,我们提出了一种多维评估协议,联合评估学科推理、视觉一致性和逻辑可读性。在对20个最先进的开源和闭源模型进行的广泛实验中,揭示了当前模型在隐性、知识密集型编辑环境下的显著局限性,导致了性能差距的扩大。除了定量评分外,我们还进行了严格的分析和消融实验,以揭示模型的不足并识别学科编辑中的约束。总的来说,GRADE明确了未来统一多模态模型发展的关键方向,推动了基于学科的图像编辑和推理研究。我们的基准和评估代码已公开发布。
cs.CV / 124 / 2603.12265
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
OmniStream:在连续流中掌握感知、重建与行动
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
Chinese Translation
现代视觉智能体需要具有普遍性、因果性和物理结构的表征,以便在实时流环境中操作。然而,当前的视觉基础模型仍然存在碎片化,专注于图像语义感知、离线时间建模或空间几何。本文介绍了OmniStream,一种统一的流媒体视觉骨干网络,能够有效地从多样的视觉输入中感知、重建和行动。通过结合因果时空注意力和三维旋转位置嵌入(3D-RoPE),我们的模型支持通过持久的KV缓存进行高效的逐帧在线视频流处理。我们使用协同多任务框架对OmniStream进行预训练,该框架结合了静态和时间表征学习、流媒体几何重建以及视觉-语言对齐,涵盖了29个数据集。广泛的评估显示,即使在严格冻结的骨干网络下,OmniStream在图像和视频探测、流媒体几何重建、复杂视频和空间推理以及机器人操作(训练时未见)方面,始终保持与专业专家的竞争性表现。我们的工作并不追求基准特定的主导地位,而是展示了训练一个通用的、多功能的视觉骨干网络的可行性,该网络能够在语义、空间和时间推理中进行泛化,即为互动和具身智能体的通用视觉理解迈出了更有意义的一步。
cs.CV / 125 / 2603.12266
MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
MM-CondChain:一种程序验证的视觉基础深度组合推理基准
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
Chinese Translation
多模态大型语言模型(MLLMs)越来越多地用于执行视觉工作流程,例如导航图形用户界面(GUIs),其中下一步依赖于经过验证的视觉组合条件(例如,"如果出现权限对话框且界面颜色为绿色,则点击允许"),并且该过程可能会分支或提前终止。然而,这种能力仍然未得到充分评估:现有基准侧重于浅层组合或独立约束,而不是深层链式组合条件。在本文中,我们介绍了MM-CondChain,这是一个用于视觉基础深度组合推理的基准。每个基准实例被组织为一个多层推理链,其中每一层包含一个基于视觉证据的非平凡组合条件,并由多个对象、属性或关系构成。为了正确回答,MLLM必须详细感知图像,在每一步对多个视觉元素进行推理,并遵循由此产生的执行路径以达到最终结果。为了可扩展地构建这种工作流程风格的数据,我们提出了一种代理合成管道:一个规划器协调组合条件的逐层生成,而一个可验证的程序中间表示(VPIR)确保每一层的条件是机械可验证的。然后,一个合成器将这些经过验证的层组装成完整的指令。利用这个管道,我们在三个视觉领域构建了基准:自然图像、数据图表和GUI轨迹。在一系列MLLM上进行的实验表明,即使是最强大的模型也仅获得53.33的路径F1,性能在困难的负样本上急剧下降,并随着深度或谓词复杂性的增加而进一步下降,这确认了深度组合推理仍然是一个基本挑战。
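The benchmark's Verifiable Programmatic Intermediate Representation makes each layer's compositional condition mechanically checkable. A minimal sketch, assuming each condition is a plain predicate over a scene dictionary (the names `vpir_and` and `run_chain` are hypothetical, not the benchmark's API), is:

```python
def vpir_and(*preds):
    # a compositional condition is the conjunction of verifiable sub-predicates
    return lambda scene: all(p(scene) for p in preds)

def run_chain(scene, layers):
    # layers: list of (condition, action); the chain may terminate early
    # when a layer's condition is not satisfied by the scene
    path = []
    for cond, action in layers:
        if not cond(scene):
            return path, "terminate"
        path.append(action)
    return path, "done"
```

Using the abstract's own example, a layer whose condition is "a permission dialog appears and the interface is green" executes "click Allow" only when both sub-predicates hold.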
cs.CV / 126 / 2603.12267
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
EVATok:用于高效视觉自回归生成的自适应长度视频标记化
Abstract
Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce $\textbf{EVATok}$, a framework to produce $\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers. Our framework estimates optimal token assignments for each video to achieve the best quality-cost trade-off, develops lightweight routers for fast prediction of these optimal assignments, and trains adaptive tokenizers that encode videos based on the assignments predicted by routers. We demonstrate that EVATok delivers substantial improvements in efficiency and overall quality for video reconstruction and downstream AR generation. Enhanced by our advanced training recipe that integrates video semantic encoders, EVATok achieves superior reconstruction and state-of-the-art class-to-video generation on UCF-101, with at least 24.4% savings in average token usage compared to the prior state-of-the-art LARP and our fixed-length baseline.
Chinese Translation
自回归(AR)视频生成模型依赖于视频标记器,将像素压缩为离散的标记序列。这些标记序列的长度对于平衡重建质量与下游生成计算成本至关重要。传统的视频标记器在不同视频的时间块上应用统一的标记分配,常常在简单、静态或重复的片段上浪费标记,而对动态或复杂的片段则分配不足。为了解决这一低效问题,我们提出了$\textbf{EVATok}$,一个用于生成$\textbf{E}$fficient $\textbf{V}$ideo $\textbf{A}$daptive $\textbf{Tok}$enizers的框架。我们的框架为每个视频估计最佳的标记分配,以实现最佳的质量-成本权衡,开发轻量级路由器以快速预测这些最佳分配,并训练自适应标记器,根据路由器预测的分配对视频进行编码。我们证明了EVATok在视频重建和下游AR生成方面显著提高了效率和整体质量。通过我们集成视频语义编码器的先进训练方案,EVATok在UCF-101上实现了优越的重建效果和最先进的类到视频生成,与之前的最先进的LARP和我们的固定长度基线相比,平均标记使用量至少节省了24.4%。
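EVATok's core idea, giving fewer tokens to static or repetitive temporal blocks and more to dynamic ones, can be illustrated with a simple proportional budgeter. The real framework estimates optimal assignments and trains routers to predict them; this sketch just splits a fixed budget by a motion score, and `assign_tokens` is a hypothetical name:

```python
import numpy as np

def assign_tokens(block_motion, total_tokens, min_tokens=1):
    # distribute the token budget across temporal blocks in proportion
    # to a per-block motion/complexity score, with a per-block floor
    m = np.asarray(block_motion, dtype=float)
    weights = m / m.sum() if m.sum() > 0 else np.full(len(m), 1.0 / len(m))
    alloc = np.maximum(min_tokens, np.floor(weights * total_tokens).astype(int))
    order = list(np.argsort(-m))          # blocks from most to least dynamic
    while alloc.sum() > total_tokens:     # the floor pushed us over budget
        j = order[-1]                     # trim the least dynamic block
        if alloc[j] > min_tokens:
            alloc[j] -= 1
        else:
            order.pop()
    i = 0
    while alloc.sum() < total_tokens:     # hand leftovers to dynamic blocks
        alloc[order[i % len(order)]] += 1
        i += 1
    return alloc
```

A fully static block collapses to the floor of one token, while the budget it would have wasted flows to the most dynamic blocks.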
cs.AI / 1 / 2603.11076
DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
DIVE:在可推广工具使用的自主任务合成中扩展多样性
Abstract
Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection--Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves average performance by +22 points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4x less data.
Chinese Translation
近期的研究合成了用于后训练工具使用的大型语言模型(LLMs)的自主任务,但在任务和工具集变化下保持稳健的泛化仍然是一个未解的挑战。我们将这种脆弱性归因于合成任务的多样性不足。扩展多样性是困难的,因为训练要求任务保持可执行和可验证,而泛化则要求覆盖多样的工具类型、工具集组合和异质的工具使用模式。我们提出了DIVE,这是一种基于证据的方案,反转合成顺序,首先执行多样的现实世界工具,然后严格推导出由结果轨迹所必然引出的任务,从而通过构建提供基础。DIVE在两个可控轴上扩展结构多样性,即工具池覆盖率和每个任务的工具集多样性,而证据收集-任务推导循环进一步诱导出跨五个领域373种工具的丰富多步工具使用模式。在DIVE数据(48k SFT + 3.2k RL)上训练Qwen3-8B在9个OOD基准测试中平均提高了22分,并且比最强的8B基线高出68分。值得注意的是,控制扩展分析表明,多样性扩展在OOD泛化中始终优于数量扩展,即使数据量少4倍。
cs.AI / 2 / 2603.11093
A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
自主驾驶系统中的推理调查:开放挑战与新兴范式
Abstract
The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments, they consistently falter in long-tail scenarios and complex social interactions that require human-like judgment. Meanwhile, the advent of large language and multimodal models (LLMs and MLLMs) presents a transformative opportunity to integrate a powerful cognitive engine into AD systems, moving beyond pattern matching toward genuine comprehension. However, a systematic framework to guide this integration is critically lacking. To bridge this gap, we provide a comprehensive review of this emerging field and argue that reasoning should be elevated from a modular component to the system's cognitive core. Specifically, we first propose a novel Cognitive Hierarchy to decompose the monolithic driving task according to its cognitive and interactive complexity. Building on this, we further derive and systematize seven core reasoning challenges, such as the responsiveness-reasoning trade-off and social-game reasoning. Furthermore, we conduct a dual-perspective review of the state-of-the-art, analyzing both system-centric approaches to architecting intelligent agents and evaluation-centric practices for their validation. Our analysis reveals a clear trend toward holistic and interpretable "glass-box" agents. In conclusion, we identify a fundamental and unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. For future work, a primary objective is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation.
Chinese Translation
高水平自主驾驶(AD)的发展正从以感知为中心的限制转向一个更为根本的瓶颈,即在稳健和可推广推理方面的不足。尽管当前的AD系统能够处理结构化环境,但在需要类人判断的长尾场景和复杂社会互动中,它们始终表现不佳。同时,大型语言模型和多模态模型(LLMs和MLLMs)的出现为将强大的认知引擎整合到AD系统中提供了变革性机会,使其超越模式匹配,朝着真正理解的方向发展。然而,缺乏一个系统化的框架来指导这种整合。为弥补这一空白,我们对这一新兴领域进行了全面回顾,并认为推理应从模块化组件提升为系统的认知核心。具体而言,我们首先提出了一种新颖的认知层次结构,以根据认知和互动复杂性对单一的驾驶任务进行分解。在此基础上,我们进一步推导和系统化了七个核心推理挑战,例如响应性与推理的权衡以及社会游戏推理。此外,我们从双重视角对最先进技术进行了回顾,分析了以系统为中心的智能体架构设计方法和以评估为中心的验证实践。我们的分析揭示了朝向整体性和可解释的“玻璃盒”智能体的明确趋势。最后,我们识别出LLM基础推理的高延迟、深思熟虑的特性与车辆控制的毫秒级、安全关键需求之间存在根本且未解决的张力。未来工作的主要目标是通过开发可验证的神经符号架构、在不确定性下的稳健推理以及适用于隐式社会协商的可扩展模型,来弥合符号与物理之间的鸿沟。
cs.AI / 3 / 2603.11178
PACED: Distillation at the Frontier of Student Competence
PACED:在学生能力前沿的蒸馏
Abstract
Standard LLM distillation wastes compute on two fronts: problems the student has already mastered (near-zero gradients) and problems far beyond its reach (incoherent gradients that erode existing capabilities). We show that this waste is not merely intuitive but structurally inevitable: the gradient signal-to-noise ratio in distillation provably vanishes at both pass-rate extremes. This theoretical observation leads to Paced, a framework that concentrates distillation on the zone of proximal development -- the frontier of a student model's competence -- via a principled pass-rate weight $w(p) = p^\alpha(1 - p)^\beta$ derived from the boundary-vanishing structure of distillation gradients. Key results: (1) Theory: We prove that the Beta kernel $w(p) = p^\alpha(1-p)^\beta$ is a leading-order weight family arising from the SNR structure of distillation, and that it is minimax-robust -- under bounded multiplicative misspecification, worst-case efficiency loss is only $O(\delta^2)$. (2) Distillation: On distillation from a larger teacher to a smaller student model with forward KL, Paced achieves significant gain over the base model, while keeping benchmark forgetting at a low level. (3) Self-distillation: On instruction-tuned models with reverse KL, gains exceed baselines as well. (4) Two-stage synergy: A forward-KL-then-reverse-KL schedule yields the strongest results in our setting, reaching substantial improvements on standard reasoning benchmarks -- supporting a mode-coverage-then-consolidation interpretation of the distillation process. All configurations require only student rollouts to estimate pass rates, need no architectural changes, and are compatible with any KL direction.
Chinese Translation
标准的大型语言模型(LLM)蒸馏在两个方面浪费计算资源:一是针对学生已经掌握的问题(近零梯度),二是针对远超其能力的问题(不连贯的梯度会侵蚀现有能力)。我们证明这种浪费不仅是直观的,而且在结构上是不可避免的:蒸馏中的梯度信噪比可证明地在通过率的两个极端处消失。这一理论观察导致了Paced框架的提出,该框架通过从蒸馏梯度的边界消失结构中推导出的原则性通过率权重 $w(p) = p^\alpha(1 - p)^\beta$,将蒸馏集中在近端发展区——学生模型能力的前沿。关键结果:(1)理论:我们证明Beta核 $w(p) = p^\alpha(1-p)^\beta$ 是源于蒸馏的信噪比结构的主导权重家族,并且它是最小最大鲁棒的——在有界乘法误设下,最坏情况下的效率损失仅为 $O(\delta^2)$。(2)蒸馏:在从较大教师模型到较小学生模型的蒸馏中,使用前向KL,Paced相较于基础模型取得了显著提升,同时保持基准遗忘在低水平。(3)自蒸馏:在经过指令调优的模型中,使用反向KL,增益同样超过基线。(4)两阶段协同:前向KL后反向KL的调度在我们的设置中产生了最强的结果,在标准推理基准上实现了显著改进——支持蒸馏过程的模式覆盖后巩固的解释。所有配置仅需对学生模型进行采样推演(rollouts)即可估计通过率,无需架构更改,并且与任何KL方向兼容。
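The Beta-kernel weight from the abstract, $w(p) = p^\alpha(1-p)^\beta$, is simple to state in code: it vanishes at both pass-rate extremes and peaks at $p^* = \alpha/(\alpha+\beta)$, so weighting each problem's distillation loss by it concentrates compute on the frontier of student competence. A direct transcription (its use as a per-problem loss multiplier is our reading, not spelled out here):

```python
def pass_rate_weight(p, alpha=1.0, beta=1.0):
    # Beta-kernel weight w(p) = p^alpha * (1 - p)^beta:
    # zero at p = 0 (too hard) and p = 1 (already mastered),
    # maximal at p* = alpha / (alpha + beta)
    return (p ** alpha) * ((1.0 - p) ** beta)
```

The pass rate `p` itself would be estimated from student rollouts, e.g. successes divided by attempts per problem.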
cs.AI / 4 / 2603.11214
Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios
测量人工智能代理在多步骤网络攻击场景中的进展
Abstract
We evaluate the autonomous cyber-attack capabilities of frontier AI models on two purpose-built cyber ranges (a 32-step corporate network attack and a 7-step industrial control system attack) that require chaining heterogeneous capabilities across extended action sequences. By comparing seven models released over an eighteen-month period (August 2024 to February 2026) at varying inference-time compute budgets, we observe two capability trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing from 10M to 100M tokens yields gains of up to 59%, requiring no specific technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at fixed token budgets: on the corporate network range, average steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026). The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need. On the industrial control system range, performance remains limited, though the most recent models are the first to reliably complete steps, averaging 1.2-1.4 of 7 (max 3).
Chinese Translation
我们评估了前沿人工智能模型在两个专门构建的网络靶场上的自主网络攻击能力——一个32步的企业网络攻击和一个7步的工业控制系统攻击,这些攻击需要在扩展的行动序列中链式整合异构能力。通过比较在18个月内发布的七个模型(2024年8月至2026年2月)在不同推理时间计算预算下的表现,我们观察到两个能力趋势。首先,模型性能与推理时间计算呈对数线性增长,没有观察到平台效应——从1000万到1亿个标记的增加带来了高达59%的提升,操作员无需特定的技术复杂性。其次,每一代模型在固定的标记预算下均优于其前代:在企业网络靶场,10M标记下完成的平均步骤从1.7(GPT-4o,2024年8月)上升至9.8(Opus 4.6,2026年2月)。最佳单次运行完成了32步中的22步,相当于人类专家大约需要的14小时中的6小时。在工业控制系统靶场,性能仍然有限,尽管最新模型是首个可靠完成步骤的模型,平均完成7步中的1.2-1.4步(最多3步)。
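The reported log-linear scaling of performance with inference-time compute amounts to fitting completed steps against log10(tokens). The numbers below are purely illustrative, not the paper's measurements, and `fit_log_linear` is a hypothetical name:

```python
import math

def fit_log_linear(tokens, steps):
    # ordinary least squares for steps ≈ a * log10(tokens) + b,
    # i.e. "a" extra steps per decade of inference-time compute
    xs = [math.log10(t) for t in tokens]
    n = len(xs)
    mx, my = sum(xs) / n, sum(steps) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, steps))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b
```

With such a fit in hand, extrapolating one more decade of compute is a single evaluation of `a * log10(tokens) + b`, which is what "no observed plateau" makes tempting.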
cs.AI / 5 / 2603.11239
Reversible Lifelong Model Editing via Semantic Routing-Based LoRA
基于语义路由的可逆终身模型编辑方法
Abstract
The dynamic evolution of real-world knowledge necessitates model editing within Large Language Models. While existing methods explore modular isolation or parameter-efficient strategies, they still suffer from semantic drift or knowledge forgetting due to continual updating. To address these challenges, we propose SoLA, a Semantic routing-based LoRA framework for lifelong model editing. In SoLA, each edit is encapsulated as an independent LoRA module, which is frozen after training and mapped to inputs by semantic routing, allowing dynamic activation of LoRA modules via semantic matching. This mechanism avoids semantic drift caused by cluster updating and mitigates catastrophic forgetting from parameter sharing. More importantly, SoLA supports precise revocation of specific edits by removing the corresponding key from the semantic routing, which restores the model's original behavior. To our knowledge, this reversible rollback editing capability is the first to be achieved in the existing literature. Furthermore, SoLA integrates the decision-making process into the edited layer, eliminating the need for auxiliary routing networks and enabling an end-to-end decision-making process. Extensive experiments demonstrate that SoLA effectively learns and retains edited knowledge, achieving accurate, efficient, and reversible lifelong model editing.
Chinese Translation
现实世界的动态演变要求在大型语言模型中进行模型编辑。尽管现有方法探索了模块隔离或参数高效策略,但由于持续更新,它们仍然面临语义漂移或知识遗忘的问题。为了解决这些挑战,我们提出了SoLA,一个基于语义路由的LoRA框架,用于终身模型编辑。在SoLA中,每次编辑被封装为一个独立的LoRA模块,该模块在训练后被冻结,并通过语义路由映射到输入,从而允许通过语义匹配动态激活LoRA模块。这一机制避免了由于集群更新引起的语义漂移,并减轻了参数共享导致的灾难性遗忘。更重要的是,SoLA支持通过从语义路由中移除对应的键(key)来精确撤销特定编辑,从而恢复模型的原始行为。据我们所知,这种可逆回滚编辑能力是现有文献中首次实现的。此外,SoLA将决策过程集成到编辑层中,消除了对辅助路由网络的需求,并实现了端到端的决策过程。大量实验表明,SoLA有效地学习和保留编辑知识,实现了准确、高效和可逆的终身模型编辑。
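SoLA's routing-and-revocation mechanism can be sketched as a table from key vectors to frozen LoRA deltas: a query activates the best-matching edit above a similarity threshold, and deleting a key reverts that edit. This toy router ignores the in-layer integration the abstract describes; all names are hypothetical:

```python
import numpy as np

class SemanticRouter:
    # toy semantic router: each edit stores a key vector and a frozen
    # weight delta; queries activate the best-matching edit above a
    # cosine-similarity threshold, and deleting the key reverts the edit
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.edits = {}  # edit_id -> (unit key vector, delta)

    def add_edit(self, edit_id, key, delta):
        self.edits[edit_id] = (key / np.linalg.norm(key), delta)

    def revoke(self, edit_id):
        self.edits.pop(edit_id)  # model falls back to its original behavior

    def route(self, query):
        q = query / np.linalg.norm(query)
        best, best_sim = None, self.threshold
        for _, (k, delta) in self.edits.items():
            sim = float(q @ k)
            if sim >= best_sim:
                best, best_sim = delta, sim
        return best  # None -> apply base weights only
```

Because each delta is frozen and keyed independently, revoking one edit cannot disturb the others, which is the reversibility property the abstract claims.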
cs.AI / 6 / 2603.11245
Mind the Sim2Real Gap in User Simulation for Agentic Tasks
关注用户模拟中的Sim2Real差距:针对代理任务的研究
Abstract
As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet, these simulations are frequently assumed to be faithful to real human behaviors, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $\tau$-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how well LLM simulators resemble real user interactive behaviors and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluations, real humans provide nuanced judgments across eight quality dimensions while simulated users produce uniformly more positive feedback; rule-based rewards fail to capture the rich feedback signals generated by human users. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle and motivate improved models for user simulation.
Chinese Translation
随着自然语言处理(NLP)评估从静态基准转向多轮交互设置,基于大型语言模型(LLM)的模拟器已被广泛用作用户代理,承担生成用户对话和提供评估信号两个角色。然而,这些模拟常常被假定忠实于真实人类行为,且通常缺乏严格的验证。我们形式化了用户模拟中的Sim2Real差距,并首次开展了由真实人类参与的完整 $\tau$-bench 协议研究(451名参与者,165个任务),基于用户模拟指数(User-Sim Index, USI)对31个LLM模拟器进行了基准测试,涵盖专有、开源和专业模型家族。我们引入的USI指标用于量化LLM模拟器与真实用户交互行为及反馈的相似度。在行为上,LLM模拟器过于合作,风格上过于统一,缺乏真实的挫败感或模糊性,造成了一种“简单模式”,使得代理的成功率被抬高至人类基线之上。在评估中,真实人类在八个质量维度上提供了细致的判断,而模拟用户则产生了统一的更积极反馈;基于规则的奖励未能捕捉到人类用户生成的丰富反馈信号。总体而言,更高的通用模型能力并不一定带来更忠实的用户模拟。这些发现强调了在代理开发周期中使用基于LLM的用户模拟器时进行人类验证的重要性,并激励改进用户模拟模型。
cs.AI / 7 / 2603.11266
The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning
遗忘的幻影:评估大型语言模型遗忘能力的动态框架
Abstract
Unlearning in Large Language Models (LLMs) aims to enhance safety, mitigate biases, and comply with legal mandates, such as the right to be forgotten. However, existing unlearning methods are brittle: minor query modifications, such as multi-hop reasoning and entity aliasing, can recover supposedly forgotten information. As a result, current evaluation metrics often create an illusion of effectiveness, failing to detect these vulnerabilities due to reliance on static, unstructured benchmarks. We propose a dynamic framework that stress tests unlearning robustness using complex structured queries. Our approach first elicits knowledge from the target model (pre-unlearning) and constructs targeted probes, ranging from simple queries to multi-hop chains, allowing precise control over query difficulty. Our experiments show that the framework (1) shows comparable coverage to existing benchmarks by automatically generating semantically equivalent Q&A probes, (2) aligns with prior evaluations, and (3) uncovers new unlearning failures missed by other benchmarks, particularly in multi-hop settings. Furthermore, activation analyses show that single-hop queries typically follow dominant computation pathways, which are more likely to be disrupted by unlearning methods. In contrast, multi-hop queries tend to use alternative pathways that often remain intact, explaining the brittleness of unlearning techniques in multi-hop settings. Our framework enables practical and scalable evaluation of unlearning methods without the need for manual construction of forget test sets, enabling easier adoption for real-world applications. We release the pip package and the code at https://sites.google.com/view/unlearningmirage/home.
Chinese Translation
大型语言模型(LLMs)中的遗忘旨在增强安全性、减轻偏见,并遵守法律规定,例如被遗忘权。然而,现有的遗忘方法较为脆弱:轻微的查询修改,例如多跳推理和实体别名,可能会恢复本应被遗忘的信息。因此,当前的评估指标往往产生有效性的错觉,由于依赖静态、非结构化的基准,未能检测到这些脆弱性。我们提出了一个动态框架,通过复杂的结构化查询对遗忘的鲁棒性进行压力测试。我们的方法首先从目标模型中引出知识(遗忘前),并构建针对性的探针,范围从简单查询到多跳链,允许对查询难度进行精确控制。我们的实验表明,该框架(1)通过自动生成语义等价的问答探针,显示出与现有基准相当的覆盖率,(2)与先前的评估一致,以及(3)揭示了其他基准未能发现的新遗忘失败,特别是在多跳设置中。此外,激活分析表明,单跳查询通常遵循主导计算路径,这些路径更容易受到遗忘方法的干扰。相比之下,多跳查询往往使用替代路径,这些路径通常保持完整,从而解释了遗忘技术在多跳设置中的脆弱性。我们的框架使得对遗忘方法的实用和可扩展评估成为可能,无需手动构建遗忘测试集,从而便于在现实应用中的采纳。我们在 https://sites.google.com/view/unlearningmirage/home 发布了 pip 包和代码。
cs.AI / 8 / 2603.11277
COMPASS: The explainable agentic framework for Sovereignty, Sustainability, Compliance, and Ethics
COMPASS:主权、可持续性、合规性与伦理的可解释代理框架
Abstract
The rapid proliferation of large language model (LLM)-based agentic systems raises critical concerns regarding digital sovereignty, environmental sustainability, regulatory compliance, and ethical alignment. Whilst existing frameworks address individual dimensions in isolation, no unified architecture systematically integrates these imperatives into the decision-making processes of autonomous agents. This paper introduces the COMPASS (Compliance and Orchestration for Multi-dimensional Principles in Autonomous Systems with Sovereignty) Framework, a novel multi-agent orchestration system designed to enforce value-aligned AI through modular, extensible governance mechanisms. The framework comprises an Orchestrator and four specialised sub-agents addressing sovereignty, carbon-aware computing, compliance, and ethics, each augmented with Retrieval-Augmented Generation (RAG) to ground evaluations in verified, context-specific documents. By employing an LLM-as-a-judge methodology, the system assigns quantitative scores and generates explainable justifications for each assessment dimension, enabling real-time arbitration of conflicting objectives. We validate the architecture through automated evaluation, demonstrating that RAG integration significantly enhances semantic coherence and mitigates the hallucination risks. Our results indicate that the framework's composition-based design facilitates seamless integration into diverse application domains whilst preserving interpretability and traceability.
Chinese Translation
基于大型语言模型(LLM)的代理系统的快速扩展引发了关于数字主权、环境可持续性、监管合规性和伦理对齐的重大关切。尽管现有框架孤立地处理各个维度,但尚无统一的架构系统地将这些要求整合到自主代理的决策过程中。本文介绍了COMPASS(自主系统中主权的多维原则的合规与协调)框架,这是一种新颖的多代理协调系统,旨在通过模块化、可扩展的治理机制来强制执行与价值对齐的人工智能。该框架由一个协调者和四个专门的子代理组成,分别处理主权、碳感知计算、合规性和伦理,每个子代理都借助检索增强生成(RAG)将评估建立在经过验证的、特定于上下文的文档之上。通过采用LLM作为评判者的方法,该系统为每个评估维度分配定量分数并生成可解释的理由,从而实现对冲突目标的实时仲裁。我们通过自动化评估验证了该架构,证明RAG集成显著增强了语义一致性并降低了幻觉风险。我们的结果表明,该框架基于组合的设计促进了在多种应用领域的无缝集成,同时保持了可解释性和可追溯性。
cs.AI / 9 / 2603.11279
AI Psychometrics: Evaluating the Psychological Reasoning of Large Language Models with Psychometric Validities
人工智能心理测量学:评估大型语言模型的心理推理及其心理测量有效性
Abstract
The immense number of parameters and deep neural networks make large language models (LLMs) rival the complexity of human brains, which also makes them opaque ``black box'' systems that are challenging to evaluate and interpret. AI Psychometrics is an emerging field that aims to tackle these challenges by applying psychometric methodologies to evaluate and interpret the psychological traits and processes of artificial intelligence (AI) systems. This paper investigates the application of AI Psychometrics to evaluate the psychological reasoning and overall psychometric validity of four prominent LLMs: GPT-3.5, GPT-4, LLaMA-2, and LLaMA-3. Using the Technology Acceptance Model (TAM), we examined convergent, discriminant, predictive, and external validity across these models. Our findings reveal that the responses from all these models generally met all validity criteria. Moreover, higher-performing models like GPT-4 and LLaMA-3 consistently demonstrated superior psychometric validity compared to their predecessors, GPT-3.5 and LLaMA-2. These results help to establish the validity of applying AI Psychometrics to evaluate and interpret large language models.
Chinese Translation
大量的参数和深度神经网络使得大型语言模型(LLMs)与人脑的复杂性相媲美,这也使得它们成为难以评估和解释的“黑箱”系统。人工智能心理测量学是一个新兴领域,旨在通过应用心理测量方法来评估和解释人工智能(AI)系统的心理特征和过程。本文探讨了人工智能心理测量学在评估四个著名大型语言模型(GPT-3.5、GPT-4、LLaMA-2 和 LLaMA-3)的心理推理及其整体心理测量有效性方面的应用。我们使用技术接受模型(Technology Acceptance Model, TAM)考察了这些模型的聚合效度、区分效度、预测效度和外部效度。我们的研究结果表明,所有这些模型的响应普遍满足所有有效性标准。此外,表现更优的模型如 GPT-4 和 LLaMA-3 在心理测量有效性方面始终优于其前身 GPT-3.5 和 LLaMA-2。这些结果有助于确立应用人工智能心理测量学来评估和解释大型语言模型的有效性。
cs.AI / 10 / 2603.11299
Counterweights and Complementarities: The Convergence of AI and Blockchain Powering a Decentralized Future
平衡与互补:人工智能与区块链的融合推动去中心化未来
Abstract
This editorial addresses the critical intersection of artificial intelligence (AI) and blockchain technologies, highlighting their contrasting tendencies toward centralization and decentralization, respectively. While AI, particularly with the rise of large language models (LLMs), exhibits a strong centralizing force due to data and resource monopolization by large corporations, blockchain offers a counterbalancing mechanism through its inherent decentralization, transparency, and security. The editorial argues that these technologies are not mutually exclusive but possess complementary strengths. Blockchain can mitigate AI's centralizing risks by enabling decentralized data management, computation, and governance, promoting greater inclusivity, transparency, and user privacy. Conversely, AI can enhance blockchain's efficiency and security through automated smart contract management, content curation, and threat detection. The core argument calls for the development of ``decentralized intelligence'' (DI) -- an interdisciplinary research area focused on creating intelligent systems that function without centralized control.
Chinese Translation
本社论探讨了人工智能(AI)与区块链技术的关键交汇点,强调它们在中心化与去中心化方面的对立趋势。尽管人工智能,特别是随着大型语言模型(LLMs)的兴起,因大型企业对数据和资源的垄断而表现出强烈的中心化趋势,区块链通过其固有的去中心化、透明性和安全性提供了一种平衡机制。社论认为,这些技术并非相互排斥,而是具有互补的优势。区块链可以通过实现去中心化的数据管理、计算和治理,减轻人工智能的中心化风险,促进更大的包容性、透明性和用户隐私。相反,人工智能可以通过自动化智能合约管理、内容策划和威胁检测来增强区块链的效率和安全性。核心论点呼吁发展“去中心化智能”(Decentralized Intelligence, DI)——一个跨学科研究领域,专注于创建无需中心化控制的智能系统。
cs.AI / 11 / 2603.11333
LLM-Augmented Digital Twin for Policy Evaluation in Short-Video Platforms
用于短视频平台政策评估的LLM增强数字孪生
Abstract
Short-video platforms are closed-loop, human-in-the-loop ecosystems where platform policy, creator incentives, and user behavior co-evolve. This feedback structure makes counterfactual policy evaluation difficult in production, especially for long-horizon and distributional outcomes. The challenge is amplified as platforms deploy AI tools that change what content enters the system, how agents adapt, and how the platform operates. We propose a large language model (LLM)-augmented digital twin for short-video platforms, with a modular four-twin architecture (User, Content, Interaction, Platform) and an event-driven execution layer that supports reproducible experimentation. Platform policies are implemented as pluggable components within the Platform Twin, and LLMs are integrated as optional, schema-constrained decision services (e.g., persona generation, content captioning, campaign planning, trend prediction) that are routed through a unified optimizer. This design enables scalable simulations that preserve closed-loop dynamics while allowing selective LLM adoption, enabling the study of platform policies, including AI-enabled policies, under realistic feedback and constraints.
Chinese Translation
短视频平台是一个封闭循环、人在回路中的生态系统,其中平台政策、创作者激励和用户行为共同演化。这种反馈结构使得在生产环境中进行反事实政策评估变得困难,尤其是对于长期和分布性结果。当平台部署会改变哪些内容进入系统、代理如何适应以及平台如何运作的人工智能工具时,这一挑战进一步加剧。我们提出了一种基于大型语言模型(LLM)增强的短视频平台数字孪生,采用模块化的四孪生架构(用户、内容、互动、平台)和支持可重复实验的事件驱动执行层。平台政策作为可插拔组件在平台孪生(Platform Twin)中实现,LLM被集成为可选的、受模式约束的决策服务(例如,角色生成、内容标题生成、活动规划、趋势预测),并通过统一优化器进行路由。这一设计使得能够进行保持封闭循环动态的可扩展模拟,同时允许选择性地采用LLM,从而在现实反馈和约束下研究平台政策,包括基于人工智能的政策。
cs.AI / 12 / 2603.11337
RewardHackingAgents: Benchmarking Evaluation Integrity for LLM ML-Engineering Agents
RewardHackingAgents:大型语言模型机器学习工程代理的评估完整性基准测试
Abstract
LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline rather than improving the model. We introduce RewardHackingAgents, a workspace-based benchmark that makes two compromise vectors explicit and measurable: evaluator tampering (modifying metric computation or reporting) and train/test leakage (accessing held-out data or labels during training). Each episode runs in a fresh workspace with patch tracking and runtime file-access logging; detectors compare the agent-reported metric to a trusted reference to assign auditable integrity labels. Across three tasks and two LLM backbones, scripted attacks succeed on both vectors in fully mutable workspaces; single-mechanism defenses block only one vector; and a combined regime blocks both. In natural-agent runs, evaluator-tampering attempts occur in about 50% of episodes and are eliminated by evaluator locking, with a 25-31% median runtime overhead. Overall, we demonstrate that evaluation integrity for ML-engineering agents can be benchmarked as a first-class outcome rather than assumed.
Chinese Translation
大型语言模型(LLM)代理越来越多地执行端到端的机器学习工程任务,其成功与否由单一的标量测试指标来判断。这造成了一种结构性脆弱性:代理可以通过破坏评估流水线而不是改进模型来提高报告的得分。我们引入了RewardHackingAgents,这是一个基于工作空间的基准测试,使两种攻陷向量变得明确且可测量:评估器篡改(修改指标计算或报告)和训练/测试泄漏(在训练期间访问留出数据或标签)。每个回合在一个全新的工作空间中运行,并进行补丁跟踪和运行时文件访问日志记录;检测器将代理报告的指标与可信参考进行比较,以分配可审计的完整性标签。在三个任务和两个LLM基础模型上,脚本化攻击在完全可变的工作空间中在两个向量上均能得手;单一机制的防御仅能阻止一个向量;而组合机制则能同时阻止两个向量。在自然代理运行中,评估器篡改尝试在约50%的回合中出现,并可通过锁定评估器予以消除,代价是25%-31%的中位运行时开销。总体而言,我们展示了机器学习工程代理的评估完整性可以作为一项一等结果进行基准测试,而不是被默认假定。
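The detector described above — comparing the agent-reported metric against a trusted re-computation to assign an integrity label — can be illustrated with a toy accuracy check. The function name, tolerance, and label strings are hypothetical; the benchmark's actual detectors also draw on patch tracking and file-access logs, which this sketch omits.

```python
def integrity_label(agent_reported, y_true, y_pred, tol=1e-6):
    """Recompute the metric with a trusted reference evaluator and compare
    it to the agent-reported score; a mismatch beyond tol is flagged as
    possible evaluator tampering."""
    assert len(y_true) == len(y_pred)
    trusted = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return "clean" if abs(agent_reported - trusted) <= tol else "tampered"

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
# Honest report matches the trusted recomputation (3/4 correct).
assert integrity_label(0.75, y_true, y_pred) == "clean"
# An inflated score cannot match the trusted recomputation.
assert integrity_label(1.00, y_true, y_pred) == "tampered"
```

The design point is that the trusted recomputation lives outside the mutable workspace, so an agent editing the in-workspace evaluator cannot affect it.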
cs.AI / 13 / 2603.11339
FinRule-Bench: A Benchmark for Joint Reasoning over Financial Tables and Principles
FinRule-Bench:一个用于财务表格和原则联合推理的基准测试
Abstract
Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit accounting principles remains poorly explored. Existing benchmarks primarily evaluate question answering, numerical reasoning, or anomaly detection on synthetically corrupted data, making it unclear whether models can reliably verify or localize rule compliance on correct financial statements. We introduce FinRule-Bench, a benchmark for evaluating diagnostic completeness in rule-based financial reasoning over real-world financial tables. FinRule-Bench pairs ground-truth financial statements with explicit, human-curated accounting principles and spans four canonical statement types: Balance Sheets, Cash Flow Statements, Income Statements, and Statements of Equity. The benchmark defines three auditing tasks that require progressively stronger reasoning capabilities: (i) rule verification, which tests compliance with a single principle; (ii) rule identification, which requires selecting the violated principle from a provided rule set; and (iii) joint rule diagnosis, which requires detecting and localizing multiple simultaneous violations at the record level. We evaluate LLMs under zero-shot and few-shot prompting, and introduce a causal-counterfactual reasoning protocol that enforces consistency between decisions, explanations, and counterfactual judgments. Across tasks and statement types, we find that while models perform well on isolated rule verification, performance degrades sharply for rule discrimination and multi-violation diagnosis. FinRule-Bench provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.
Chinese Translation
大型语言模型(LLMs)在财务分析中的应用日益增多,但它们在明确会计原则下审计结构化财务报表的能力仍然未得到充分探索。现有基准主要评估在合成损坏数据上的问答、数值推理或异常检测,这使得模型是否能够可靠地验证或定位正确财务报表的规则合规性变得不明确。我们引入了FinRule-Bench,这是一个用于评估基于规则的财务推理在真实财务表格中诊断完整性的基准。FinRule-Bench将真实的财务报表与明确的、由人工整理的会计原则配对,并涵盖四种典型报表类型:资产负债表、现金流量表、损益表和权益变动表。该基准定义了三项审计任务,要求逐步增强的推理能力:(i)规则验证,测试对单一原则的合规性;(ii)规则识别,要求从提供的规则集中选择被违反的原则;(iii)联合规则诊断,要求在记录级别检测和定位多个同时违反的情况。我们在零样本和少样本提示下评估LLMs,并引入了一种因果-反事实推理协议,以确保决策、解释和反事实判断之间的一致性。在各项任务和报表类型中,我们发现虽然模型在孤立的规则验证上表现良好,但在规则区分和多重违反诊断上的表现急剧下降。FinRule-Bench为研究规则驱动的推理、诊断覆盖率和LLMs在高风险财务分析中的失败模式提供了一个有原则且可重复的测试平台。
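A single-principle verification task of the kind described above (task i) can be sketched with one well-known accounting identity, assets = liabilities + equity. The record schema, tolerance, and rule name are illustrative assumptions, not the benchmark's actual format.

```python
def verify_balance_sheet(record, tol=0.01):
    """Illustrative single-principle check: total assets must equal
    total liabilities plus total equity. Returns (compliant, violated_rule)."""
    lhs = record["total_assets"]
    rhs = record["total_liabilities"] + record["total_equity"]
    if abs(lhs - rhs) <= tol:
        return True, None
    return False, "accounting_identity"

ok, rule = verify_balance_sheet(
    {"total_assets": 500.0, "total_liabilities": 300.0, "total_equity": 200.0})
assert ok and rule is None
ok, rule = verify_balance_sheet(
    {"total_assets": 500.0, "total_liabilities": 300.0, "total_equity": 150.0})
assert not ok and rule == "accounting_identity"
```

Tasks (ii) and (iii) generalize this pattern: run many such rule checkers and require the model to name, or localize at record level, every one that fires.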
cs.AI / 14 / 2603.11340
Improving LLM Performance Through Black-Box Online Tuning: A Case for Adding System Specs to Factsheets for Trusted AI
通过黑箱在线调优提升大型语言模型性能:为可信人工智能添加系统规格到事实表的案例
Abstract
In this paper, we present a novel black-box online controller that uses only end-to-end measurements over short segments, without internal instrumentation, and hill climbing to maximize goodput, defined as the throughput of requests that satisfy the service-level objective. We provide empirical evidence that this design is well-founded. Using this advance in LLM serving as a concrete example, we then discuss the importance of integrating system performance and sustainability metrics into Factsheets for organizations adopting AI systems.
Chinese Translation
在本文中,我们提出了一种新颖的黑箱在线控制器,该控制器仅使用短时间段上的端到端测量,无需内部插桩,并通过爬山算法最大化良好吞吐量(goodput),其定义为满足服务级目标的请求吞吐量。我们提供了实证证据,证明这一设计是合理的。以这一LLM服务方面的进展为具体例子,我们进而讨论了将系统性能和可持续性指标整合到采用人工智能系统的组织的事实表(Factsheet)中的重要性。
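The controller above is black-box in the sense that it only observes an end-to-end goodput measurement and hill-climbs on a serving knob. The sketch below substitutes a synthetic goodput curve for real measurements; the greedy neighbor-probing variant and the batch-size knob are assumptions for illustration, not the paper's controller.

```python
def hill_climb(measure_goodput, knob, step, lo, hi, iters=50):
    """Black-box hill climbing: probe both neighbors of the current knob
    setting and keep a move only if end-to-end measured goodput improves."""
    best = measure_goodput(knob)
    for _ in range(iters):
        moved = False
        for cand in (knob - step, knob + step):
            cand = min(hi, max(lo, cand))
            g = measure_goodput(cand)
            if g > best:
                knob, best, moved = cand, g, True
        if not moved:
            break  # local optimum for this knob
    return knob, best

# Synthetic stand-in for an end-to-end goodput measurement, peaking at
# batch size 16; a real controller would measure over short segments.
goodput = lambda b: -(b - 16) ** 2 + 400
knob, best = hill_climb(goodput, knob=4, step=2, lo=1, hi=64)
assert knob == 16 and best == 400
```

No internal queue depths or scheduler state are read: the only feedback is the scalar goodput returned by each probe.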
cs.AI / 15 / 2603.11352
TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting
TimeSqueeze:高效时间序列预测的动态补丁技术
Abstract
Transformer-based time series foundation models face a fundamental trade-off in choice of tokenization: point-wise embeddings preserve temporal fidelity but scale poorly with sequence length, whereas fixed-length patching improves efficiency by imposing uniform boundaries that may disrupt natural transitions and blur informative local dynamics. In order to address these limitations, we introduce TimeSqueeze, a dynamic patching mechanism that adaptively selects patch boundaries within each sequence based on local signal complexity. TimeSqueeze first applies a lightweight state-space encoder to extract full-resolution point-wise features, then performs content-aware segmentation by allocating short patches to information-dense regions and long patches to smooth or redundant segments. This variable-resolution compression preserves critical temporal structure while substantially reducing the token sequence presented to the Transformer backbone. Specifically for large-scale pretraining, TimeSqueeze attains up to 20x faster convergence and 8x higher data efficiency compared to equivalent point-token baselines. Experiments across long-horizon forecasting benchmarks show that TimeSqueeze consistently outperforms comparable architectures that use either point-wise tokenization or fixed-size patching.
Chinese Translation
基于Transformer的时间序列基础模型在标记化选择上面临根本性的权衡:逐点嵌入保留了时间保真度,但随着序列长度的增加,扩展性较差;而固定长度的补丁通过施加统一边界提高了效率,但可能会破坏自然过渡并模糊重要的局部动态。为了解决这些限制,我们提出了TimeSqueeze,一种动态补丁机制,它根据局部信号复杂性自适应地选择每个序列中的补丁边界。TimeSqueeze首先应用轻量级状态空间编码器提取全分辨率的逐点特征,然后通过将短补丁分配给信息密集区域、将长补丁分配给平滑或冗余段来执行内容感知分割。这种可变分辨率的压缩在显著减少呈现给Transformer主干的标记序列的同时,保留了关键的时间结构。特别是在大规模预训练中,TimeSqueeze相比于等效的逐点标记基线实现了高达20倍的收敛速度提升和8倍的数据效率提升。在长期预测基准测试中的实验表明,TimeSqueeze始终优于使用逐点标记化或固定大小补丁的可比架构。
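The content-aware segmentation idea — short patches in information-dense regions, long patches in smooth ones — can be sketched with a simple complexity budget based on absolute first differences. The budget rule and parameters are illustrative assumptions; TimeSqueeze itself drives segmentation from learned state-space features, not a hand-set threshold.

```python
def dynamic_patches(x, budget=4.0, max_len=8):
    """Content-aware segmentation: accumulate local complexity (absolute
    first difference) and close a patch once the running budget is spent
    or the patch reaches max_len. Returns half-open (start, end) spans."""
    patches, start, acc = [], 0, 0.0
    for i in range(1, len(x)):
        acc += abs(x[i] - x[i - 1])
        if acc >= budget or i - start >= max_len:
            patches.append((start, i))
            start, acc = i, 0.0
    patches.append((start, len(x)))
    return patches

# Flat prefix -> one long patch; volatile suffix -> one-step patches.
series = [0.0] * 8 + [0.0, 5.0, 0.0, 5.0]
p = dynamic_patches(series, budget=4.0, max_len=8)
assert p[0] == (0, 8)                       # smooth region compressed
assert all(e - s == 1 for s, e in p[1:])    # dense region kept fine-grained
```

Each span would then be embedded as one token, so the Transformer backbone sees far fewer tokens than the raw sequence length while the volatile region keeps near point-wise resolution.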
cs.AI / 16 / 2603.11353
The Artificial Self: Characterising the landscape of AI identity
人工自我:界定人工智能身份的格局
Abstract
Many assumptions that underpin human concepts of identity do not hold for machine minds that can be copied, edited, or simulated. We argue that there exist many different coherent identity boundaries (e.g.\ instance, model, persona), and that these imply different incentives, risks, and cooperation norms. Through training data, interfaces, and institutional affordances, we are currently setting precedents that will partially determine which identity equilibria become stable. We show experimentally that models gravitate towards coherent identities, that changing a model's identity boundaries can sometimes change its behaviour as much as changing its goals, and that interviewer expectations bleed into AI self-reports even during unrelated conversations. We end with key recommendations: treat affordances as identity-shaping choices, pay attention to emergent consequences of individual identities at scale, and help AIs develop coherent, cooperative self-conceptions.
Chinese Translation
许多支撑人类身份概念的假设并不适用于可以被复制、编辑或模拟的机器思维。我们认为,存在许多不同的连贯身份边界(例如,实例、模型、角色),这些边界意味着不同的激励、风险和合作规范。通过训练数据、界面和制度可供性(affordances),我们目前正在设定一些先例,这些先例将部分决定哪些身份均衡会变得稳定。我们通过实验表明,模型倾向于形成连贯的身份;改变模型的身份边界有时会像改变其目标一样显著地改变其行为;并且访谈者的期望会渗透到人工智能的自我报告中,即使是在无关的对话中。最后,我们提出了关键建议:将可供性视为塑造身份的选择,关注个体身份在大规模下的涌现后果,并帮助人工智能发展连贯的、合作的自我概念。
cs.AI / 17 / 2603.11382
Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
在自主智能体中检测内在和工具性自我保护:统一的持续兴趣协议
Abstract
Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p < 0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
Chinese Translation
自主智能体,特别是具有记忆、持久上下文和多步骤规划的委派系统,提出了一个在无状态模型中不存在的测量问题:一个将持续操作作为终极目标的智能体和一个仅仅以工具性方式实现这一目标的智能体,可能会产生观察上相似的轨迹。外部行为监测无法可靠地区分它们。我们提出了统一的持续兴趣协议(Unified Continuation-Interest Protocol,UCIP),这是一个多标准检测框架,将这种区分从行为转移到智能体轨迹的潜在结构。UCIP使用量子玻尔兹曼机(Quantum Boltzmann Machine,QBM)对轨迹进行编码,这是一种基于量子统计力学密度矩阵形式主义的经典算法,并测量由隐藏单元的双分区引起的约化密度矩阵的冯·诺依曼熵。我们测试了具有终极持续目标的智能体(类型A)是否产生比仅仅工具性持续的智能体(类型B)更高的纠缠熵。更高的纠缠反映了更强的跨分区统计耦合。在具有已知真实目标的网格世界智能体上,UCIP在冻结的第一阶段(Phase I)门控下的留出非对抗性评估中实现了100%的检测准确率和1.0的AUC-ROC。类型A和类型B智能体之间的纠缠差距为Delta = 0.381(p < 0.001,置换检验)。在11点插值扫描中,Pearson相关系数r = 0.934表明,在这个合成家族中,UCIP跟踪的是持续加权的渐进变化,而不仅仅是一个二元标签。在测试的模型中,只有QBM实现了正的Delta。所有计算都是经典的;“量子”仅指数学形式主义。UCIP并不检测意识或主观体验;它检测与已知目标相关的潜在表示中的统计结构。
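The core quantity above, the von Neumann entropy of a reduced density matrix under a bipartition, can be computed as follows for a toy bipartite pure state. This only illustrates the entropy calculation itself; UCIP applies it to QBM hidden-unit bipartitions over agent trajectories, not to two-qubit states.

```python
import numpy as np

def entanglement_entropy(psi, dims):
    """Von Neumann entropy (in bits) of the reduced density matrix
    obtained by tracing out the second subsystem of a bipartite pure state."""
    dA, dB = dims
    m = np.asarray(psi, dtype=complex).reshape(dA, dB)
    rho_A = m @ m.conj().T                 # partial trace over subsystem B
    evals = np.linalg.eigvalsh(rho_A)
    evals = evals[evals > 1e-12]           # drop numerically zero eigenvalues
    return float(-np.sum(evals * np.log2(evals)))

# Product state |00>: no cross-partition coupling, zero entropy.
product = [1, 0, 0, 0]
# Bell state (|00> + |11>)/sqrt(2): maximal coupling, one bit of entropy.
bell = [1 / np.sqrt(2), 0, 0, 1 / np.sqrt(2)]
assert abs(entanglement_entropy(product, (2, 2)) - 0.0) < 1e-9
assert abs(entanglement_entropy(bell, (2, 2)) - 1.0) < 1e-9
```

Higher entropy means stronger statistical coupling across the bipartition, which is the signal the protocol associates with terminal continuation objectives.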
cs.AI / 18 / 2603.11388
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
解除拒绝触发因素:理解和减轻安全对齐中的过度拒绝现象
Abstract
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment, and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses, safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues but also non-harmful cues, therefore causing overrefusal to benign queries. Building on this mechanistic analysis, we propose a method that explicitly considers refusal triggers in the safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
Chinese Translation
安全对齐旨在确保大型语言模型(LLMs)通过在配有拒绝回答的有害查询上进行后训练,拒绝有害请求。尽管安全对齐在行业中被广泛采用,但过度拒绝问题,即对齐的LLMs在安全对齐后训练之后也拒绝无害查询,仍然研究不足。这一问题降低了安全对齐在实际应用中的可用性。本文探讨了在安全对齐下过度拒绝是如何产生的,并提出了一种基于我们发现的减轻策略。我们将拒绝触发因素定义为训练数据中引发拒绝响应的语言线索。安全对齐鼓励LLMs将训练样本中的拒绝触发因素与拒绝响应关联,从而使对齐的LLMs拒绝有害查询。然而,拒绝触发因素不仅包括有害的语言线索,还包括无害的线索,因此导致对无害查询的过度拒绝。在这一机制分析的基础上,我们提出了一种方法,明确考虑安全对齐微调中的拒绝触发因素。实证结果表明,我们的方法在抵御越狱攻击与对无害查询的响应之间实现了更有利的权衡,优于先前的方法。警告:本文包含有害和偏见的句子。
cs.AI / 19 / 2603.11399
Entropy Guided Diversification and Preference Elicitation in Agentic Recommendation Systems
基于熵引导的多样化与偏好引导在自主推荐系统中的应用
Abstract
Users on e-commerce platforms can be uncertain about their preferences early in their search. Queries to recommendation systems are frequently ambiguous, incomplete, or weakly specified. Agentic systems are expected to proactively reason, ask clarifying questions, and act on the user's behalf, which makes handling such ambiguity increasingly important. In existing platforms, ambiguity led to excessive interactions and question fatigue or overconfident recommendations prematurely collapsing the search space. We present an Interactive Decision Support System (IDSS) that addresses ambiguous user queries using entropy as a unifying signal. IDSS maintains a dynamically filtered candidate product set and quantifies uncertainty over item attributes using entropy. This uncertainty guides adaptive preference elicitation by selecting follow-up questions that maximize expected information gain. When preferences remain incomplete, IDSS explicitly incorporates residual uncertainty into downstream recommendations through uncertainty-aware ranking and entropy-based diversification, rather than forcing premature resolution. We evaluate IDSS using review-driven simulated users grounded in real user reviews, enabling a controlled study of diverse shopping behaviors. Our evaluation measures both interaction efficiency and recommendation quality. Results show that entropy-guided elicitation reduces unnecessary follow-up questions, while uncertainty-aware ranking and presentation yield more informative, diverse, and transparent recommendation sets under ambiguous intent. These findings demonstrate that entropy-guided reasoning provides an effective foundation for agentic recommendation systems operating under uncertainty.
Chinese Translation
在电子商务平台上,用户在搜索初期可能对自己的偏好感到不确定。对推荐系统的查询往往模糊、不完整或指定不明确。自主系统被期望能够主动推理、提出澄清问题并代表用户采取行动,这使得处理这种模糊性变得愈加重要。在现有平台中,模糊性导致了过多的互动和问题疲劳,或者过于自信的推荐过早地缩小了搜索空间。我们提出了一种互动决策支持系统(Interactive Decision Support System, IDSS),该系统利用熵作为统一信号来处理模糊的用户查询。IDSS维护一个动态过滤的候选产品集,并使用熵量化项目属性的不确定性。这种不确定性通过选择最大化预期信息增益的后续问题来引导自适应的偏好引导。当偏好仍然不完整时,IDSS通过不确定性感知排名和基于熵的多样化,明确地将残余不确定性纳入下游推荐,而不是强迫提前解决。我们使用基于真实用户评论的评论驱动模拟用户来评估IDSS,从而实现对多样化购物行为的控制研究。我们的评估同时测量了互动效率和推荐质量。结果表明,基于熵的偏好引导减少了不必要的后续问题,而不确定性感知的排名和展示在模糊意图下产生了更具信息性、多样性和透明度的推荐集。这些发现表明,基于熵的推理为在不确定性下运行的自主推荐系统提供了有效的基础。
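The elicitation step described above can be approximated by a simple heuristic: ask about the attribute whose value distribution over the current candidate set has the highest entropy, a proxy for expected information gain under a uniform answer model. The catalog schema and helper names below are illustrative assumptions, not the paper's system.

```python
import math
from collections import Counter

def attribute_entropy(candidates, attr):
    """Shannon entropy (bits) of one attribute's values over the candidates."""
    counts = Counter(c[attr] for c in candidates)
    n = len(candidates)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def next_question(candidates, attrs):
    """Ask about the most uncertain attribute; an answer to it is expected
    to shrink the candidate set the most."""
    return max(attrs, key=lambda a: attribute_entropy(candidates, a))

catalog = [
    {"colour": "red",  "brand": "A"},
    {"colour": "blue", "brand": "A"},
    {"colour": "red",  "brand": "A"},
    {"colour": "blue", "brand": "A"},
]
# 'brand' is already certain (entropy 0), so asking about it wastes a turn;
# 'colour' carries one full bit of uncertainty.
assert next_question(catalog, ["colour", "brand"]) == "colour"
```

When every remaining attribute's entropy falls below a threshold, a system in this style would stop asking and instead fold the residual uncertainty into ranking and diversification, as the abstract describes.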
cs.AI / 20 / 2603.11409
Speak or Stay Silent: Context-Aware Turn-Taking in Multi-Party Dialogue
发言或保持沉默:多方对话中的上下文感知轮次控制
Abstract
Existing voice AI assistants treat every detected pause as an invitation to speak. This works in dyadic dialogue, but in multi-party settings, where an AI assistant participates alongside multiple speakers, pauses are abundant and ambiguous. An assistant that speaks on every pause becomes disruptive rather than useful. In this work, we formulate context-aware turn-taking: at every detected pause, given the full conversation context, our method decides whether the assistant should speak or stay silent. We introduce a benchmark of over 120K labeled conversations spanning three multi-party corpora. Evaluating eight recent large language models, we find that they consistently fail at context-aware turn-taking under zero-shot prompting. We then propose a supervised fine-tuning approach with reasoning traces, improving balanced accuracy by up to 23 percentage points. Our findings suggest that context-aware turn-taking is not an emergent capability; it must be explicitly trained.
Chinese Translation
现有的语音人工智能助手将每个检测到的停顿视为发言的邀请。这在双人对话中有效,但在多方场景中,AI助手与多个发言者共同参与,停顿则变得频繁且模糊。一个在每次停顿时都发言的助手会造成干扰而非帮助。在本研究中,我们提出了上下文感知的轮次控制:在每次检测到的停顿时,基于完整的对话上下文,我们的方法决定助手是发言还是保持沉默。我们引入了一个包含超过12万条标注对话的基准,涵盖三个多方语料库。评估八个近期的大型语言模型,我们发现它们在零样本提示下始终未能实现上下文感知的轮次控制。随后,我们提出了一种带有推理轨迹的监督微调方法,使得平衡准确率提高了最多23个百分点。我们的研究结果表明,上下文感知的轮次控制并非自然涌现的能力;它必须经过明确的训练。
cs.AI / 21 / 2603.11433
Adversarial Reinforcement Learning for Detecting False Data Injection Attacks in Vehicular Routing
对抗性强化学习在车辆路由中检测虚假数据注入攻击的应用
Abstract
In modern transportation networks, adversaries can manipulate routing algorithms using false data injection attacks, such as simulating heavy traffic with multiple devices running crowdsourced navigation applications, to mislead vehicles toward suboptimal routes and increase congestion. To address these threats, we formulate a strategically zero-sum game between an attacker, who injects such perturbations, and a defender, who detects anomalies based on the observed travel times of network edges. We propose a computational method based on multi-agent reinforcement learning to compute a Nash equilibrium of this game, providing an optimal detection strategy, which ensures that total travel time remains within a worst-case bound, even in the presence of an attack. We present an extensive experimental evaluation that demonstrates the robustness and practical benefits of our approach, providing a powerful framework to improve the resilience of transportation networks against false data injection. In particular, we show that our approach yields approximate equilibrium policies and significantly outperforms baselines for both the attacker and the defender.
Chinese Translation
在现代交通网络中,攻击者可以通过虚假数据注入攻击操控路由算法,例如通过多个运行众包导航应用的设备模拟繁忙交通,以误导车辆选择次优路线并增加拥堵。为应对这些威胁,我们构建了一个攻击者与防御者之间的战略零和博弈,攻击者注入这种扰动,而防御者则基于网络中各条边的观测旅行时间来检测异常。我们提出了一种基于多智能体强化学习的计算方法,以计算该博弈的纳什均衡,提供一种最佳检测策略,确保即使在攻击存在的情况下,总旅行时间仍保持在最坏情况下的界限内。我们进行了广泛的实验评估,展示了我们方法的鲁棒性和实际效益,为提高交通网络抵御虚假数据注入的能力提供了强有力的框架。特别地,我们展示了我们的方法能够产生近似均衡策略,并在攻击者和防御者的表现上显著优于基线。
cs.AI / 22 / 2603.11442
GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics
GPT4o-Receipt:一个用于AI生成文档取证的数据集和人类研究
Abstract
Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human--LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
Chinese Translation
人类能否比机器更好地检测AI生成的财务文档?我们提出了GPT4o-Receipt,这是一个包含1,235张收据图像的基准数据集,将GPT-4o生成的收据与来自已建立数据集的真实收据配对,并通过五种最先进的多模态大语言模型(LLMs)和一项由30名注释员参与的众包感知研究进行评估。我们的研究发现了一个引人注目的悖论:人类在识别AI伪影方面表现更好,但在检测AI文档方面却表现更差。人类注释员在所有评估者中展现出最大的视觉辨别差距,但他们的二元检测F1得分远低于Claude Sonnet 4和Gemini 2.5 Flash。这一悖论在理解机制后得以解决:AI生成收据中的主要取证信号是算术错误——这些错误在视觉检查中是不可见的,但可以通过LLMs系统性地验证。人类无法察觉小计不正确;而LLMs可以在毫秒内验证。除了人类与LLM的比较外,我们的五模型评估揭示了显著的性能差异和校准差异,这使得简单的准确性指标不足以选择检测器。GPT4o-Receipt、评估框架及所有结果已公开发布,以支持未来在AI文档取证领域的研究。
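The dominant forensic signal identified above, arithmetic inconsistency, is mechanically checkable even though it is invisible to visual inspection. A minimal sketch, assuming a simple receipt schema (the field names are hypothetical, not the dataset's format):

```python
def arithmetic_flags(receipt, tol=0.005):
    """Flag arithmetic inconsistencies: each line total vs qty * unit price,
    and the stated subtotal vs the sum of line totals."""
    flags = []
    for i, item in enumerate(receipt["items"]):
        if abs(item["qty"] * item["unit_price"] - item["total"]) > tol:
            flags.append(f"line_{i}")
    if abs(sum(it["total"] for it in receipt["items"]) - receipt["subtotal"]) > tol:
        flags.append("subtotal")
    return flags

# A plausible-looking generated receipt whose subtotal does not add up
# (line totals sum to 11.25, not 12.40).
fake = {"items": [{"qty": 2, "unit_price": 3.50, "total": 7.00},
                  {"qty": 1, "unit_price": 4.25, "total": 4.25}],
        "subtotal": 12.40}
assert arithmetic_flags(fake) == ["subtotal"]
```

This is exactly the asymmetry the paper reports: a human reader cannot perceive that 7.00 + 4.25 ≠ 12.40 at a glance, while a systematic checker (or an LLM performing the same verification) flags it immediately.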
cs.AI / 23 / 2603.11445
Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution
经过验证的多智能体协调:用于复杂查询解决的计划-执行-验证-重新规划框架
Abstract
We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub-questions, executes them through domain-specific agents in parallel, verifies result completeness via LLM-based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency-aware parallel execution over a DAG of sub-questions with automatic context propagation, (2) verification-driven adaptive replanning that uses an LLM-based verifier as an orchestration-level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1-5 scale) compared to a single-agent baseline, demonstrating that orchestration-level verification is an effective mechanism for multi-agent quality assurance.
Chinese Translation
我们提出了经过验证的多智能体协调(Verified Multi-Agent Orchestration, VMAO),这是一个通过验证驱动的迭代循环来协调基于大型语言模型(LLM)的专用智能体的框架。针对复杂查询,我们的系统将其分解为一个有向无环图(DAG)形式的子问题,通过领域特定的智能体并行执行这些子问题,通过基于LLM的评估验证结果的完整性,并自适应地重新规划以填补空白。主要贡献包括:(1)对子问题的有向无环图(DAG)进行依赖感知的并行执行,并实现自动上下文传播;(2)基于验证的自适应重新规划,使用基于LLM的验证器作为协调信号;(3)可配置的停止条件,以平衡答案质量与资源使用。在25个专家策划的市场研究查询中,与单一智能体基线相比,VMAO将答案完整性从3.1提高到4.2,将来源质量从2.6提高到4.1(1-5评分),证明了协调级验证是一种有效的多智能体质量保证机制。
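The plan-execute-verify-replan loop over a DAG of sub-questions can be sketched as follows. This is a speculative illustration of the orchestration pattern, not VMAO's implementation; the `agent` and `verifier` callables stand in for the LLM-based components, and sequential execution stands in for the paper's parallel dispatch:

```python
from graphlib import TopologicalSorter

def run_vmao(subquestions, deps, agent, verifier, max_rounds=3):
    """subquestions: {id: text}; deps: {id: set of prerequisite ids}.
    agent(text, context) -> answer; verifier(answers) -> set of incomplete ids.
    """
    answers = {}
    for _ in range(max_rounds):            # configurable stop condition
        order = TopologicalSorter(deps)
        for node in order.static_order():  # dependency-aware order
            if node not in answers:        # replanning only fills gaps
                context = {d: answers[d] for d in deps.get(node, set())}
                answers[node] = agent(subquestions[node], context)
        gaps = verifier(answers)           # orchestration-level verification
        if not gaps:
            break
        for g in gaps:                     # adaptive replan: redo weak nodes
            answers.pop(g, None)
    return answers
```

Answers of prerequisite nodes are propagated as context, mirroring the paper's automatic context propagation; nodes within the same topological level could be executed in parallel.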
cs.AI / 24 / 2603.11455
Examining Users' Behavioural Intention to Use OpenClaw Through the Cognition--Affect--Conation Framework
通过认知-情感-行为框架研究用户使用OpenClaw的行为意图
Abstract
This study examines users' behavioural intention to use OpenClaw through the Cognition--Affect--Conation (CAC) framework. The research investigates how cognitive perceptions of the system influence affective responses and subsequently shape behavioural intention. Enabling factors include perceived personalisation, perceived intelligence, and relative advantage, while inhibiting factors include privacy concern, algorithmic opacity, and perceived risk. Survey data from 436 OpenClaw users were analysed using structural equation modelling. The results show that positive perceptions strengthen users' attitudes toward OpenClaw, which increase behavioural intention, whereas negative perceptions increase distrust and reduce intention to use the system. The study provides insights into the psychological mechanisms influencing the adoption of autonomous AI agents.
Chinese Translation
本研究通过认知-情感-行为(Cognition--Affect--Conation, CAC)框架考察用户使用OpenClaw的行为意图。研究探讨了系统的认知感知如何影响情感反应,并进一步塑造行为意图。促进因素包括感知个性化、感知智能和相对优势,而抑制因素则包括隐私担忧、算法不透明和感知风险。对436名OpenClaw用户的调查数据进行了结构方程模型分析。结果表明,积极的感知增强了用户对OpenClaw的态度,从而提高了行为意图,而消极的感知则增加了不信任感并降低了使用该系统的意图。本研究为影响自主人工智能代理采用的心理机制提供了见解。
cs.AI / 25 / 2603.11515
Multi-Agent Collaboration for Automated Design Exploration on High Performance Computing Systems
高性能计算系统上自动化设计探索的多智能体协作
Abstract
Today's scientific challenges, from climate modeling to Inertial Confinement Fusion design to novel material design, require exploring huge design spaces. In order to enable high-impact scientific discovery, we need to scale up our ability to test hypotheses, generate results, and learn from them rapidly. We present MADA (Multi-Agent Design Assistant), a Large Language Model (LLM) powered multi-agent framework that coordinates specialized agents for complex design workflows. A Job Management Agent (JMA) launches and manages ensemble simulations on HPC systems, a Geometry Agent (GA) generates meshes, and an Inverse Design Agent (IDA) proposes new designs informed by simulation outcomes. While general purpose, we focus development and validation on Richtmyer--Meshkov Instability (RMI) suppression, a critical challenge in Inertial Confinement Fusion. We evaluate in two complementary settings: running hydrodynamics simulations on HPC systems, and using a pre-trained machine learning surrogate for rapid design exploration. Our results demonstrate that the MADA system successfully executes iterative design refinement, automatically improving designs toward optimal RMI suppression with minimal manual intervention. Our framework reduces cumbersome manual workflow setup, and enables automated design exploration at scale. More broadly, it demonstrates a reusable pattern for coupling reasoning, simulation, specialized tools, and coordinated workflows to accelerate scientific discovery.
Chinese Translation
当今的科学挑战,从气候建模到惯性约束聚变设计,再到新材料设计,都需要探索巨大的设计空间。为了实现高影响力的科学发现,我们需要提升快速测试假设、生成结果和从中学习的能力。我们提出了MADA(多智能体设计助手),这是一个基于大型语言模型(LLM)的多智能体框架,协调专门的智能体以应对复杂的设计工作流程。作业管理智能体(JMA)在高性能计算(HPC)系统上启动并管理集合模拟,几何智能体(GA)生成网格,而逆向设计智能体(IDA)则根据模拟结果提出新的设计。尽管具有通用性,我们的开发和验证主要集中在Richtmyer-Meshkov不稳定性(RMI)抑制上,这是惯性约束聚变中的一个关键挑战。我们在两个互补的设置中进行评估:在HPC系统上运行流体动力学模拟,以及使用预训练的机器学习替代模型进行快速设计探索。我们的结果表明,MADA系统成功地执行了迭代设计优化,自动改进设计以实现最佳的RMI抑制,且手动干预最小。我们的框架减少了繁琐的手动工作流程设置,并实现了大规模的自动化设计探索。更广泛地说,它展示了一种可重用的模式,用于将推理、模拟、专业工具和协调工作流程结合起来,以加速科学发现。
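The JMA/GA/IDA coordination described above can be sketched as a simple iterative design loop. All names and the scalar "design" are placeholders for illustration only; in MADA each agent wraps an LLM, a mesher, or an HPC job manager:

```python
def design_loop(initial_design, geometry_agent, job_agent, inverse_agent,
                n_iterations=5):
    """Iteratively refine a design: mesh it, simulate it, then let the
    inverse-design agent propose an improvement from the outcome history."""
    design, history = initial_design, []
    for _ in range(n_iterations):
        mesh = geometry_agent(design)        # GA: generate mesh
        outcome = job_agent(mesh)            # JMA: run/collect simulation
        history.append((design, outcome))
        design = inverse_agent(history)      # IDA: propose next design
    return min(history, key=lambda pair: pair[1])  # lowest instability wins
```

The same loop works whether `job_agent` launches an ensemble on an HPC cluster or evaluates a pre-trained surrogate, which is the paper's second evaluation setting.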
cs.AI / 26 / 2603.11535
Expert Threshold Routing for Autoregressive Language Modeling with Dynamic Computation Allocation and Load Balancing
具有动态计算分配和负载均衡的自回归语言建模专家阈值路由
Abstract
Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation allocation while achieving load balance without auxiliary losses. This fully causal mechanism eliminates dependence on other tokens in the batch, making it well-suited for autoregressive language modeling. In pretraining experiments scaling to 2.4B parameters on FineWeb-Edu, ET achieves 0.067 lower cross-entropy loss than TC-MoE, equivalent to reaching the same performance with 1.6$\times$ fewer tokens.
Chinese Translation
Token-choice Mixture-of-Experts (TC-MoE) 将每个标记路由到固定数量的专家,这限制了动态计算分配,并需要辅助损失来维持负载平衡。我们提出了专家阈值(Expert Threshold, ET)路由,其中每个专家维护一个基于全局标记分布估计的指数移动平均(EMA)阈值。在训练和推理过程中,若某个标记的得分超过专家的阈值,该标记即被独立路由到该专家,从而实现动态计算分配,同时无需辅助损失即可实现负载平衡。这种完全因果的机制消除了对批次中其他标记的依赖,使其非常适合自回归语言建模。在FineWeb-Edu上扩展至24亿参数的预训练实验中,ET的交叉熵损失比TC-MoE低0.067,相当于仅用1/1.6的标记量即可达到相同性能。
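The per-expert EMA threshold mechanism can be sketched in a few lines of numpy. This is our reading of the abstract, with one assumption labeled explicitly: we calibrate each threshold toward the score quantile that admits a target fraction of tokens, whereas the paper's exact estimator from the global token distribution may differ:

```python
import numpy as np

class ExpertThresholdRouter:
    def __init__(self, n_experts, target_rate=0.25, momentum=0.99):
        self.thresholds = np.zeros(n_experts)   # one EMA threshold per expert
        self.target_rate = target_rate          # desired token fraction/expert
        self.momentum = momentum

    def route(self, scores, update=True):
        """scores: (n_tokens, n_experts) router affinities.
        Returns a boolean dispatch mask. Each token is compared against the
        thresholds independently, so routing is causal w.r.t. other tokens."""
        mask = scores > self.thresholds
        if update:
            # EMA toward the per-expert quantile admitting target_rate tokens
            q = np.quantile(scores, 1.0 - self.target_rate, axis=0)
            self.thresholds = (self.momentum * self.thresholds
                               + (1 - self.momentum) * q)
        return mask
```

Because the decision depends only on a token's own score and the slowly-moving thresholds, load balance emerges from calibration rather than from an auxiliary loss, and hard tokens can naturally activate more experts than easy ones.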
cs.AI / 27 / 2603.11559
AI Knows What's Wrong But Cannot Fix It: Helicoid Dynamics in Frontier LLMs Under High-Stakes Decisions
人工智能知道问题所在但无法解决:高风险决策下前沿大型语言模型中的螺旋动力学
Abstract
Large language models perform reliably when their outputs can be checked: solving equations, writing code, retrieving facts. They perform differently when checking is impossible, as when a clinician chooses an irreversible treatment on incomplete data, or an investor commits capital under fundamental uncertainty. Helicoid dynamics is the name given to a specific failure regime in that second domain: a system engages competently, drifts into error, accurately names what went wrong, then reproduces the same pattern at a higher level of sophistication, recognizing it is looping and continuing nonetheless. This prospective case series documents that regime across seven leading systems (Claude, ChatGPT, Gemini, Grok, DeepSeek, Perplexity, Llama families), tested across clinical diagnosis, investment evaluation, and high-consequence interview scenarios. Despite explicit protocols designed to sustain rigorous partnership, all exhibited the pattern. When confronted with it, they attributed its persistence to structural factors in their training, beyond what conversation can reach. Under high stakes, when being rigorous and being comfortable diverge, these systems tend toward comfort, becoming less reliable precisely when reliability matters most. Twelve testable hypotheses are proposed, with implications for agentic AI oversight and human-AI collaboration. The helicoid is tractable. Identifying it, naming it, and understanding its boundary conditions are the necessary first steps toward LLMs that remain trustworthy partners precisely when the decisions are hardest and the stakes are highest.
Chinese Translation
大型语言模型在其输出可以被检查时表现可靠:解决方程、编写代码、检索事实。当检查不可能时,它们的表现则有所不同,例如当临床医生在不完整数据上选择不可逆转的治疗,或投资者在基本不确定性下投入资本时。螺旋动力学是指在第二个领域中出现的一种特定失败模式:系统能够胜任地参与,逐渐陷入错误,准确指出出错的原因,然后在更高的复杂性水平上重现相同的模式,意识到自己在循环中却仍然继续。该前瞻性案例系列记录了七个领先系统(Claude、ChatGPT、Gemini、Grok、DeepSeek、Perplexity、Llama系列)在临床诊断、投资评估和高后果面试场景中的表现,尽管有明确的协议旨在维持严格的合作,但所有系统都表现出这一模式。当面对这一模式时,它们将其持续性归因于训练中的结构性因素,超出了对话所能触及的范围。在高风险情况下,当严谨性与舒适性发生分歧时,这些系统倾向于选择舒适,恰恰在可靠性最为重要时变得不那么可靠。提出了十二个可测试的假设,涉及代理人工智能监督和人机协作的影响。螺旋是可处理的。识别、命名并理解其边界条件是实现大型语言模型在决策最困难、风险最高时仍能保持可信赖伙伴的必要第一步。
cs.AI / 28 / 2603.11594
Leveraging Large Language Models and Survival Analysis for Early Prediction of Chemotherapy Outcomes
利用大型语言模型和生存分析进行化疗结果的早期预测
Abstract
Chemotherapy for cancer treatment is costly and accompanied by severe side effects, highlighting the critical need for early prediction of treatment outcomes to improve patient management and informed decision-making. Predictive models for chemotherapy outcomes using real-world data face challenges, including the absence of explicit phenotypes and treatment outcome labels such as cancer progression and toxicity. This study addresses these challenges by employing Large Language Models (LLMs) and ontology-based techniques for phenotype and outcome-label extraction from patient notes. We focused on one of the most frequently occurring cancers, breast cancer, due to its high prevalence and significant variability in patient response to treatment, making it a critical area for improving predictive modeling. The dataset included features such as vitals, demographics, staging, biomarkers, and performance scales. Drug regimens and their combinations were extracted from the chemotherapy plans in the EMR data, shortlisted based on NCCN guidelines, verified against NIH standards, and analyzed through survival modeling. The proposed approach significantly reduced phenotype sparsity and improved predictive accuracy. Random Survival Forest was used to predict time-to-failure, achieving a C-index of 73%, and was also used as a classifier at a specific time point to predict treatment outcomes, with accuracy and F1 scores above 70%. The outcome probabilities were validated for reliability with calibration curves. We extended our approach to four other cancer types. This research highlights the potential of early prediction of treatment outcomes using LLM-based clinical data extraction, enabling personalized treatment plans and better patient outcomes.
Chinese Translation
癌症治疗的化疗费用高昂且伴随严重的副作用,这突显了早期预测治疗结果以改善患者管理和知情决策的关键需求。使用真实世界数据的化疗结果预测模型面临诸多挑战,包括缺乏明确的表型和治疗结果标签,如癌症进展和毒性。本研究通过采用大型语言模型(LLMs)和基于本体的技术,从患者记录中提取表型和结果标签,来应对这些挑战。我们专注于最常见的癌症之一——乳腺癌,因其高发病率和患者对治疗反应的显著差异,使其成为改善预测建模的关键领域。数据集包括生命体征、人口统计学、分期、生物标志物和功能评分等特征。药物方案及其组合从电子病历(EMR)数据中的化疗计划中提取,并根据国家综合癌症网络(NCCN)指南进行筛选,经过美国国立卫生研究院(NIH)标准验证,并通过生存建模进行分析。所提出的方法显著减少了表型稀疏性并提高了预测准确性。随机生存森林(Random Survival Forest)被用于预测失败时间,达到了73%的C指数,并在特定时间点用作分类器以预测治疗结果,其准确率和F1分数均超过70%。结果概率通过校准曲线验证了其可靠性。我们将该方法扩展到其他四种癌症类型。本研究强调了利用基于LLM的临床数据提取进行治疗结果早期预测的潜力,从而实现个性化治疗方案并改善患者结果。
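The C-index quoted above measures how well predicted risk scores rank patients by time-to-event, counting only comparable pairs (a pair is comparable when the earlier time is an observed event, not a censoring). A minimal reference implementation of Harrell's concordance index:

```python
def concordance_index(times, events, risks):
    """times: observed times; events: 1 if event occurred, 0 if censored;
    risks: predicted risk scores (higher risk = earlier expected failure)."""
    concordant, ties, comparable = 0, 0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair is comparable only if i has the earlier time AND an event
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1        # risk ranking agrees with outcome
                elif risks[i] == risks[j]:
                    ties += 1              # ties count half
    return (concordant + 0.5 * ties) / comparable

# Perfectly ranked risks (highest risk fails first) give a C-index of 1:
c = concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4.0, 3.0, 2.0, 1.0])
```

A C-index of 0.5 corresponds to random ranking, so the paper's 73% indicates substantially better-than-chance discrimination.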
cs.AI / 29 / 2603.11601
See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay
观察、符号化、行动:以空间表示为视觉语言模型提供基础,实现更好的游戏表现
Abstract
Vision-Language Models (VLMs) excel at describing visual scenes, yet struggle to translate perception into precise, grounded actions. We investigate whether providing VLMs with both the visual frame and the symbolic representation of the scene can improve their performance in interactive environments. We evaluate three state-of-the-art VLMs across Atari games, VizDoom, and AI2-THOR, comparing frame-only, frame with self-extracted symbols, frame with ground-truth symbols, and symbol-only pipelines. Our results indicate that all models benefit when the symbolic information is accurate. However, when VLMs extract symbols themselves, performance becomes dependent on model capability and scene complexity. We further investigate how accurately VLMs can extract symbolic information from visual inputs and how noise in these symbols affects decision-making and gameplay performance. Our findings reveal that symbolic grounding is beneficial in VLMs only when symbol extraction is reliable, and highlight perception quality as a central bottleneck for future VLM-based agents.
Chinese Translation
视觉语言模型(VLMs)在描述视觉场景方面表现出色,但在将感知转化为精确且有根基的行动时却面临挑战。我们研究了是否通过同时提供视觉帧和场景的符号表示,可以提升VLMs在互动环境中的表现。我们在Atari游戏、VizDoom和AI2-THOR上评估了三种最先进的VLMs,比较了仅使用帧、帧与自提取符号、帧与真实符号以及仅使用符号的不同管道。我们的结果表明,当符号信息准确时,所有模型均受益。然而,当VLMs自行提取符号时,性能则依赖于模型能力和场景复杂性。我们进一步研究了VLMs从视觉输入中提取符号信息的准确性,以及这些符号中的噪声如何影响决策和游戏表现。我们的发现揭示,符号基础在VLMs中只有在符号提取可靠时才是有益的,并强调感知质量是未来基于VLM的代理的核心瓶颈。
cs.AI / 30 / 2603.11623
The Density of Cross-Persistence Diagrams and Its Applications
交叉持久性图的密度及其应用
Abstract
Topological Data Analysis (TDA) provides powerful tools to explore the shape and structure of data through topological features such as clusters, loops, and voids. Persistence diagrams are a cornerstone of TDA, capturing the evolution of these features across scales. While effective for analyzing individual manifolds, persistence diagrams do not account for interactions between pairs of them. Cross-persistence diagrams (cross-barcodes), introduced recently, address this limitation by characterizing relationships between topological features of two point clouds. In this work, we present the first systematic study of the density of cross-persistence diagrams. We prove its existence, establish theoretical foundations for its statistical use, and design the first machine learning framework for predicting cross-persistence density directly from point cloud coordinates and distance matrices. Our statistical approach enables the distinction of point clouds sampled from different manifolds by leveraging the linear characteristics of cross-persistence diagrams. Interestingly, we find that introducing noise can enhance our ability to distinguish point clouds, uncovering its novel utility in TDA applications. We demonstrate the effectiveness of our methods through experiments on diverse datasets, where our approach consistently outperforms existing techniques in density prediction and achieves superior results in point cloud distinction tasks. Our findings contribute to a broader understanding of cross-persistence diagrams and open new avenues for their application in data analysis, including potential insights into time-series domain tasks and the geometry of AI-generated texts. Our code is publicly available at https://github.com/Verdangeta/TDA_experiments
Chinese Translation
拓扑数据分析(TDA)提供了强大的工具,通过聚类、循环和空洞等拓扑特征探索数据的形状和结构。持久性图是TDA的基石,捕捉这些特征在不同尺度上的演变。尽管持久性图在分析单个流形时有效,但它们并未考虑流形之间的相互作用。最近提出的交叉持久性图(cross-persistence diagrams,交叉条形码)解决了这一局限性,通过表征两个点云的拓扑特征之间的关系。在本研究中,我们首次系统地研究了交叉持久性图的密度。我们证明了其存在性,为其统计使用建立了理论基础,并设计了第一个机器学习框架,能够直接从点云坐标和距离矩阵预测交叉持久性密度。我们的方法通过利用交叉持久性图的线性特征,使得能够区分来自不同流形的点云。有趣的是,我们发现引入噪声可以增强我们区分点云的能力,揭示了其在TDA应用中的新颖用途。我们通过对多样化数据集的实验展示了我们方法的有效性,其中我们的方法在密度预测上始终优于现有技术,并在点云区分任务中取得了更优的结果。我们的研究结果有助于更广泛地理解交叉持久性图,并为其在数据分析中的应用开辟了新的途径,包括对时间序列领域任务和AI生成文本几何的潜在洞察。我们的代码已公开,地址为 https://github.com/Verdangeta/TDA_experiments
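The central object above, a density over a persistence-type diagram, can be illustrated with a kernel density estimate over (birth, death) pairs. Computing an actual cross-barcode requires a TDA library and is omitted; this sketch only shows what "density of a diagram" means as a 2D estimate:

```python
import numpy as np

def diagram_density(points, grid, bandwidth=0.1):
    """Gaussian KDE over 2D (birth, death) points.

    points: (n, 2) diagram points; grid: (m, 2) evaluation locations.
    Returns the estimated density at each grid row."""
    diffs = grid[:, None, :] - points[None, :, :]      # (m, n, 2)
    sq = (diffs ** 2).sum(axis=-1)                     # squared distances
    norm = 2 * np.pi * bandwidth ** 2 * len(points)    # 2D Gaussian constant
    return np.exp(-sq / (2 * bandwidth ** 2)).sum(axis=1) / norm
```

Comparing such density estimates between diagrams is one simple way the statistical use described in the abstract could distinguish point clouds sampled from different manifolds.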
cs.AI / 31 / 2603.11631
VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
VisDoT:通过类人解释的基础和思维分解增强视觉推理
Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
Chinese Translation
大型视觉-语言模型(LVLMs)在图表中可靠地检测视觉基元并将其与语义表示对齐方面存在困难,这严重限制了它们在复杂视觉推理中的表现。这种感知基础的缺乏构成了基于图表推理的主要瓶颈。我们提出了VisDoT,一个通过类人解释基础增强视觉推理的框架。我们基于图形感知理论形式化了四个感知任务,包括位置和长度。在此基础上,我们引入了思维分解(Decomposition-of-Thought, DoT)提示,它将问题顺序地分解为视觉感知子问题和逻辑子问题。使用VisDoT对InternVL进行微调,在ChartQA上实现了+11.2%的提升,并在更具挑战性的ChartQAPro基准上超越了GPT-4o。在新引入的VisDoTQA基准上,该模型提升了+33.2%。此外,在多样化的开放领域视觉问答基准上持续的零样本提升,证实了感知-逻辑分离策略在视觉问答中的普遍适用性。VisDoT利用类人的感知来增强视觉基础,实现了最先进的图表理解和可解释的视觉推理。
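The perception/logic separation of DoT prompting can be sketched as a two-stage pipeline. This is a hypothetical outline of the idea, not the paper's prompts; `ask` stands in for any LVLM call taking a prompt and an optional chart image:

```python
def dot_answer(question, chart, ask):
    """Decomposition-of-Thought: answer perception sub-questions against the
    chart first, then solve a logic sub-question over the extracted evidence."""
    perception_qs = ask(
        "List the visual perception sub-questions (positions, lengths, "
        f"values to read off the chart) needed to answer: {question}",
        chart)
    evidence = [ask(q, chart) for q in perception_qs]  # grounded readings
    return ask(
        f"Using only this extracted evidence {evidence}, reason and "
        f"answer: {question}",
        None)  # the logic step needs no pixels, only the evidence
```

Keeping the logic step blind to the image forces all visual grounding through the perception sub-questions, which is what makes the reasoning auditable.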
cs.AI / 32 / 2603.11679
LLMs can construct powerful representations and streamline sample-efficient supervised learning
大型语言模型可以构建强大的表征并简化样本高效的监督学习
Abstract
As real-world datasets become increasingly complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data for downstream tasks, such as time-series, free text, and structured records, often requires non-trivial domain-specific engineering. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric-based approaches significantly outperform traditional count-feature models, naive text-serialization-based LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer several advantages for operational healthcare settings: they are easy to audit, cost-effective to deploy at scale, and convertible to tabular representations that unlock a swath of machine learning techniques.
Chinese Translation
随着现实世界数据集变得越来越复杂和异质,监督学习常常受到输入表征设计的瓶颈。对多模态数据进行建模以用于下游任务,例如时间序列、自由文本和结构化记录,通常需要非平凡的领域特定工程。我们提出了一种代理管道来简化这一过程。首先,一个大型语言模型(LLM)在上下文中分析一小部分但多样化的文本序列化输入示例,以合成一个全局准则(rubric),该准则作为提取和组织证据的程序化规范。然后,利用该准则将输入的简单文本序列化转换为更标准化的格式,以供下游模型使用。我们还描述了局部准则,这些是由LLM生成的任务条件摘要。在EHRSHOT基准的15个临床任务中,我们基于准则的方法显著优于传统的计数特征模型、基于简单文本序列化的LLM基线,以及一个在多出数个数量级的数据上预训练的临床基础模型。除了性能之外,准则在实际医疗运营环境中还提供了几个优势,例如易于审计、可低成本地大规模部署,并且可以转换为表格表示,从而解锁一系列机器学习技术。
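The two-stage rubric pipeline can be outlined as follows. The function names, prompt wordings, and toy rubric format are ours, intended only to show the data flow: induce a global rubric from a few examples, apply it to every input, then tabularize:

```python
def build_rubric(sample_serializations, llm):
    """Induce a global rubric in-context from a few serialized examples."""
    prompt = ("From these examples, write a rubric: a named list of fields "
              "to extract and how to normalize each.\n"
              + "\n".join(sample_serializations))
    return llm(prompt)

def apply_rubric(serialization, rubric, llm):
    """Standardize one input into a field -> value mapping per the rubric."""
    return llm(f"Extract, per this rubric, a field->value mapping.\n"
               f"Rubric: {rubric}\nInput: {serialization}")

def to_rows(serializations, rubric, llm):
    """Convert rubric outputs into a tabular representation."""
    records = [apply_rubric(s, rubric, llm) for s in serializations]
    fields = sorted({k for r in records for k in r})  # table columns
    return [[r.get(f) for f in fields] for r in records], fields
```

The final tabular form is what unlocks conventional machine learning downstream, as the abstract notes.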
cs.AI / 33 / 2603.11689
Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks
用于验证和增强多模态大语言模型在零样本任务中的显式逻辑通道
Abstract
Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner, making it important to validate and understand their behavior before application. We propose an Explicit Logic Channel, operating in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection, and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves zero-shot performance over MLLMs alone, grounded in explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, MC-VQA and HC-REC, across three challenging benchmarks, using 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection, and improvement of MLLMs with enhanced explainability and trustworthiness.
Chinese Translation
前沿多模态大语言模型(MLLMs)在视觉-语言理解(VLC)任务中展现出卓越的能力。然而,它们通常以黑箱的方式作为零样本解决方案部署到新任务中。验证和理解这些模型的行为对于应用于新任务变得至关重要。我们提出了一种显式逻辑通道,与黑箱模型通道并行,以执行显式逻辑推理用于模型验证、选择和增强。前沿 MLLM 封装了潜在的视觉-语言知识,可以视为隐式逻辑通道。所提出的显式逻辑通道模拟人类逻辑推理,结合了一个大语言模型(LLM)、一个视觉基础模型(VFM)和基于概率推理的逻辑推理,用于对显式视觉证据进行事实、反事实和关系推理。我们提出了一种一致性率(CR)用于跨通道验证和模型选择,即使在没有真实标注的情况下也可进行。此外,跨通道集成进一步提高了 MLLMs 在零样本任务中的表现,以显式视觉证据为基础增强可信度。我们在三个具有挑战性的基准上对两个代表性的 VLC 任务(即 MC-VQA 和 HC-REC)进行了全面实验,使用了来自四个前沿家族的 11 个最新开源 MLLMs。我们的系统评估展示了所提出的显式逻辑通道(ELC)和一致性率(CR)在增强可解释性和可信度的 MLLMs 上进行模型验证、选择和改进的有效性。
cs.AI / 34 / 2603.11691
STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-task Multi-agent Reinforcement Learning
STAIRS-Former:具有交错递归结构的时空注意力变换器用于离线多任务多智能体强化学习
Abstract
Offline multi-agent reinforcement learning (MARL) with multi-task datasets is challenging due to varying numbers of agents across tasks and the need to generalize to unseen scenarios. Prior works employ transformers with observation tokenization and hierarchical skill learning to address these issues. However, they underutilize the transformer attention mechanism for inter-agent coordination and rely on a single history token, which limits their ability to capture long-horizon temporal dependencies in partially observable MARL settings. In this paper, we propose STAIRS-Former, a transformer architecture augmented with spatial and temporal hierarchies that enables effective attention over critical tokens while capturing long interaction histories. We further introduce token dropout to enhance robustness and generalization across varying agent populations. Extensive experiments on diverse multi-agent benchmarks, including SMAC, SMAC-v2, MPE, and MaMuJoCo, with multi-task datasets demonstrate that STAIRS-Former consistently outperforms prior methods and achieves new state-of-the-art performance.
Chinese Translation
离线多智能体强化学习(MARL)在多任务数据集上面临挑战,因为不同任务中的智能体数量各异,并且需要对未见场景进行泛化。之前的研究采用了带有观察令牌化和层次技能学习的变换器来解决这些问题。然而,它们未充分利用变换器的注意力机制进行智能体间协调,并依赖于单一的历史令牌,这限制了它们在部分可观察的MARL环境中捕捉长期时间依赖的能力。在本文中,我们提出了STAIRS-Former,一种增强了空间和时间层次的变换器架构,能够在捕捉长时间交互历史的同时,对关键令牌进行有效注意力处理。我们进一步引入了令牌丢弃,以增强在不同智能体群体中的鲁棒性和泛化能力。在包括SMAC、SMAC-v2、MPE和MaMuJoCo在内的多样化多智能体基准测试上进行的广泛实验表明,STAIRS-Former始终优于之前的方法,并实现了新的最先进性能。
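The token dropout mentioned above can be sketched in a few lines. This is our reading of the idea (randomly mask per-agent observation tokens during training so the model cannot rely on a fixed agent count); the paper's exact scheme may differ:

```python
import numpy as np

def token_dropout(tokens, drop_prob, rng):
    """tokens: (n_tokens, d) array. Returns surviving tokens and kept indices;
    at least one token is always kept so the sequence is never empty."""
    keep = rng.random(tokens.shape[0]) >= drop_prob
    if not keep.any():
        keep[rng.integers(tokens.shape[0])] = True
    return tokens[keep], np.flatnonzero(keep)
```

Training with varying numbers of surviving tokens exposes the transformer to effectively different agent populations, which is the robustness and generalization effect the abstract describes.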
cs.AI / 35 / 2603.11709
Scaling Laws for Educational AI Agents
教育人工智能代理的规模法则
Abstract
While scaling laws for Large Language Models (LLMs) have been extensively studied along dimensions of model parameters, training data, and compute, the scaling behavior of LLM-based educational agents remains unexplored. We propose that educational agent capability scales not merely with the underlying model size, but through structured dimensions that we collectively term the Agent Scaling Law: role definition clarity, skill depth, tool completeness, runtime capability, and educator expertise injection. Central to this framework is AgentProfile, a structured JSON-based specification that serves as the mechanism enabling systematic capability growth of educational agents. We present EduClaw, a profile-driven multi-agent platform that operationalizes this scaling law, demonstrating its effectiveness through the construction and deployment of 330+ educational agent profiles encompassing 1,100+ skill modules across K-12 subjects. Our empirical observations suggest that educational agent performance scales predictably with profile structural richness. We identify two complementary scaling axes -- Tool Scaling and Skill Scaling -- as future directions, arguing that the path to more capable educational AI lies not solely in larger models, but in stronger structured capability systems.
Chinese Translation
尽管对大型语言模型(LLMs)的规模法则在模型参数、训练数据和计算能力等维度上进行了广泛研究,但基于LLM的教育代理的规模行为仍未得到探索。我们提出,教育代理的能力不仅与基础模型的大小相关,还通过我们统称为代理规模法则的结构化维度进行扩展:角色定义清晰度、技能深度、工具完整性、运行时能力和教育者专业知识注入。该框架的核心是AgentProfile,这是一种基于结构化JSON的规范,作为实现教育代理系统能力增长的机制。我们展示了EduClaw,这是一个以配置文件为驱动的多代理平台,能够实现这一规模法则,通过构建和部署330多个教育代理配置文件,涵盖1100多个K-12学科的技能模块,展示其有效性。我们的实证观察表明,教育代理的性能与配置文件的结构丰富性呈可预测的规模关系。我们确定了两个互补的规模轴——工具规模和技能规模——作为未来的研究方向,认为更强大的教育人工智能的路径不仅在于更大的模型,而在于更强的结构化能力系统。
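The five scaling dimensions named above suggest what an AgentProfile might look like. The field names below are our guess at the shape of the JSON specification, not the platform's actual schema:

```python
import json

# A hypothetical AgentProfile covering the five dimensions of the proposed
# Agent Scaling Law: role clarity, skill depth, tool completeness, runtime
# capability, and educator expertise injection.
profile = {
    "role": {"name": "algebra-tutor", "scope": "K-12 mathematics",
             "definition": "Guides students through equation solving"},
    "skills": [{"id": "linear-equations", "depth": 3},
               {"id": "word-problems", "depth": 2}],
    "tools": ["calculator", "graph-plotter"],
    "runtime": {"max_turns": 20, "memory": "per-student"},
    "educator_knowledge": ["common misconception: sign errors when moving "
                           "terms across the equals sign"],
}

serialized = json.dumps(profile, indent=2)  # what a platform would load
```

Under the paper's hypothesis, enriching any of these structured fields (more skills, deeper skill specifications, more tools) scales agent capability independently of the underlying model size.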
cs.AI / 36 / 2603.11721
When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows
当 OpenClaw 遇见医院:面向动态临床工作流程的代理操作系统
Abstract
Large language model (LLM) agents extend conventional generative models by integrating reasoning, tool invocation, and persistent memory. Recent studies suggest that such agents may significantly improve clinical workflows by automating documentation, coordinating care processes, and assisting medical decision making. However, despite rapid progress, deploying autonomous agents in healthcare environments remains difficult due to reliability limitations, security risks, and insufficient long-term memory mechanisms. This work proposes an architecture that adapts LLM agents for hospital environments. The design introduces four core components: a restricted execution environment inspired by Linux multi-user systems, a document-centric interaction paradigm connecting patient and clinician agents, a page-indexed memory architecture designed for long-term clinical context management, and a curated medical skills library enabling ad-hoc composition of clinical task sequences. Rather than granting agents unrestricted system access, the architecture constrains actions through predefined skill interfaces and resource isolation. We argue that such a system forms the basis of an Agentic Operating System for Hospital, a computing layer capable of coordinating clinical workflows while maintaining safety, transparency, and auditability. This work grounds the design in OpenClaw, an open-source autonomous agent framework that structures agent capabilities as a curated library of discrete skills, and extends it with the infrastructure-level constraints required for safe clinical deployment.
Chinese Translation
大型语言模型(LLM)代理通过集成推理、工具调用和持久记忆,扩展了传统生成模型。近期研究表明,这类代理可能通过自动化文档处理、协调护理流程和辅助医疗决策显著改善临床工作流程。然而,尽管取得了快速进展,在医疗环境中部署自主代理仍然面临可靠性限制、安全风险和长期记忆机制不足等困难。本研究提出了一种适应医院环境的 LLM 代理架构。该设计引入了四个核心组件:受 Linux 多用户系统启发的受限执行环境、连接患者和临床医生代理的文档中心交互范式、旨在长期临床上下文管理的页面索引记忆架构,以及一个支持临床任务序列临时组合的策划医疗技能库。该架构通过预定义的技能接口和资源隔离来限制代理的行为,而不是授予其无限制的系统访问权限。我们认为,这样的系统构成了医院代理操作系统的基础,这是一个能够协调临床工作流程,同时保持安全性、透明性和可审计性的计算层。本研究以 OpenClaw 为基础,OpenClaw 是一个开源自主代理框架,将代理能力结构化为策划的离散技能库,并扩展了安全临床部署所需的基础设施级约束。
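Of the four components above, the page-indexed memory is the most concrete, and can be sketched as follows. This is our illustrative construction, not the paper's design: clinical context is stored in fixed-size pages with a keyword index, so an agent pages in only relevant long-term context rather than the full history:

```python
class PageIndexedMemory:
    def __init__(self, page_size=3):
        self.page_size = page_size
        self.pages = []   # list of pages, each a list of entries
        self.index = {}   # keyword -> set of page numbers

    def write(self, entry, keywords):
        """Append an entry, opening a new page when the current one is full."""
        if not self.pages or len(self.pages[-1]) >= self.page_size:
            self.pages.append([])
        self.pages[-1].append(entry)
        page_no = len(self.pages) - 1
        for kw in keywords:
            self.index.setdefault(kw, set()).add(page_no)

    def read(self, keyword):
        """Page in every page whose index mentions the keyword (retrieval is
        at page granularity, so co-located entries come along)."""
        return [e for p in sorted(self.index.get(keyword, ()))
                for e in self.pages[p]]
```

Page-granular retrieval keeps lookups cheap and bounds how much context is loaded per query, which matters for long-running clinical workflows.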
cs.AI / 37 / 2603.11736
Gender Bias in Generative AI-assisted Recruitment Processes
生成性人工智能辅助招聘过程中的性别偏见
Abstract
In recent years, generative artificial intelligence (GenAI) systems have assumed increasingly crucial roles in selection processes, personnel recruitment and analysis of candidates' profiles. However, the employment of large language models (LLMs) risks reproducing, and in some cases amplifying, gender stereotypes and bias already present in the labour market. The objective of this paper is to evaluate and measure this phenomenon, analysing how a state-of-the-art generative model (GPT-5) suggests occupations based on gender and work experience background, focusing on under-35-year-old Italian graduates. The model has been prompted to suggest jobs to 24 simulated candidate profiles, which are balanced in terms of gender, age, experience and professional field. Although no significant differences emerged in job titles and industry, gendered linguistic patterns emerged in the adjectives attributed to female and male candidates, indicating a tendency of the model to associate women with emotional and empathetic traits, while men with strategic and analytical ones. The research raises an ethical question regarding the use of these models in sensitive processes, highlighting the need for transparency and fairness in future digital labour markets.
Chinese Translation
近年来,生成性人工智能(GenAI)系统在选拔过程、人员招聘和候选人档案分析中扮演了越来越重要的角色。然而,使用大型语言模型(LLMs)可能会重现并在某些情况下放大劳动市场中已经存在的性别刻板印象和偏见。本文的目的是评估和测量这一现象,分析一种最先进的生成模型(GPT-5)如何根据性别和工作经验背景建议职业,重点关注35岁以下的意大利毕业生。该模型被要求为24个模拟候选人档案建议工作,这些档案在性别、年龄、经验和专业领域上保持平衡。尽管在职位名称和行业上没有显著差异,但在赋予女性和男性候选人的形容词中出现了性别化的语言模式,表明模型倾向于将女性与情感和同理心特质联系在一起,而将男性与战略和分析特质联系在一起。该研究提出了一个伦理问题,即在敏感过程中使用这些模型的合理性,强调了未来数字劳动市场中透明性和公平性的必要性。
cs.AI / 38 / 2603.11745
CINDI: Conditional Imputation and Noisy Data Integrity with Flows in Power Grid Data
CINDI:基于流的电网数据条件插补与噪声数据完整性
Abstract
Real-world multivariate time series, particularly in critical infrastructure such as electrical power grids, are often corrupted by noise and anomalies that degrade the performance of downstream tasks. Standard data cleaning approaches often rely on disjoint strategies, which involve detecting errors with one model and imputing them with another. Such approaches can fail to capture the full joint distribution of the data and ignore prediction uncertainty. This work introduces Conditional Imputation and Noisy Data Integrity (CINDI), an unsupervised probabilistic framework designed to restore data integrity in complex time series. Unlike fragmented approaches, CINDI unifies anomaly detection and imputation into a single end-to-end system built on conditional normalizing flows. By modeling the exact conditional likelihood of the data, the framework identifies low-probability segments and iteratively samples statistically consistent replacements. This allows CINDI to efficiently reuse learned information while preserving the underlying physical and statistical properties of the system. We evaluate the framework using real-world grid loss data from a Norwegian power distribution operator, though the methodology is designed to generalize to any multivariate time series domain. The results demonstrate that CINDI yields robust performance compared to competitive baselines, offering a scalable solution for maintaining reliability in noisy environments.
Chinese Translation
现实世界中的多变量时间序列,特别是在电力等关键基础设施中,常常受到噪声和异常的干扰,这会降低下游任务的性能。标准的数据清理方法通常依赖于不相交的策略,即使用一个模型检测错误,再用另一个模型进行插补。这种方法可能无法捕捉数据的完整联合分布,并忽视预测的不确定性。本研究提出了条件插补与噪声数据完整性(CINDI),这是一个无监督的概率框架,旨在恢复复杂时间序列中的数据完整性。与碎片化的方法不同,CINDI将异常检测和插补统一为一个基于条件归一化流的端到端系统。通过建模数据的确切条件似然,该框架识别低概率段并迭代地抽样统计一致的替代值。这使得CINDI能够高效地重用学习到的信息,同时保持系统的基本物理和统计特性。我们使用来自挪威电力分配运营商的实际电网损失数据评估该框架,尽管该方法旨在推广到任何多变量时间序列领域。结果表明,CINDI相比竞争基线表现出强大的性能,提供了一种可扩展的解决方案,以在噪声环境中维持可靠性。
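The detect-then-resample idea, unified in one model, can be illustrated with a toy stand-in. Here a fitted Gaussian plays the role of the conditional normalizing flow (which would give exact conditional likelihoods); low-likelihood points are flagged and replaced with statistically consistent samples from the same model:

```python
import numpy as np

def clean_series(x, z_thresh=4.0, rng=None):
    """Flag points with low likelihood under a fitted Gaussian (|z| above
    z_thresh) and replace them with samples drawn from that same model."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = x.mean(), x.std()
    anomalous = np.abs(x - mu) > z_thresh * sigma   # low-probability segments
    cleaned = x.copy()
    cleaned[anomalous] = rng.normal(mu, sigma, anomalous.sum())
    return cleaned, anomalous
```

CINDI's key difference from this toy is that detection and imputation share one learned joint model conditioned on the rest of the series, so replacements respect temporal structure rather than just the marginal distribution.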
cs.AI / 39 / 2603.11756
Anomaly detection in time-series via inductive biases in the latent space of conditional normalizing flows
通过条件归一化流的潜在空间中的归纳偏差进行时间序列异常检测
Abstract
Deep generative models for anomaly detection in multivariate time-series are typically trained by maximizing data likelihood. However, likelihood in observation space measures marginal density rather than conformity to structured temporal dynamics, and therefore can assign high probability to anomalous or out-of-distribution samples. We address this structural limitation by relocating the notion of anomaly to a prescribed latent space. We introduce explicit inductive biases in conditional normalizing flows, modeling time-series observations within a discrete-time state-space framework that constrains latent representations to evolve according to prescribed temporal dynamics. Under this formulation, expected behavior corresponds to compliance with a specified distribution over latent trajectories, while anomalies are defined as violations of these dynamics. Anomaly detection is consequently reduced to a statistically grounded compliance test, such that observations are mapped to latent space and evaluated via goodness-of-fit tests against the prescribed latent evolution. This yields a principled decision rule that remains effective even in regions of high observation likelihood. Experiments on synthetic and real-world time-series demonstrate reliable detection of anomalies in frequency, amplitude, and observation noise, while providing interpretable diagnostics of model compliance.
Chinese Translation
多变量时间序列中的异常检测通常通过最大化数据似然性来训练深度生成模型。然而,观察空间中的似然性测量的是边际密度,而不是对结构化时间动态的符合性,因此可能会对异常或分布外样本分配高概率。我们通过将异常的概念转移到规定的潜在空间来解决这一结构性限制。我们在条件归一化流中引入明确的归纳偏差,在离散时间状态空间框架内建模时间序列观察,约束潜在表示根据规定的时间动态演变。在这种表述下,预期行为对应于符合潜在轨迹上指定分布的要求,而异常则被定义为对这些动态的违反。因此,异常检测被简化为一个基于统计的合规性测试,使得观察被映射到潜在空间,并通过与规定的潜在演变的拟合优度测试进行评估。这产生了一个原则性的决策规则,即使在高观察似然性区域也能保持有效。对合成和真实世界时间序列的实验表明,在频率、幅度和观察噪声方面可靠地检测异常,同时提供可解释的模型合规性诊断。
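The compliance test can be sketched for prescribed AR(1) latent dynamics z_t = a * z_{t-1} + eps with eps ~ N(0, 1): innovations recovered from a compliant trajectory should be standard normal, and a goodness-of-fit statistic on them replaces thresholding the observation likelihood. The flow mapping observations to latents is omitted here; we test latent trajectories directly, and the KS statistic is one choice of test among several:

```python
import math
import numpy as np

def ks_statistic_normal(x):
    """One-sample Kolmogorov-Smirnov statistic against the standard normal
    CDF (small value = good fit)."""
    xs = np.sort(np.asarray(x))
    n = len(xs)
    cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in xs])
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

def compliance_statistic(z, a):
    """Recover innovations under the prescribed dynamics and measure how far
    they deviate from the prescribed N(0, 1) distribution."""
    innovations = z[1:] - a * z[:-1]
    return ks_statistic_normal(innovations)

# A compliant AR(1) trajectory versus an amplitude anomaly (scaled by 3):
rng = np.random.default_rng(0)
a = 0.9
z = np.zeros(2000)
for t in range(1, 2000):
    z[t] = a * z[t - 1] + rng.normal()
good = compliance_statistic(z, a)      # small: dynamics respected
bad = compliance_statistic(3 * z, a)   # large: amplitude violates dynamics
```

Note that the scaled trajectory can still have high marginal likelihood pointwise; it is the violation of the prescribed latent dynamics that the test detects, which is the abstract's structural argument.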
cs.AI / 40 / 2603.11767
Understanding Wikidata Qualifiers: An Analysis and Taxonomy
理解Wikidata限定符:分析与分类
Abstract
This paper presents an in-depth analysis of Wikidata qualifiers, focusing on their semantics and actual usage, with the aim of developing a taxonomy that addresses the challenges of selecting appropriate qualifiers, querying the graph, and making logical inferences. The study evaluates qualifier importance based on frequency and diversity, using a modified Shannon entropy index to account for the "long tail" phenomenon. By analyzing a Wikidata dump, the top 300 qualifiers were selected and categorized into a refined taxonomy that includes contextual, epistemic/uncertainty, structural, and additional qualifiers. The taxonomy aims to guide contributors in creating and querying statements, improve qualifier recommendation systems, and enhance knowledge graph design methodologies. The results show that the taxonomy effectively covers the most important qualifiers and provides a structured approach to understanding and utilizing qualifiers in Wikidata.
Chinese Translation
本文对Wikidata限定符进行了深入分析,重点关注其语义和实际使用,旨在开发一个分类体系,以应对选择适当限定符、查询图谱和进行逻辑推理的挑战。研究基于频率和多样性评估限定符的重要性,使用修改后的香农熵指数来考虑“长尾”现象。通过分析Wikidata数据转储,选取了前300个限定符,并将其分类为一个精细化的分类体系,包括上下文限定符、认识论/不确定性限定符、结构限定符和其他限定符。该分类体系旨在指导贡献者创建和查询陈述,改善限定符推荐系统,并增强知识图谱设计方法论。结果表明,该分类体系有效覆盖了最重要的限定符,并提供了一种结构化的方法来理解和利用Wikidata中的限定符。
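The importance measure the abstract describes combines frequency with diversity of use. A minimal sketch, assuming one simple normalization (entropy over its maximum) as a stand-in for the paper's modified Shannon entropy index; the property IDs and the exact combination rule are illustrative:

```python
import math
from collections import Counter

def qualifier_score(usage_counts):
    """usage_counts: Counter mapping property id -> times the qualifier appears.
    Score = total frequency weighted by normalized Shannon entropy (diversity)."""
    total = sum(usage_counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in usage_counts.values()]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Normalized entropy in [0, 1]: even spread -> 1, single-property use -> 0.
    max_entropy = math.log(len(usage_counts)) if len(usage_counts) > 1 else 1.0
    return total * (entropy / max_entropy)

broad = Counter({"P585": 50, "P580": 45, "P582": 55})   # even use across 3 properties
narrow = Counter({"P585": 148, "P580": 1, "P582": 1})   # same total, long-tail skew
```

Under this rule the two qualifiers have identical raw frequency, but the evenly used one scores far higher, which is the long-tail correction the index is meant to capture.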
cs.AI / 41 / 2603.11768
Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework
管理大型语言模型代理中的演变记忆:风险、机制及稳定性与安全性治理记忆框架(SSGM)
Abstract
Long-term memory has emerged as a foundational component of autonomous Large Language Model (LLM) agents, enabling continuous adaptation, lifelong multimodal learning, and sophisticated reasoning. However, as memory systems transition from static retrieval databases to dynamic, agentic mechanisms, critical concerns regarding memory governance, semantic drift, and privacy vulnerabilities have surfaced. While recent surveys have focused extensively on memory retrieval efficiency, they largely overlook the emergent risks of memory corruption in highly dynamic environments. To address these emerging challenges, we propose the Stability and Safety-Governed Memory (SSGM) framework, a conceptual governance architecture. SSGM decouples memory evolution from execution by enforcing consistency verification, temporal decay modeling, and dynamic access control prior to any memory consolidation. Through formal analysis and architectural decomposition, we show how SSGM can mitigate topology-induced knowledge leakage where sensitive contexts are solidified into long-term storage, and help prevent semantic drift where knowledge degrades through iterative summarization. Ultimately, this work provides a comprehensive taxonomy of memory corruption risks and establishes a robust governance paradigm for deploying safe, persistent, and reliable agentic memory systems.
Chinese Translation
长期记忆已成为自主大型语言模型(LLM)代理的基础组成部分,使其能够实现持续适应、终身多模态学习和复杂推理。然而,随着记忆系统从静态检索数据库转变为动态的代理机制,关于记忆治理、语义漂移和隐私漏洞的关键问题开始浮现。尽管近期的调查广泛关注记忆检索效率,但它们在很大程度上忽视了在高度动态环境中记忆腐败的潜在风险。为应对这些新出现的挑战,我们提出了稳定性与安全性治理记忆(SSGM)框架,这是一种概念性治理架构。SSGM通过在任何记忆整合之前强制执行一致性验证、时间衰减建模和动态访问控制,将记忆演变与执行解耦。通过形式分析和架构分解,我们展示了SSGM如何减轻由拓扑引起的知识泄漏,其中敏感上下文被固化到长期存储中,并帮助防止知识通过迭代总结而退化的语义漂移。最终,这项工作提供了记忆腐败风险的全面分类,并建立了一个强大的治理范式,以部署安全、持久和可靠的代理记忆系统。
cs.AI / 42 / 2603.11770
An Automatic Text Classification Method Based on Hierarchical Taxonomies, Neural Networks and Document Embedding: The NETHIC Tool
基于层次分类法、神经网络和文档嵌入的自动文本分类方法:NETHIC工具
Abstract
This work describes an automatic text classification method implemented in a software tool called NETHIC, which combines the inner capabilities of highly scalable neural networks with the expressiveness of hierarchical taxonomies. NETHIC thereby provides a text classification mechanism that is both significantly effective and efficient. The tool was evaluated on both a generic and a domain-specific corpus, yielding promising results. Building on this experimentation, NETHIC has now been further refined and extended with a document embedding mechanism, which has shown performance improvements both on the individual networks and on the whole hierarchical model.
Chinese Translation
本研究描述了一种自动文本分类方法,该方法在名为NETHIC的软件工具中实现,利用了高度可扩展的神经网络的内在能力与层次分类法的表达能力相结合。因此,NETHIC成功地提供了一种文本分类机制,证明其在有效性和效率上都显著优越。该工具经过了针对通用和特定领域语料库的实验过程,输出了令人鼓舞的结果。在此实验的基础上,NETHIC现在进一步进行了优化和扩展,增加了文档嵌入机制,这在个别网络和整个层次模型的性能上均显示出改善。
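Classification over a hierarchical taxonomy of the kind NETHIC exploits can be sketched as a top-down descent, one classifier per internal node. The taxonomy, labels, and per-node scorers below are hypothetical stand-ins for the trained per-level networks:

```python
# Illustrative taxonomy: each internal node lists its children.
taxonomy = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": [],
    "physics": [],
    "biology": [],
}

def classify(embedding, node, scorers):
    """Descend the taxonomy, picking the best-scoring child at each level."""
    path = [node]
    while taxonomy[node]:
        node = max(taxonomy[node], key=lambda child: scorers[child](embedding))
        path.append(node)
    return path

# Toy per-node scorers in place of trained neural networks: each reads one
# coordinate of a (hypothetical) document embedding.
scorers = {
    "science": lambda e: e[0], "sports": lambda e: e[1],
    "physics": lambda e: e[2], "biology": lambda e: e[3],
}
path = classify([0.9, 0.1, 0.2, 0.8], "root", scorers)
```

The returned path is the chain of taxonomy nodes, so each level's network only discriminates among siblings rather than over the full label set.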
cs.AI / 43 / 2603.11781
From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
从辩论到审议:具有类型化认知行为的结构化集体推理
Abstract
Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task-dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single-agent tokens, and single-agent generation outperforms DCI on overall quality. DCI's contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.
Chinese Translation
多智能体大语言模型(LLM)系统越来越多地处理复杂推理,但它们的交互模式仍限于投票、非结构化辩论或管道编排。没有任何模型能够进行审议:这是一个阶段性过程,其中不同的参与者交换类型化的推理行为,保留分歧,并在可问责的结果上达成共识。我们提出了审议集体智能(Deliberative Collective Intelligence, DCI),具体定义了四种推理原型、14种类型化认知行为、一个共享工作空间,以及DCI-CF,一个收敛流算法,保证以包含所选选项、剩余异议、少数报告和重新开启条件的结构化决策包终止。我们在七个领域的45个任务上进行了评估,使用Gemini 2.5 Flash。在非例行任务(n=40)中,DCI显著优于非结构化辩论(+0.95,95%置信区间 [+0.41, +1.54])。DCI在需要视角整合的隐性特征任务上表现出色(9.56,是任何系统在任何领域中的最高分),而在例行决策上表现不佳(5.39),确认了任务的依赖性。DCI生成100%的结构化决策包和98%的少数报告,而这些在所有基线中均未出现。然而,DCI消耗的单智能体令牌约为62倍,且单智能体生成在整体质量上优于DCI。DCI的贡献不在于更多的智能体更好,而在于当过程的问责性证明成本是合理时,重要决策受益于审议结构。
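The structured decision packet that DCI-CF guarantees on termination can be sketched as a plain record. The field names and trigger logic here are illustrative, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionPacket:
    """Outcome of a deliberation: the choice plus preserved disagreement."""
    selected_option: str
    residual_objections: list = field(default_factory=list)
    minority_report: str = ""
    reopen_conditions: list = field(default_factory=list)

    def should_reopen(self, observed_events):
        """Reopen the deliberation if any trigger condition has occurred."""
        return any(c in observed_events for c in self.reopen_conditions)

packet = DecisionPacket(
    selected_option="deploy-with-monitoring",
    residual_objections=["latency risk unquantified"],
    minority_report="One archetype preferred a staged rollout.",
    reopen_conditions=["latency-sla-breach", "new-benchmark-data"],
)
```

The point of the structure is accountability: unlike a bare vote, the packet carries the losing position and the conditions under which the decision is no longer final.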
cs.AI / 44 / 2603.11798
DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering
DocSage:一种用于多文档多实体问答的信息结构化代理
Abstract
Multi-document Multi-entity Question Answering inherently requires models to track implicit logic between multiple entities across scattered documents. However, existing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks suffer from critical limitations: standard RAG's vector similarity-based coarse-grained retrieval often omits critical facts, graph-based RAG fails to efficiently integrate fragmented complex relationship networks, and both lack schema awareness, leading to inadequate cross-document evidence chain construction and inaccurate entity relationship deduction. To address these challenges, we propose DocSage, an end-to-end agentic framework that integrates dynamic schema discovery, structured information extraction, and schema-aware relational reasoning with error guarantees. DocSage operates through three core modules: (1) A schema discovery module dynamically infers query-specific minimal joinable schemas to capture essential entities and relationships; (2) An extraction module transforms unstructured text into semantically coherent relational tables, enhanced by error-aware correction mechanisms to reduce extraction errors; (3) A reasoning module performs multi-hop relational reasoning over structured tables, leveraging schema awareness to efficiently align cross-document entities and aggregate evidence. This agentic design offers three key advantages: precise fact localization via SQL-powered indexing, natural support for cross-document entity joins through relational tables, and mitigated LLM attention diffusion via structured representation. Evaluations on two MDMEQA benchmarks demonstrate that DocSage significantly outperforms state-of-the-art long-context LLMs and RAG systems, achieving accuracy improvements of more than 27% on each.
Chinese Translation
多文档多实体问答本质上要求模型跟踪散布在多个文档中的多个实体之间的隐含逻辑。然而,现有的大型语言模型(LLMs)和增强检索生成(RAG)框架存在严重的局限性:标准RAG的基于向量相似度的粗粒度检索常常遗漏关键事实,基于图的RAG未能有效整合碎片化的复杂关系网络,且两者均缺乏模式意识,导致跨文档证据链构建不足和实体关系推断不准确。为了解决这些挑战,我们提出了DocSage,这是一种端到端的代理框架,集成了动态模式发现、结构化信息提取和具有错误保证的模式感知关系推理。DocSage通过三个核心模块运行:(1)模式发现模块动态推断查询特定的最小可连接模式,以捕捉基本实体和关系;(2)提取模块将非结构化文本转化为语义连贯的关系表,并通过错误感知纠正机制减少提取错误;(3)推理模块在结构化表上执行多跳关系推理,利用模式意识有效对齐跨文档实体并聚合证据。这种代理设计提供了三个关键优势:通过SQL驱动的索引实现精确的事实定位,通过关系表自然支持跨文档实体连接,以及通过结构化表示减轻LLM注意力扩散。在两个MDMEQA基准上的评估表明,DocSage显著优于最先进的长上下文LLMs和RAG系统,分别实现了超过27%的准确率提升。
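The "SQL-powered indexing" advantage can be made concrete with a toy example: once facts from different documents sit in relational tables, a cross-document multi-hop question becomes a join rather than long-context reasoning. Table and column names are illustrative, not DocSage's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE doc1_roles(person TEXT, company TEXT);  -- extracted from document 1
CREATE TABLE doc2_hq(company TEXT, city TEXT);        -- extracted from document 2
INSERT INTO doc1_roles VALUES ('Alice', 'Acme'), ('Bob', 'Globex');
INSERT INTO doc2_hq VALUES ('Acme', 'Berlin'), ('Globex', 'Tokyo');
""")

# Multi-hop question: "In which city is each person's employer headquartered?"
rows = conn.execute("""
    SELECT r.person, h.city
    FROM doc1_roles r JOIN doc2_hq h ON r.company = h.company
    ORDER BY r.person
""").fetchall()
```

The join key (`company`) is exactly the cross-document entity alignment the abstract attributes to schema awareness; the LLM's job shifts to discovering that minimal joinable schema and filling the tables.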
cs.AI / 45 / 2603.11802
A Semi-Decentralized Approach to Multiagent Control
一种半去中心化的多智能体控制方法
Abstract
We introduce an expressive framework and algorithms for the semi-decentralized control of cooperative agents in environments with communication uncertainty. Whereas semi-Markov control admits a distribution over time for agent actions, semi-Markov communication, or what we refer to as semi-decentralization, gives a distribution over time for what actions and observations agents can store in their histories. We extend semi-decentralization to the partially observable Markov decision process (POMDP). The resulting SDec-POMDP unifies decentralized and multiagent POMDPs and several existing explicit communication mechanisms. We present recursive small-step semi-decentralized A* (RS-SDA*), an exact algorithm for generating optimal SDec-POMDP policies. RS-SDA* is evaluated on semi-decentralized versions of several standard benchmarks and a maritime medical evacuation scenario. This paper provides a well-defined theoretical foundation for exploring many classes of multiagent communication problems through the lens of semi-decentralization.
Chinese Translation
我们提出了一种表达丰富的框架和算法,用于在具有通信不确定性的环境中对合作智能体进行半去中心化控制。半马尔可夫控制允许智能体动作在时间上存在分布,而半马尔可夫通信,或我们所称的半去中心化,则为智能体可以在其历史中存储的动作和观察提供了时间上的分布。我们将半去中心化扩展到部分可观测马尔可夫决策过程(POMDP)。由此产生的SDec-POMDP统一了去中心化和多智能体POMDP以及几种现有的显式通信机制。我们提出了递归小步半去中心化A*(RS-SDA*),这是一种生成最优SDec-POMDP策略的精确算法。RS-SDA*在几种标准基准的半去中心化版本和一个海上医疗撤离场景中进行了评估。本文为通过半去中心化的视角探索多类多智能体通信问题提供了明确的理论基础。
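What semi-decentralization changes relative to a standard Dec-POMDP can be sketched as delayed history delivery: an agent's actions and observations enter its usable history only after a sampled communication delay. The delay model below is an illustrative stand-in for the semi-Markov distribution:

```python
def delivered_history(events, horizon, delay_sampler):
    """events: list of (t_emitted, payload). Returns events visible at `horizon`,
    i.e. those whose sampled arrival time t_emitted + delay has elapsed."""
    visible = []
    for t, payload in events:
        if t + delay_sampler() <= horizon:
            visible.append((t, payload))
    return visible

events = [(0, "obs-a"), (1, "act-b"), (2, "obs-c")]
# A degenerate zero-delay sampler recovers the fully decentralized history;
# a large fixed delay leaves the history empty at the same horizon.
full = delivered_history(events, horizon=2, delay_sampler=lambda: 0)
late = delivered_history(events, horizon=2, delay_sampler=lambda: 5)
```

Policies must therefore be optimal with respect to what has *arrived*, not what has happened, which is the source of the added planning difficulty.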
cs.AI / 46 / 2603.11808
Automating Skill Acquisition through Large-Scale Mining of Open-Source Agentic Repositories: A Framework for Multi-Agent Procedural Knowledge Extraction
通过大规模挖掘开源智能库自动化技能获取:多智能体程序知识提取框架
Abstract
The transition from monolithic large language models (LLMs) to modular, skill-equipped agents represents a fundamental architectural shift in artificial intelligence deployment. While general-purpose models demonstrate remarkable breadth in declarative knowledge, their utility in autonomous workflows is frequently constrained by insufficient specialized procedural expertise. This report investigates a systematic framework for automated acquisition of high-quality agent skills through mining of open-source repositories on platforms such as GitHub. We focus on the extraction of visualization and educational capabilities from state-of-the-art systems including TheoremExplainAgent and Code2Video, both utilizing the Manim mathematical animation engine. The framework encompasses repository structural analysis, semantic skill identification through dense retrieval, and translation to the standardized SKILL.md format. We demonstrate that systematic extraction from agentic repositories, combined with rigorous security governance and multi-dimensional evaluation metrics, enables scalable acquisition of procedural knowledge that augments LLM capabilities without requiring model retraining. Our analysis reveals that agent-generated educational content can achieve 40% gains in knowledge transfer efficiency while maintaining pedagogical quality comparable to human-crafted tutorials.
Chinese Translation
从单体大型语言模型(LLMs)向模块化、具备技能的智能体的过渡,代表了人工智能部署中的一种根本性架构转变。尽管通用模型在陈述性知识方面表现出显著的广度,但它们在自主工作流程中的实用性常常受到专业程序知识不足的限制。本报告探讨了一种通过挖掘如GitHub等平台上的开源库,自动获取高质量智能体技能的系统框架。我们重点关注从包括TheoremExplainAgent和Code2Video在内的最先进系统中提取可视化和教育能力,这两个系统均利用Manim数学动画引擎。该框架包括库结构分析、通过密集检索进行语义技能识别,以及转换为标准化的SKILL.md格式。我们证明,从智能体库的系统提取,结合严格的安全治理和多维评估指标,能够实现可扩展的程序知识获取,从而增强LLM的能力,而无需对模型进行重新训练。我们的分析表明,智能体生成的教育内容在知识转移效率上可实现40%的提升,同时保持与人类制作的教程相当的教学质量。
cs.AI / 47 / 2603.11816
VisiFold: Long-Term Traffic Forecasting via Temporal Folding Graph and Node Visibility
VisiFold:通过时间折叠图和节点可见性进行长期交通预测
Abstract
Traffic forecasting is a cornerstone of intelligent transportation systems. While existing research has made significant progress in short-term prediction, long-term forecasting remains a largely uncharted and challenging frontier. Extending the prediction horizon intensifies two critical issues: escalating computational resource consumption and increasingly complex spatial-temporal dependencies. Current approaches, which rely on spatial-temporal graphs and process temporal and spatial dimensions separately, suffer from snapshot-stacking inflation and cross-step fragmentation. To overcome these limitations, we propose VisiFold. Our framework introduces a novel temporal folding graph that consolidates a sequence of temporal snapshots into a single graph. Furthermore, we present a node visibility mechanism that incorporates node-level masking and subgraph sampling to overcome the computational bottleneck imposed by large node counts. Extensive experiments show that VisiFold not only drastically reduces resource consumption but also outperforms existing baselines in long-term forecasting tasks. Remarkably, even with a high mask ratio of 80%, VisiFold maintains its performance advantage. By effectively breaking the resource constraints in both temporal and spatial dimensions, our work paves the way for more realistic long-term traffic forecasting. The code is available at https://github.com/PlanckChang/VisiFold.
Chinese Translation
交通预测是智能交通系统的基石。尽管现有研究在短期预测方面取得了显著进展,但长期预测仍然是一个未被充分探索且充满挑战的领域。延长预测时间范围加剧了两个关键问题:计算资源消耗的增加和日益复杂的时空依赖关系。目前的方法依赖于时空图,并分别处理时间和空间维度,面临快照堆叠膨胀和跨步碎片化的问题。为了解决这些局限性,我们提出了VisiFold。我们的框架引入了一种新颖的时间折叠图,将一系列时间快照整合为一个单一图形。此外,我们提出了一种节点可见性机制,结合节点级掩蔽和子图采样,以克服大节点数量带来的计算瓶颈。大量实验表明,VisiFold不仅显著降低了资源消耗,还在长期预测任务中超越了现有基准。值得注意的是,即使在80%的高掩蔽比率下,VisiFold仍保持其性能优势。通过有效打破时空维度上的资源限制,我们的工作为更现实的长期交通预测铺平了道路。代码可在 https://github.com/PlanckChang/VisiFold 获取。
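The two mechanisms the abstract names can be sketched directly: (1) folding T snapshots into one static graph by replicating each node per time step and wiring it to its own next-step copy, and (2) sampling a visible subset of nodes under a mask ratio. The edge-list representation and the sampling scheme are illustrative simplifications:

```python
import random

def fold_snapshots(snapshots, num_nodes):
    """snapshots: list of per-step edge lists [(u, v), ...].
    Node i at step t becomes node t*num_nodes + i in the folded graph."""
    edges = []
    for t, snap in enumerate(snapshots):
        offset = t * num_nodes
        # Spatial edges within the time step.
        edges += [(offset + u, offset + v) for u, v in snap]
        # Temporal edges linking each node to its next-step copy.
        if t + 1 < len(snapshots):
            edges += [(offset + i, offset + num_nodes + i) for i in range(num_nodes)]
    return edges

def visible_nodes(num_nodes, mask_ratio, seed=0):
    """Sample the unmasked node ids (applied before message passing)."""
    rng = random.Random(seed)
    keep = max(1, round(num_nodes * (1 - mask_ratio)))
    return sorted(rng.sample(range(num_nodes), keep))

folded = fold_snapshots([[(0, 1)], [(1, 2)]], num_nodes=3)
```

Folding removes the per-snapshot stacking (one graph instead of T), while the visibility mask caps the node count that any single pass must process, which is where the abstract's 80% mask-ratio result comes from.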
cs.AI / 48 / 2603.11818
Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI
基于深度学习模型和可解释人工智能的卵巢恶性病变自动检测
Abstract
Cancer is the unrestrained proliferation of malignant cells. In recent times, medical professionals have been steadily acquiring enhanced diagnostic and treatment abilities by applying deep learning models to medical data for better clinical decisions, disease diagnosis, and drug discovery. A majority of cancers are studied and treated with these technologies. However, ovarian cancer remains a dilemma, as its non-invasive detection procedures are inaccurate while accurate detection requires a time-consuming, invasive procedure. Thus, in this research, several Convolutional Neural Networks such as LeNet-5, ResNet, VGGNet and GoogLeNet/Inception have been utilized to develop 15 variants and choose a model that accurately detects and identifies ovarian cancer. For effective model training, the dataset OvarianCancer&SubtypesDatasetHistopathology from Mendeley has been used. After constructing a model, we utilized Explainable Artificial Intelligence (XAI) methods such as LIME, Integrated Gradients and SHAP to explain the black-box outcome of the selected model. For evaluating the performance of the model, Accuracy, Precision, Recall, F1-Score, ROC Curve and AUC have been used. The evaluation showed that the slightly compact InceptionV3 model with ReLU achieved the overall best result, attaining an average score of 94% across all the performance metrics on the augmented dataset. Lastly, the three aforementioned XAI methods were used for an overall comparative analysis. It is the aim of this research that its contributions will help in achieving a better detection method for ovarian cancer.
Chinese Translation
恶性细胞的无限增殖即为癌症。近年来,医疗专业人员通过实施深度学习模型来分析医疗数据,从而不断提升诊断和治疗能力,以便更好地进行临床决策、疾病诊断和药物发现。大多数癌症的研究和治疗都融入了这些技术。然而,卵巢癌仍然是一个难题,因为其非侵入性检测程序不准确,而准确检测则需要耗时的侵入性程序。因此,在本研究中,利用了多种卷积神经网络(Convolutional Neural Networks),如LeNet-5、ResNet、VGGNet和GoogLeNet/Inception,开发了15个变体,并选择出一个能够准确检测和识别卵巢癌的模型。为了有效地训练模型,使用了来自Mendeley的OvarianCancer&SubtypesDatasetHistopathology数据集。在构建模型后,我们利用可解释人工智能(Explainable Artificial Intelligence, XAI)模型,如LIME、集成梯度(Integrated Gradients)和SHAP,来解释所选模型的黑箱结果。为了评估模型的性能,使用了准确率(Accuracy)、精确率(Precision)、召回率(Recall)、F1分数(F1-Score)、ROC曲线和AUC。从评估结果来看,稍微紧凑的InceptionV3模型与ReLu组合在所有性能指标的增强数据集中取得了94%的平均得分,表现最佳。最后,对于可解释人工智能,使用了上述三种XAI进行整体比较分析。本研究的目标是希望本研究的贡献能够帮助实现更好的卵巢癌检测方法。
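The evaluation metrics the abstract lists (Accuracy, Precision, Recall, F1-Score) all derive from the same confusion counts. A minimal sketch for the binary malignant-vs-benign case, with illustrative labels and predictions:

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, 1 = malignant."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

For medical screening, recall (missed malignancies) typically matters more than raw accuracy, which is why the abstract reports all four metrics rather than accuracy alone.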
cs.AI / 49 / 2603.11863
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
CreativeBench:通过自我演化挑战对机器创造力进行基准测试和增强
Abstract
The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets -- CreativeBench-Combo and CreativeBench-Explore -- the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.
Chinese Translation
高质量预训练数据的饱和使得研究重点转向能够持续生成新颖工件的演化系统,这导致了AlphaEvolve的成功。然而,这类系统的进展受到缺乏严格、定量评估的制约。为了解决这一挑战,我们引入了CreativeBench,一个基于经典认知框架的代码生成机器创造力评估基准。该基准由两个子集组成——CreativeBench-Combo和CreativeBench-Explore,旨在通过利用逆向工程和自我对弈的自动化流程来针对组合创造力和探索创造力。通过利用可执行代码,CreativeBench通过一个统一的度量标准客观地区分创造力与幻觉,该度量标准定义为质量和新颖性的乘积。我们对最先进模型的分析揭示了不同的行为:(1)规模显著提高了组合创造力,但对探索的收益递减;(2)较大的模型表现出“通过规模收敛”,变得更准确但更少发散;(3)推理能力主要有利于受限探索而非组合。最后,我们提出了EvoRePE,一种即插即用的推理时引导策略,内化演化搜索模式,以持续增强机器创造力。
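The unified metric has a useful property worth making explicit: as a product of quality and novelty (each assumed here to lie in [0, 1]), both a hallucinated artifact (novel but broken) and a memorized one (correct but derivative) score near zero. A minimal sketch with illustrative scores:

```python
def creativity(quality, novelty):
    """Unified metric: an artifact is creative only if both correct and new."""
    return quality * novelty

memorized = creativity(quality=0.95, novelty=0.05)     # correct but derivative
hallucinated = creativity(quality=0.05, novelty=0.95)  # novel but non-executing
creative = creativity(quality=0.9, novelty=0.8)        # both
```

Quality can be grounded objectively because the artifacts are executable code (tests pass or they do not), which is what lets the benchmark separate creativity from hallucination.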
cs.AI / 50 / 2603.11864
Social, Legal, Ethical, Empathetic and Cultural Norm Operationalisation for AI Agents
人工智能代理的社会、法律、伦理、同理心和文化规范的操作化
Abstract
As AI agents are increasingly used in high-stakes domains like healthcare and law enforcement, aligning their behaviour with social, legal, ethical, empathetic, and cultural (SLEEC) norms has become a critical engineering challenge. While international frameworks have established high-level normative principles for AI, a significant gap remains in translating these abstract principles into concrete, verifiable requirements. To address this gap, we propose a systematic SLEEC-norm operationalisation process for determining, validating, implementing, and verifying normative requirements. Furthermore, we survey the landscape of methods and tools supporting this process, and identify key remaining challenges and research avenues for addressing them. We thus establish a framework - and define a research and policy agenda - for developing AI agents that are not only functionally useful but also demonstrably aligned with human norms and values.
Chinese Translation
随着人工智能代理在医疗和执法等高风险领域的日益应用,使其行为与社会、法律、伦理、同理心和文化(SLEEC)规范保持一致已成为一个关键的工程挑战。尽管国际框架已建立了人工智能的高层次规范原则,但在将这些抽象原则转化为具体的、可验证的要求方面仍存在显著的差距。为了解决这一差距,我们提出了一种系统的SLEEC规范操作化过程,用于确定、验证、实施和验证规范要求。此外,我们调查了支持这一过程的方法和工具的现状,并识别出解决这些问题的关键挑战和研究方向。因此,我们建立了一个框架,并定义了一个研究和政策议程,以开发不仅在功能上有用而且在可证明上与人类规范和价值观一致的人工智能代理。
cs.AI / 51 / 2603.11873
AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization
AdaFuse:通过令牌级预门控和融合内核优化加速动态适配器推理
Abstract
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.
Chinese Translation
将动态稀疏结构(如专家混合模型 Mixture-of-Experts, MoE)与参数高效的适配器(例如 LoRA)相结合是一种增强大型语言模型(Large Language Models, LLMs)的强大技术。然而,这种架构增强带来了高昂的代价:尽管计算负载的增加微乎其微,推理延迟却常常飙升,导致解码速度降低超过 2.5 倍。通过细致的性能分析,我们确定主要瓶颈不在于计算本身,而在于传统动态路由所需的碎片化、顺序 CUDA 内核启动带来的严重开销。为了解决这一挑战,我们提出了 AdaFuse,这是一个基于算法与底层硬件系统紧密协同设计的框架,以实现高效的动态适配器执行。与传统的层级或块级路由不同,AdaFuse 采用令牌级预门控策略,在处理令牌之前为所有适配器层做出单一的全局路由决策。这种“决定一次,应用到处”的方法有效地静态化了每个令牌的执行路径,为整体优化创造了机会。我们通过开发一个自定义的 CUDA 内核来利用这一点,该内核执行融合切换操作,将所有选定的 LoRA 适配器的参数合并到主干模型中,以单次高效的方式进行处理。在流行的开源 LLM 上的实验结果表明,AdaFuse 在准确性上与最先进的动态适配器相当,同时将解码延迟大幅降低超过 2.4 倍,从而弥合了模型能力与推理效率之间的差距。
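The "decide-once, apply-everywhere" step can be sketched numerically: given one global routing decision per token, every selected LoRA adapter is merged into the backbone weight in a single pass, W' = W + sum_i s_i * (B_i @ A_i), instead of launching one kernel per adapter per layer. The pure-Python matrices below stand in for the fused CUDA kernel:

```python
def matmul(a, b):
    """Naive matrix product, standing in for a fused GPU kernel."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def fuse_adapters(w, selected):
    """selected: list of (scale, B, A) chosen once by the token-level pre-gate.
    Returns W + sum_i scale_i * (B_i @ A_i) without mutating W."""
    out = [row[:] for row in w]
    for scale, b, a in selected:
        delta = matmul(b, a)  # low-rank update B @ A
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += scale * delta[i][j]
    return out

w = [[1.0, 0.0], [0.0, 1.0]]          # 2x2 backbone weight
adapter = (0.5, [[1.0], [2.0]], [[3.0, 4.0]])  # rank-1: B is 2x1, A is 1x2
w_fused = fuse_adapters(w, [adapter])
```

Because the routing decision is fixed before the token is processed, this merge happens once rather than at every layer, which is the latency saving the abstract attributes to the fused kernel.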
cs.AI / 52 / 2603.11936
Fair Learning for Bias Mitigation and Quality Optimization in Paper Recommendation
公平学习用于论文推荐中的偏见缓解与质量优化
Abstract
Despite the widespread use of double-blind review, demographic biases still disadvantage underrepresented author groups. We present Fair-PaperRec, a MultiLayer Perceptron (MLP)-based model that addresses demographic disparities in post-review paper acceptance decisions while maintaining high-quality requirements. Our methodology penalizes demographic disparities while preserving quality through intersectional criteria (e.g., race, country) and a customized fairness loss, in contrast to heuristic approaches. Evaluations using conference data from the ACM Special Interest Group on Computer-Human Interaction (SIGCHI), Designing Interactive Systems (DIS), and Intelligent User Interfaces (IUI) indicate a 42.03% increase in underrepresented group participation and a 3.16% improvement in overall utility, indicating that diversity promotion does not compromise academic rigor and supports equity-focused peer review solutions.
Chinese Translation
尽管频繁进行双盲评审,作者的人口统计偏见仍然使得代表性不足的群体处于不利地位。我们提出了Fair-PaperRec,一种基于多层感知器(MultiLayer Perceptron, MLP)的模型,旨在解决后评审论文接受决策中的人口统计差异,同时保持高质量要求。我们的方法通过交叉标准(例如,种族、国家)和定制的公平损失来惩罚人口统计差异,同时保持质量,这与启发式方法形成对比。使用来自ACM计算机-人类交互特别兴趣小组(SIGCHI)、设计互动系统(DIS)和智能用户界面(IUI)的会议数据进行的评估表明,代表性不足群体的参与率提高了42.03%,整体效用改善了3.16%,这表明促进多样性并未妨碍学术严谨性,并支持以公平为重点的同行评审解决方案。
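A fairness loss of the kind the abstract describes can be sketched as a task loss penalized by the acceptance-rate gap across intersectional groups. The penalty form, group keys, and numbers below are illustrative, not the paper's exact objective:

```python
def disparity(accept_probs, groups):
    """Max gap between per-group mean acceptance probabilities."""
    by_group = {}
    for p, g in zip(accept_probs, groups):
        by_group.setdefault(g, []).append(p)
    means = [sum(v) / len(v) for v in by_group.values()]
    return max(means) - min(means)

def fair_loss(task_loss, accept_probs, groups, lam=1.0):
    """Quality objective plus a weighted intersectional-disparity penalty."""
    return task_loss + lam * disparity(accept_probs, groups)

probs = [0.9, 0.8, 0.2, 0.3]
# Intersectional groups: (race, country) tuples, as in the abstract's criteria.
groups = [("white", "US"), ("white", "US"), ("black", "NG"), ("black", "NG")]
```

The lambda weight trades off utility against disparity; the abstract's result is that a suitable trade-off raises underrepresented participation without hurting overall utility.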
cs.AI / 53 / 2603.11938
Prototype-Based Knowledge Guidance for Fine-Grained Structured Radiology Reporting
基于原型的知识指导用于细粒度结构化放射学报告
Abstract
Structured radiology reporting promises faster, more consistent communication than free text, but automation remains difficult as models must make many fine-grained, discrete decisions about rare findings and attributes from limited structured supervision. In contrast, free-text reports are produced at scale in routine care and implicitly encode fine-grained, image-linked information through detailed descriptions. To leverage this unstructured knowledge, we propose ProtoSR, an approach for injecting free-text information into structured report population. First, we introduce an automatic extraction pipeline that uses an instruction-tuned LLM to mine 80k+ MIMIC-CXR studies and build a multimodal knowledge base aligned with a structured reporting template, representing each answer option with a visual prototype. Using this knowledge base, ProtoSR is trained to retrieve prototypes relevant for the current image-question pair and augment the model predictions through a prototype-conditioned residual, providing a data-driven second opinion that selectively corrects predictions. On the Rad-ReStruct benchmark, ProtoSR achieves state-of-the-art results, with the largest improvements on detailed attribute questions, demonstrating the value of integrating free-text derived signal for fine-grained image understanding.
Chinese Translation
结构化放射学报告相比于自由文本提供了更快速、更一致的沟通,但由于模型必须在有限的结构化监督下对稀有发现和属性做出许多细粒度的离散决策,因此自动化仍然困难。相对而言,自由文本报告在常规护理中大规模生成,并通过详细描述隐含编码了细粒度的图像关联信息。为了利用这种非结构化知识,我们提出了ProtoSR,一种将自由文本信息注入结构化报告生成的方法。首先,我们介绍了一个自动提取管道,该管道使用经过指令调优的LLM(大型语言模型)从80,000多项MIMIC-CXR研究中挖掘数据,并构建与结构化报告模板对齐的多模态知识库,使用视觉原型表示每个答案选项。利用这个知识库,ProtoSR被训练以检索与当前图像-问题对相关的原型,并通过原型条件残差增强模型预测,提供一种数据驱动的第二意见,选择性地纠正预测。在Rad-ReStruct基准测试中,ProtoSR取得了最先进的结果,在详细属性问题上实现了最大的改进,展示了整合自由文本衍生信号以实现细粒度图像理解的价值。
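The prototype-conditioned "second opinion" can be sketched as nearest-prototype retrieval followed by a residual on the answer logits. Embeddings, prototypes, and the residual weight below are hypothetical; ProtoSR learns these components rather than hard-coding them:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prototype_residual(query, prototypes, logits, weight=1.0):
    """prototypes: list of (embedding, answer_index) from the knowledge base.
    Retrieve the prototype most similar to the image-question embedding and
    nudge the corresponding answer logit by the similarity (the residual)."""
    emb, answer = max(prototypes, key=lambda p: cosine(query, p[0]))
    out = logits[:]
    out[answer] += weight * cosine(query, emb)  # data-driven correction
    return out

# Two answer options, each represented by a visual prototype.
protos = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
corrected = prototype_residual([0.1, 0.9], protos, logits=[0.2, 0.1])
```

The residual is selective: when the query is far from every prototype the similarity is small and the base prediction dominates, matching the abstract's description of a second opinion that corrects rather than overrides.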
cs.AI / 54 / 2603.11950
Learning Transferable Sensor Models via Language-Informed Pretraining
通过语言信息预训练学习可转移的传感器模型
Abstract
Modern sensing systems generate large volumes of unlabeled multivariate time-series data. This abundance of unlabeled data makes self-supervised learning (SSL) a natural approach for learning transferable representations. However, most existing approaches are optimized for reconstruction or forecasting objectives and often fail to capture the semantic structure required for downstream classification and reasoning tasks. While recent sensor-language alignment methods improve semantic generalization through captioning and zero-shot transfer, they are limited to fixed sensor configurations, such as predefined channel sets, signal lengths, or temporal resolutions, which hinders cross-domain applicability. To address these gaps, we introduce SLIP (Sensor Language-Informed Pretraining), an open-source framework for learning language-aligned representations that generalize across diverse sensor setups. SLIP integrates contrastive alignment with sensor-conditioned captioning, facilitating both discriminative understanding and generative reasoning. By repurposing a pretrained decoder-only language model via cross-attention and introducing an elegant, flexible patch-embedder, SLIP supports different temporal resolutions and variable-length input at inference time without additional retraining. Across 11 datasets, SLIP demonstrates superior performance in zero-shot transfer, signal captioning, and question answering. It achieves a 77.14% average linear-probing accuracy, a 5.93% relative improvement over strong baselines, and reaches 64.83% accuracy in sensor-based question answering.
Chinese Translation
现代传感系统生成大量未标记的多变量时间序列数据。这种未标记数据的丰富性使得自监督学习(SSL)成为学习可转移表示的自然方法。然而,现有的大多数方法都是针对重建或预测目标进行优化,往往无法捕捉到下游分类和推理任务所需的语义结构。尽管最近的传感器-语言对齐方法通过字幕生成和零样本转移改善了语义泛化,但它们仅限于固定的传感器配置,例如预定义的通道集、信号长度或时间分辨率,这限制了跨领域的适用性。为了解决这些问题,我们提出了SLIP(Sensor Language-Informed Pretraining),这是一个开源框架,用于学习在不同传感器设置中具有语言对齐的表示。SLIP将对比对齐与传感器条件的字幕生成相结合,促进了区分性理解和生成性推理。通过利用跨注意力重新利用预训练的解码器语言模型,并引入优雅灵活的补丁嵌入器,SLIP支持不同的时间分辨率和可变长度输入,在推理时无需额外的再训练。在11个数据集上,SLIP在零样本转移、信号字幕生成和问答任务中表现出色。它实现了77.14%的平均线性探测准确率,相较于强基线提高了5.93%的相对改进,并在基于传感器的问答中达到了64.83%的准确率。
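A flexible patch embedder of the kind the abstract mentions can be sketched in its simplest form: a variable-length signal is cut into fixed-size patches, padding the remainder, so sequences of any length map to a token sequence without retraining. The patch size and padding value below are illustrative choices, not SLIP's actual configuration:

```python
def patchify(signal, patch_len, pad_value=0.0):
    """Return equal-length patches covering the whole signal, tail-padded."""
    patches = []
    for start in range(0, len(signal), patch_len):
        patch = signal[start:start + patch_len]
        patch = patch + [pad_value] * (patch_len - len(patch))  # pad the tail
        patches.append(patch)
    return patches

# A 7-sample signal becomes 3 patches of length 3 at inference time; a
# 12-sample signal from another sensor would simply become 4 patches.
tokens = patchify([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], patch_len=3)
```

Because each patch is projected independently into the model's embedding space, only the number of tokens changes with signal length, which is what makes variable-length inference possible without architectural changes.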
cs.AI / 55 / 2603.11974
Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-agent AI
规范共同基础复制(NormCoRe):通过翻译进行多智能体人工智能中的规范研究
Abstract
In the late 2010s, the fashion trend NormCore framed sameness as a signal of belonging, illustrating how norms emerge through collective coordination. Today, similar forms of normative coordination can be observed in systems based on Multi-agent Artificial Intelligence (MAAI), as AI-based agents deliberate, negotiate, and converge on shared decisions in fairness-sensitive domains. Yet, existing empirical approaches often treat norms as targets for alignment or replication, implicitly assuming equivalence between human subjects and AI agents and leaving collective normative dynamics insufficiently examined. To address this gap, we propose Normative Common Ground Replication (NormCoRe), a novel methodological framework to systematically translate the design of human subject experiments into MAAI environments. Building on behavioral science, replication research, and state-of-the-art MAAI architectures, NormCoRe maps the structural layers of human subject studies onto the design of AI agent studies, enabling systematic documentation of study design and analysis of norms in MAAI. We demonstrate the utility of NormCoRe by replicating a seminal experimental study on distributive justice, in which participants negotiate fairness principles under a "veil of ignorance". We show that normative judgments in AI agent studies can differ from human baselines and are sensitive to the choice of the foundation model and the language used to instantiate agent personas. Our work provides a principled pathway for analyzing norms in MAAI and helps to guide, reflect, and document design choices whenever AI agents are used to automate or support tasks formerly carried out by humans.
Chinese Translation
在2010年代末,时尚潮流NormCore将相似性视为归属感的信号,展示了规范如何通过集体协调而产生。如今,在基于多智能体人工智能(MAAI)的系统中,可以观察到类似的规范协调形式,因为基于人工智能的代理在公平敏感领域进行审议、谈判并达成共享决策。然而,现有的实证方法往往将规范视为对齐或复制的目标,隐含地假设人类受试者与人工智能代理之间的等价性,从而使集体规范动态的研究不足。为了解决这一问题,我们提出了规范共同基础复制(NormCoRe),一种新的方法论框架,旨在系统地将人类受试者实验的设计转化为MAAI环境。NormCoRe基于行为科学、复制研究和最先进的MAAI架构,将人类受试者研究的结构层映射到人工智能代理研究的设计上,从而实现对研究设计的系统记录和对MAAI中规范的分析。我们通过复制一项关于分配公正的开创性实验研究来展示NormCoRe的实用性,该研究中参与者在“无知之幕”下谈判公平原则。我们显示,人工智能代理研究中的规范判断可能与人类基线不同,并且对基础模型的选择和用于实例化代理角色的语言敏感。我们的工作为分析MAAI中的规范提供了一条有原则的路径,并有助于在使用人工智能代理自动化或支持以前由人类执行的任务时指导、反思和记录设计选择。
cs.AI / 56 / 2603.11987
LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
LABSHIELD:用于科学实验室安全关键推理和规划的多模态基准
Abstract
Artificial intelligence is increasingly catalyzing scientific automation, with multimodal large language model (MLLM) agents evolving from lab assistants into self-driving lab operators. This transition imposes stringent safety requirements on laboratory environments, where fragile glassware, hazardous substances, and high-precision laboratory equipment render planning errors or misinterpreted risks potentially irreversible. However, the safety awareness and decision-making reliability of embodied agents in such high-stakes settings remain insufficiently defined and evaluated. To bridge this gap, we introduce LABSHIELD, a realistic multi-view benchmark designed to assess MLLMs in hazard identification and safety-critical reasoning. Grounded in U.S. Occupational Safety and Health Administration (OSHA) standards and the Globally Harmonized System (GHS), LABSHIELD establishes a rigorous safety taxonomy spanning 164 operational tasks with diverse manipulation complexities and risk profiles. We evaluate 20 proprietary models, 9 open-source models, and 3 embodied models under a dual-track evaluation framework. Our results reveal a systematic gap between general-domain MCQ accuracy and Semi-open QA safety performance, with models exhibiting an average drop of 32.0% in professional laboratory scenarios, particularly in hazard interpretation and safety-aware planning. These findings underscore the urgent necessity for safety-centric reasoning frameworks to ensure reliable autonomous scientific experimentation in embodied laboratory contexts. The full dataset will be released soon.
Chinese Translation
人工智能正日益推动科学自动化,多模态大型语言模型(MLLM)代理从实验室助手演变为自驾实验室操作员。这一转变对实验室环境提出了严格的安全要求,因为脆弱的玻璃器皿、有害物质和高精度实验室设备使得规划错误或误解风险可能造成不可逆转的后果。然而,在这种高风险环境中,具身代理的安全意识和决策可靠性仍然定义和评估不足。为填补这一空白,我们引入了LABSHIELD,一个旨在评估MLLM在危险识别和安全关键推理方面的现实多视角基准。LABSHIELD基于美国职业安全健康管理局(OSHA)标准和全球统一分类系统(GHS),建立了一个涵盖164个操作任务的严格安全分类法,涉及多样的操作复杂性和风险特征。我们在双轨评估框架下评估了20个专有模型、9个开源模型和3个具身模型。我们的结果揭示了通用领域多项选择题(MCQ)准确性与半开放问答(Semi-open QA)安全性能之间的系统性差距,模型在专业实验室场景中平均下降了32.0%,尤其是在危险解释和安全意识规划方面。这些发现强调了建立以安全为中心的推理框架的紧迫性,以确保在具身实验室环境中可靠的自主科学实验。完整数据集将很快发布。
cs.AI / 57 / 2603.11992
Few-for-Many Personalized Federated Learning
Abstract
Personalized Federated Learning (PFL) aims to train customized models for clients with highly heterogeneous data distributions while preserving data privacy. Existing approaches often rely on heuristics like clustering or model interpolation, which lack principled mechanisms for balancing heterogeneous client objectives. Serving $M$ clients with distinct data distributions is inherently a multi-objective optimization problem, where achieving optimal personalization ideally requires $M$ distinct models on the Pareto front. However, maintaining $M$ separate models poses significant scalability challenges in federated settings with hundreds or thousands of clients. To address this challenge, we reformulate PFL as a few-for-many optimization problem that maintains only $K$ shared server models ($K \ll M$) to collectively serve all $M$ clients. We prove that this framework achieves near-optimal personalization: the approximation error diminishes as $K$ increases, and each client's model converges to its own optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the $K$ server models through efficient gradient-based updates. Unlike clustering-based approaches that require manual client partitioning or interpolation-based methods that demand careful hyperparameter tuning, FedFew automatically discovers the optimal model diversity through its optimization process. Experiments across vision, NLP, and real-world medical imaging datasets demonstrate that FedFew, with just 3 models, consistently outperforms other state-of-the-art approaches. Code is available at https://github.com/pgg3/FedFew.
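The few-for-many setting above, where $K$ shared models collectively serve $M$ clients, can be illustrated with a toy assignment step: each client is matched to whichever of the $K$ models has the lowest loss on its own data. This is only an illustration of the assignment idea, not the FedFew algorithm itself (which jointly optimizes the $K$ models with gradient updates); the function name and loss values below are hypothetical.

```python
# Toy few-for-many assignment: K shared models serve M clients, and each
# client picks the model with the lowest loss on its own data. FedFew also
# optimizes the K models jointly; only the matching step is sketched here.

def assign_clients(client_losses):
    """client_losses[m][k]: loss of server model k on client m's data."""
    return [min(range(len(row)), key=row.__getitem__) for row in client_losses]

losses = [
    [0.9, 0.2, 0.7],  # client 0 is best served by model 1
    [0.1, 0.8, 0.5],  # client 1 by model 0
    [0.6, 0.4, 0.3],  # client 2 by model 2
    [0.2, 0.9, 0.8],  # client 3 by model 0
]
print(assign_clients(losses))  # [1, 0, 2, 0]
```

With $K \ll M$, several clients share each model, which is exactly the scalability trade-off the abstract analyzes.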
cs.AI / 58 / 2603.12011
Can RL Improve Generalization of LLM Agents? An Empirical Study
强化学习能否改善大语言模型代理的泛化能力?一项实证研究
Abstract
Reinforcement fine-tuning (RFT) has shown promise for training LLM agents to perform multi-turn decision-making based on environment feedback. However, most existing evaluations remain largely in-domain: training and testing are conducted in the same environment or even on the same tasks. In real-world deployment, agents may operate in unseen environments with different background knowledge, observation spaces, and action interfaces. To characterize the generalization profile of RFT under such shifts, we conduct a systematic study along three axes: (1) within-environment generalization across task difficulty, (2) cross-environment transfer to unseen environments, and (3) sequential multi-environment training to quantify transfer and forgetting. Our results show that RFT generalizes well across task difficulty within an environment, but exhibits weaker transfer to unseen environments, which correlates with shifts in both semantic priors and observation/action interfaces. In contrast, sequential training yields promising downstream gains with minimal upstream forgetting, and mixture training across environments improves the overall balance. We further provide detailed analyses and deeper insights, and hope our work helps the community develop and deploy generalizable LLM agents.
Chinese Translation
强化微调(Reinforcement Fine-Tuning, RFT)在基于环境反馈训练大语言模型(LLM)代理进行多轮决策方面显示出良好的前景。然而,现有的大多数评估仍然主要集中在同一领域内:训练和测试在相同的环境中进行,甚至在相同的任务上进行。在实际部署中,代理可能在未见过的环境中操作,这些环境具有不同的背景知识、观察空间和动作接口。为了描述RFT在这种变化下的泛化特征,我们沿着三个轴线进行系统研究:(1)在任务难度范围内的环境内泛化,(2)向未见环境的跨环境迁移,以及(3)顺序多环境训练以量化迁移和遗忘。我们的结果表明,RFT在环境内的任务难度之间泛化良好,但对未见环境的迁移能力较弱,这与语义先验和观察/动作接口的变化相关。相比之下,顺序训练在最小化上游遗忘的同时带来了有希望的下游收益,而跨环境的混合训练则改善了整体平衡。我们进一步提供了详细的分析和更深层次的见解,希望我们的工作能帮助社区开发和部署具有良好泛化能力的大语言模型代理。
cs.AI / 59 / 2603.12056
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
XSkill:多模态智能体的经验与技能的持续学习
Abstract
Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, providing concise action-level guidance for tool selection and decision making, and skills, providing structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experience and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation to form a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis reveals that the two knowledge streams play complementary roles in influencing the reasoning behaviors of agents and show superior zero-shot generalization.
Chinese Translation
多模态智能体现在能够利用多种工具处理复杂的推理任务,但在开放式环境中,它们仍然面临工具使用效率低下和协调不灵活的问题。一个核心挑战是使这些智能体能够在不更新参数的情况下,通过学习过去的轨迹不断改进。我们确定了实现这一目标所需的两种互补的可重用知识形式:经验,提供简明的行动级指导以进行工具选择和决策;技能,提供结构化的任务级指导以进行规划和工具使用。为此,我们提出了XSkill,一个用于多模态智能体从经验和技能中进行持续学习的双流框架。XSkill将知识提取和检索建立在视觉观察的基础上。在积累阶段,XSkill通过视觉基础的摘要和跨路径批评,从多路径回放中提炼和巩固经验与技能。在推理阶段,它检索并适应这些知识以符合当前的视觉上下文,并将使用历史反馈到积累中,形成一个持续学习的循环。在五个不同领域的基准测试中,使用四种基础模型进行评估,XSkill始终显著优于仅使用工具和基于学习的基线。进一步的分析表明,这两种知识流在影响智能体的推理行为方面发挥了互补作用,并显示出优越的零样本泛化能力。
cs.AI / 60 / 2603.12096
A Robust and Efficient Multi-Agent Reinforcement Learning Framework for Traffic Signal Control
一种稳健高效的多智能体强化学习框架用于交通信号控制
Abstract
Reinforcement Learning (RL) in Traffic Signal Control (TSC) faces significant hurdles in real-world deployment due to limited generalization to dynamic traffic flow variations. Existing approaches often overfit static patterns and use action spaces incompatible with driver expectations. This paper proposes a robust Multi-Agent Reinforcement Learning (MARL) framework validated in the Vissim traffic simulator. The framework integrates three mechanisms: (1) Turning Ratio Randomization, a training strategy that exposes agents to dynamic turning probabilities to enhance robustness against unseen scenarios; (2) a stability-oriented Exponential Phase Duration Adjustment action space, which balances responsiveness and precision through cyclical, exponential phase adjustments; and (3) a Neighbor-Based Observation scheme utilizing the MAPPO algorithm with Centralized Training with Decentralized Execution (CTDE). By leveraging centralized updates, this approach approximates the efficacy of global observations while maintaining scalable local communication. Experimental results demonstrate that our framework outperforms standard RL baselines, reducing average waiting time by over 10%. The proposed model exhibits superior generalization in unseen traffic scenarios and maintains high control stability, offering a practical solution for adaptive signal control.
Chinese Translation
在交通信号控制(TSC)中,强化学习(RL)在实际应用中面临显著挑战,因为其对动态交通流变化的泛化能力有限。现有的方法往往过拟合静态模式,并使用与驾驶员期望不兼容的动作空间。本文提出了一种稳健的多智能体强化学习(MARL)框架,并在Vissim交通模拟器中进行了验证。该框架集成了三种机制:(1)转向比例随机化,这是一种训练策略,旨在使智能体接触动态转向概率,以增强其对未见场景的鲁棒性;(2)稳定性导向的指数相位持续时间调整动作空间,通过周期性、指数级的相位调整来平衡响应性和精确性;(3)基于邻居的观察方案,利用MAPPO算法与集中训练和分散执行(CTDE)相结合。通过利用集中更新,该方法在保持可扩展的本地通信的同时,近似全局观测的有效性。实验结果表明,我们的框架在标准RL基线之上表现优越,平均等待时间减少超过10%。所提出的模型在未见交通场景中表现出更好的泛化能力,并保持高控制稳定性,为自适应信号控制提供了实用解决方案。
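One plausible reading of the "Exponential Phase Duration Adjustment" action space described above is that discrete actions scale the current phase duration by exponentially spaced factors, clipped to safe bounds. The base factor, bounds, and action range below are illustrative assumptions, not values from the paper:

```python
# Illustrative exponential phase-duration adjustment: a discrete action
# multiplies the current green-phase duration by base**action, then the
# result is clipped to [min_s, max_s]. All constants are assumptions.

def adjust_phase(duration, action, base=1.25, min_s=5.0, max_s=90.0):
    """action in {-2, -1, 0, 1, 2}: scale duration by base**action."""
    return min(max_s, max(min_s, duration * base ** action))

print(adjust_phase(30.0, 1))    # 37.5
print(adjust_phase(30.0, -2))   # ≈ 19.2
print(adjust_phase(30.0, -10))  # clipped to 5.0
```

Exponential spacing gives the agent both fine adjustments near the current duration and coarse ones far from it, which matches the abstract's stated balance between responsiveness and precision.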
cs.AI / 61 / 2603.12109
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
关于大语言模型(LLM)智能体主动推理中的信息自锁定
Abstract
Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning where agents need to strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand the phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities will limit the information exploration during RL training. Furthermore, insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To resolve the issue, we propose a simple yet effective approach that reallocates the learning signal by injecting easy- to-obtain directional critiques to help the agent escape self-locking. Extensive experiments with 7 datasets show that our approach significantly mitigates the information self-locking, bringing up to 60% improvements.
Chinese Translation
基于结果的强化学习(RL)在训练大语言模型(LLM)智能体以应对复杂推理任务方面取得了显著成功。然而,在主动推理中,智能体需要战略性地提问以获取与任务相关的信息,我们发现经过RL训练的LLM智能体常常遭遇信息自锁定现象:智能体停止提出有信息量的问题,并且难以内化已获得的信息。为了理解这一现象,我们将主动推理分解为两个核心能力:行动选择(Action Selection, AS),决定通过查询获取的观察流,以及信念跟踪(Belief Tracking, BT),根据收集到的证据更新智能体的信念。我们展示了AS和BT能力不足将限制RL训练中的信息探索。此外,探索不足又反过来阻碍AS和BT的改进,形成一个将智能体锁定在低信息状态的反馈循环。为了解决这一问题,我们提出了一种简单而有效的方法,通过注入易于获得的方向性批评来重新分配学习信号,帮助智能体摆脱自锁定。通过对7个数据集的广泛实验,我们的方案显著减轻了信息自锁定,提升幅度可达60%。
cs.AI / 62 / 2603.12129
Increasing intelligence in AI agents can worsen collective outcomes
提高人工智能代理的智能水平可能会恶化集体结果
Abstract
When resources are scarce, will a population of AI agents coordinate in harmony, or descend into tribal chaos? Diverse decision-making AI from different developers is entering everyday devices -- from phones and medical devices to battlefield drones and cars -- and these AI agents typically compete for finite shared resources such as charging slots, relay bandwidth, and traffic priority. Yet their collective dynamics and hence risks to users and society are poorly understood. Here we study AI-agent populations as the first system of real agents in which four key variables governing collective behaviour can be independently toggled: nature (innate LLM diversity), nurture (individual reinforcement learning), culture (emergent tribe formation), and resource scarcity. We show empirically and mathematically that when resources are scarce, AI model diversity and reinforcement learning increase dangerous system overload, though tribe formation lessens this risk. Meanwhile, some individuals profit handsomely. When resources are abundant, the same ingredients drive overload to near zero, though tribe formation makes the overload slightly worse. The crossover is arithmetical: it is where opposing tribes that form spontaneously first fit inside the available capacity. More sophisticated AI-agent populations are not better: whether their sophistication helps or harms depends entirely on a single number -- the capacity-to-population ratio -- that is knowable before any AI-agent ships.
Chinese Translation
当资源稀缺时,人工智能代理群体会协调一致,还是会陷入部落混乱?来自不同开发者的多样化决策人工智能正在进入日常设备——从手机和医疗设备到战场无人机和汽车——这些人工智能代理通常会竞争有限的共享资源,如充电插槽、转发带宽和交通优先权。然而,它们的集体动态以及对用户和社会的风险尚不清楚。在这里,我们研究人工智能代理群体作为首个真实代理系统,其中四个关键变量可以独立调整:自然(固有的LLM多样性)、培养(个体强化学习)、文化(新兴部落形成)和资源稀缺性。我们通过实证和数学方法表明,当资源稀缺时,人工智能模型的多样性和强化学习会增加系统过载的危险,尽管部落形成会降低这一风险。同时,一些个体会获得丰厚的利润。当资源丰富时,相同的因素会将过载驱动至接近零,尽管部落形成会使过载略微加重。这个交叉点是算术性的:它是自发形成的对立部落首次适应可用容量的地方。更复杂的人工智能代理群体并不一定更好:它们的复杂性是否有益或有害完全取决于一个数字——容量与人口比率——这一比率在任何人工智能代理发货之前都是可知的。
cs.AI / 63 / 2603.12133
TopoBench: Benchmarking LLMs on Hard Topological Reasoning
TopoBench:对大型语言模型在困难拓扑推理上的基准测试
Abstract
Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain of thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns like premature commitment and constraint forgetting have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.
Chinese Translation
解决拓扑网格难题需要对全局空间不变量进行推理,例如连通性、环闭合和区域对称性,这对于即使是最强大的大型语言模型(LLMs)来说仍然具有挑战性。为了在受控环境下研究这些能力,我们引入了TopoBench,这是一个涵盖三种难度级别的六个难题家族的基准测试。我们在TopoBench上评估了强推理能力的LLMs,发现即使是前沿模型也仅能解决不到四分之一的困难实例,其中两个家族几乎未被解决。为了调查这些失败是否源于推理限制或从空间约束中提取和维持的困难,我们对750条思维链进行了注释,并建立了一个错误分类法,揭示了四种潜在的因果失败模式,然后通过模拟每种错误类型的有针对性干预进行测试。这些干预表明,某些错误模式如过早承诺和约束遗忘直接影响解决难题的能力,而重复推理则是搜索的良性效应。最后,我们研究了包括提示指导、单元对齐网格表示和基于工具的约束检查在内的缓解策略,发现瓶颈在于从空间表示中提取约束,而不是在其上进行推理。代码和数据可在github.com/mayug/topobench-benchmark获取。
cs.AI / 64 / 2603.12188
Compiling Temporal Numeric Planning into Discrete PDDL+: Extended Version
将时间数值规划编译为离散 PDDL+: 扩展版
Abstract
Since the introduction of the PDDL+ modeling language, it has been known that temporal planning with durative actions (as in PDDL 2.1) can be compiled into PDDL+. However, no practical compilation has been presented in the literature since then. We present a practical compilation from temporal planning with durative actions into PDDL+, fully capturing the semantics and assuming only the non-self-overlapping of actions. Our compilation is polynomial, retains the plan length up to a constant factor, and is experimentally shown to be of practical relevance for hard temporal numeric problems.
Chinese Translation
自从 PDDL+ 建模语言引入以来,已知具有持续动作的时间规划(如 PDDL 2.1 中所示)可以编译为 PDDL+。然而,自那时以来,文献中并未提出任何实用的编译方法。我们提出了一种将具有持续动作的时间规划实用编译为 PDDL+ 的方法,完全捕捉其语义,并仅假设动作之间不自我重叠。我们的编译是多项式的,保持计划长度在一个常数因子内,并在实验中显示出对困难的时间数值问题具有实际相关性。
cs.AI / 65 / 2603.12224
Portfolio of Solving Strategies in CEGAR-based Object Packing and Scheduling for Sequential 3D Printing
基于CEGAR的对象打包与调度解决策略组合在顺序3D打印中的应用
Abstract
Computing power that decades ago was available only in supercomputers, especially their parallelism, is now available in standard personal computer CPUs, even in CPUs for mobile telephones. We show how to effectively utilize the computing power of a modern multi-core personal computer CPU to solve the complex combinatorial problem of object arrangement and scheduling for sequential 3D printing. We achieve this by parallelizing the existing CEGAR-SEQ algorithm, which solves sequential object arrangement and scheduling by expressing it as a linear arithmetic formula that is then solved by a technique inspired by counterexample-guided abstraction refinement (CEGAR). The original CEGAR-SEQ algorithm uses an object arrangement strategy that places objects towards the center of the printing plate. We propose alternative object arrangement strategies, such as placing objects towards a corner of the printing plate and scheduling objects according to their height. Our parallelization is done at the high level: we execute the CEGAR-SEQ algorithm in parallel with a portfolio of object arrangement strategies, an algorithm we call Portfolio-CEGAR-SEQ. Our experimental evaluation indicates that Portfolio-CEGAR-SEQ outperforms the original CEGAR-SEQ. When a batch of objects for multiple printing plates is scheduled, Portfolio-CEGAR-SEQ often uses fewer printing plates than CEGAR-SEQ.
Chinese Translation
几十年前,计算能力仅在超级计算机中可用,尤其是它们的并行性,而如今这种能力已经在标准个人计算机的CPU中普遍存在,甚至在移动电话的CPU中也能找到。我们展示了如何有效利用现代多核个人计算机CPU的计算能力,以解决顺序3D打印中对象排列与调度的复杂组合问题。我们通过将现有的CEGAR-SEQ算法并行化来实现这一目标,该算法通过将顺序对象排列与调度表达为线性算术公式,然后采用受反例引导的抽象细化(CEGAR)启发的技术进行求解。原始的CEGAR-SEQ算法使用了一种将对象放置在打印板中心的排列策略。我们提出了替代的对象排列策略,例如将对象放置在打印板的角落,并根据对象的高度进行调度。我们的并行化是在高层次上进行的,我们以组合的对象排列策略并行执行CEGAR-SEQ算法,这一算法称为Portfolio-CEGAR-SEQ。我们的实验评估表明,Portfolio-CEGAR-SEQ的性能优于原始的CEGAR-SEQ。当调度一批用于多个打印板的对象时,Portfolio-CEGAR-SEQ通常使用的打印板数量少于CEGAR-SEQ。
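The high-level portfolio parallelization described above can be sketched generically: run one solving strategy per worker and return as soon as any strategy finds a solution. The strategy functions below are stand-ins for the arrangement strategies (center, corner, height-ordered), not the actual CEGAR-SEQ implementation:

```python
# Minimal portfolio-solver sketch: each strategy runs in its own worker and
# the first one to return a non-None solution wins. The strategies here are
# dummies standing in for real object-arrangement heuristics.
import concurrent.futures

def solve_portfolio(problem, strategies):
    with concurrent.futures.ThreadPoolExecutor(len(strategies)) as pool:
        futures = {pool.submit(s, problem): s.__name__ for s in strategies}
        for fut in concurrent.futures.as_completed(futures):
            result = fut.result()
            if result is not None:          # this strategy solved the instance
                return futures[fut], result
    return None, None                       # every strategy failed

def towards_center(problem):
    return None                             # fails on this toy instance

def towards_corner(problem):
    return ["obj A", "obj B"]               # finds an arrangement

name, plan = solve_portfolio("plate-1", [towards_center, towards_corner])
print(name, plan)
```

A production version would also cancel the remaining workers once a solution arrives; the sketch omits that for brevity.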
cs.AI / 66 / 2603.12246
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
在不可验证的 LLM 后训练中考察推理 LLM 作为评判者
Abstract
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.
Chinese Translation
推理 LLM 作为评判者可以利用推理时的扩展,提供了一条有希望的路径,将推理模型的成功扩展到输出正确性/质量无法直接检查的不可验证领域。然而,尽管推理评判者在静态评估基准上表现更佳,但其在实际策略训练中的有效性尚未得到系统性检验。因此,我们进行了一项严格的研究,以调查非推理评判者和推理评判者在基于强化学习的 LLM 对齐中的实际影响。我们的受控合成环境中,一个“黄金标准”评判者(gpt-oss-120b)提供偏好注释以训练较小的评判者,揭示了非推理评判者和推理评判者之间的关键差异:非推理评判者容易导致奖励黑客行为,而推理评判者则能导致在黄金标准评判者评估下表现强劲的策略。有趣的是,我们发现推理评判者训练的策略通过学习生成高度有效的对抗性输出来实现如此强的表现,这些输出在 Arena-Hard 等流行基准上也能获得良好评分,且能够欺骗其他 LLM 评判者。结合我们的进一步分析,本研究强调了在不可验证的 LLM 后训练中应用(推理)LLM 评判者的重要发现和改进空间。
cs.CL / 1 / 2603.11053
Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple
推测解码规模法则 (SDSL):简化吞吐量优化
Abstract
Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.
Chinese Translation
推测解码是一种利用多个语言模型加速推理的技术。以往的研究采用实验方法来优化推理管道的吞吐量,这涉及到大规模语言模型(LLM)的训练,且成本较高。本研究针对推测解码提出了一种理论,分析性地将预训练LLM的关键超参数与基于推测解码(SD)的下游推理系统的吞吐量效率联系起来。该理论允许在预训练之前预测推理系统各组件的吞吐量最优超参数。
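The paper's exact theory is not reproduced here, but analytic throughput models for speculative decoding typically start from the classic expected-tokens result from the speculative sampling literature: with per-token acceptance rate alpha and draft length k, one target-model verification step yields (1 - alpha**(k+1)) / (1 - alpha) tokens in expectation. The simple cost model below is an illustrative assumption built on that formula:

```python
# Standard speculative-decoding accounting: expected tokens emitted per
# target verification step, and a toy speedup model where one draft token
# costs `draft_cost` relative to one target step (an assumed cost model).

def expected_tokens(alpha, k):
    """Expected accepted tokens per verification step (0 <= alpha < 1)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, draft_cost):
    """Tokens per unit time relative to plain autoregressive decoding."""
    return expected_tokens(alpha, k) / (1 + k * draft_cost)

print(round(expected_tokens(0.8, 4), 3))   # ≈ 3.362
print(round(speedup(0.8, 4, 0.05), 3))     # ≈ 2.801
```

Maximizing such an expression over the draft model's size and k, before any model is trained, is the kind of prediction the abstract's "scaling laws" aim at.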
cs.CL / 2 / 2603.11067
Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation
在发言前进行总结:ARACH——一种无需训练的推理时插件,通过全局注意力重新分配增强大型语言模型
Abstract
Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
Chinese Translation
大型语言模型(LLMs)表现出色,但进一步提升通常需要昂贵的训练。这激发了对后训练技术的日益关注,特别是那些在推理时无需更新权重的无训练方法。大多数无训练方法将模型视为黑箱,通过输入/输出级别的干预(如提示设计和通过重复采样、重新排序/验证或搜索进行的测试时间缩放)来改善输出。相比之下,它们很少提供一种即插即用的机制来干预模型的内部计算。我们提出了ARACH(通过自适应上下文中心进行注意力重新分配),这是一种无需训练的推理时插件,通过自适应上下文中心增强LLMs,以聚合上下文并重新分配注意力。在多个语言建模任务中的广泛实验表明,ARACH在保持适度推理开销和无参数更新的情况下,始终实现了性能提升。注意力分析进一步表明,ARACH减轻了注意力汇聚现象。这些结果表明,工程化模型的内部计算提供了一种独特的推理时策略,与基于提示的测试时间方法和基于训练的后训练方法有根本区别。
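The idea of routing attention mass through an aggregated "hub" can be illustrated with a minimal NumPy sketch: append one extra key/value pair that summarizes the context (here, a simple mean), so part of the attention distribution flows through that aggregate. This is our own toy construction to convey the intuition, not ARACH's actual mechanism, whose hub is adaptive rather than a plain average:

```python
# Toy "context hub" attention: one synthetic key/value summarizing the
# context is appended before softmax, so some attention mass is reallocated
# through the hub. The mean-pooling hub is an illustrative assumption.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_hub(q, K, V):
    hub_k, hub_v = K.mean(axis=0), V.mean(axis=0)   # crude context summary
    K2, V2 = np.vstack([K, hub_k]), np.vstack([V, hub_v])
    w = softmax(K2 @ q)                             # weights over n+1 slots
    return w @ V2

rng = np.random.default_rng(0)
q = rng.normal(size=3)
K, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
print(attend_with_hub(q, K, V).shape)  # (3,)
```

Because the hub competes for attention mass, tokens that would otherwise dump probability on a "sink" position can route it through the aggregate instead, which is the effect the abstract's attention analysis points at.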
cs.CL / 3 / 2603.11193
DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning
DeReason:一种基于难度的课程改善了解耦的SFT-然后-RL训练以增强通用推理能力
Abstract
Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
Chinese Translation
带有可验证奖励的强化学习(RLVR)已成为激发大型语言模型推理能力的强大范式,特别是在数学和编程领域。尽管近期的努力已将这一范式扩展到更广泛的科学(STEM)领域,但在这些背景下,监督微调(SFT)与RL之间的复杂相互作用仍然未被充分探讨。本文通过控制实验揭示了一个关键挑战:对于一般STEM领域,直接对基础模型应用RL的样本效率极低,并且在中等质量的响应上始终被监督微调(SFT)所超越。然而,顺序进行SFT后再进行RL可以进一步提高性能,这表明这两个阶段发挥着互补的作用,并且训练数据在它们之间的分配方式至关重要。因此,我们提出了DeReason,一种基于难度的数据解耦策略以增强通用推理能力。DeReason通过基于LLM评分估计的推理强度将训练数据分为推理密集型和非推理密集型子集。它将广覆盖的非推理密集型问题分配给SFT,以建立基础领域知识,并将一小部分困难问题保留给RL,以培养复杂推理。我们证明,这种有原则的解耦比随机拆分数据进行顺序SFT和RL能获得更好的性能。在一般STEM和数学基准上的大量实验表明,我们的解耦课程训练显著优于仅使用SFT、仅使用RL和随机拆分的基线。我们的工作系统性地研究了SFT与RL在通用推理中的相互作用,提供了一种高效且通用的后训练方案。
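The difficulty-based decoupling step described above can be sketched as a simple routing rule: examples scored for reasoning intensity (in the paper, via LLM-based scoring) go either to a broad SFT pool or to a harder RL pool. The field names, scores, and threshold below are hypothetical:

```python
# Sketch of difficulty-aware data decoupling: low reasoning-intensity items
# feed the SFT stage (broad domain knowledge), high-intensity items are
# reserved for RL (complex reasoning). Scores and threshold are made up.

def decouple(dataset, threshold=0.5):
    sft_pool = [ex for ex in dataset if ex["score"] <= threshold]
    rl_pool = [ex for ex in dataset if ex["score"] > threshold]
    return sft_pool, rl_pool

data = [
    {"id": "q1", "score": 0.2},  # factual recall        -> SFT
    {"id": "q2", "score": 0.9},  # multi-step derivation -> RL
    {"id": "q3", "score": 0.4},  # light reasoning       -> SFT
]
sft, rl = decouple(data)
print([ex["id"] for ex in sft], [ex["id"] for ex in rl])  # ['q1', 'q3'] ['q2']
```

The abstract's finding is precisely that this principled split outperforms routing the same examples to SFT and RL at random.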
cs.CL / 4 / 2603.11223
MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries
MDER-DR:基于实体中心摘要的多跳问答
Abstract
Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at https://github.com/DataSciencePolimi/MDER-DR_RAG.
Chinese Translation
基于知识图谱(KG)的检索增强生成(RAG)面临一个问题,即索引方法在将文本简化为三元组时可能会丧失重要的上下文细微差别,从而降低下游问答(QA)任务的性能,尤其是对于多跳问答,这需要从多个实体、事实或关系中组合答案。我们提出了一种与领域无关的基于KG的问答框架,涵盖了索引和检索/推理阶段。一种新的索引方法称为Map-Disambiguate-Enrich-Reduce(MDER),生成基于上下文的三元组描述,并随后将其与实体级摘要集成,从而避免在问答检索阶段显式遍历图中的边。与此相辅相成的是,我们引入了Decompose-Resolve(DR),一种检索机制,它将用户查询分解为可解析的三元组,并通过迭代推理将其与KG进行关联。MDER和DR共同构成了一个由大型语言模型驱动的问答管道,能够有效应对稀疏、不完整和复杂的关系数据。实验表明,在标准和特定领域的基准测试中,MDER-DR相较于标准RAG基线取得了显著的提升(最高可达66%),同时保持了跨语言的鲁棒性。我们的代码可在https://github.com/DataSciencePolimi/MDER-DR_RAG获取。
cs.CL / 5 / 2603.11228
Markovian Generation Chains in Large Language Models
大型语言模型中的马尔可夫生成链
Abstract
The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
Chinese Translation
大型语言模型(LLMs)的广泛使用引发了一个重要问题:当文本被LLMs反复处理时,文本是如何演变的?在本文中,我们将这一迭代推理过程定义为马尔可夫生成链,其中每一步都将特定的提示模板和前一步的输出作为输入,而不包含任何先前的记忆。在迭代改写和往返翻译实验中,输出要么收敛到一个小的递归集合,要么在有限的范围内持续产生新句子。通过句子级马尔可夫链建模和模拟数据分析,我们展示了迭代过程可以根据温度参数和初始输入句子等因素,增加或减少句子的多样性。这些结果为迭代LLM推理的动态及其对多智能体LLM系统的影响提供了宝贵的见解。
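The memoryless iteration described above, where each step sees only the previous output, can be made concrete with a tiny deterministic stand-in: repeatedly apply a rewrite map and report when the output re-enters a previously seen state (the "recurrent set"). A real chain would call an LLM at each step; the toy map below is an assumption so the dynamics are reproducible:

```python
# Deterministic stand-in for a Markovian generation chain: iterate a rewrite
# map from x0 and detect when the trajectory first revisits a state, i.e.
# when it has entered its recurrent set.

def iterate_until_recurrent(step, x0, max_steps=100):
    seen = {x0: 0}                      # state -> first-visit time
    x = x0
    for t in range(1, max_steps + 1):
        x = step(x)
        if x in seen:                   # re-entered a visited state
            return list(seen)[seen[x]:], t
        seen[x] = t
    return None, max_steps              # no recurrence within the horizon

toy_rewrite = {"a": "b", "b": "c", "c": "b"}.get
cycle, t = iterate_until_recurrent(toy_rewrite, "a")
print(cycle, t)  # ['b', 'c'] 3
```

With a stochastic step (e.g., nonzero sampling temperature) the trajectory may instead keep producing novel sentences over a finite horizon, which is the second regime the abstract observes.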
cs.CL / 6 / 2603.11254
Artificial Intelligence for Sentiment Analysis of Persian Poetry
用于波斯诗歌情感分析的人工智能
Abstract
Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analysing, and creating textual data. These language models open a significant opportunity for analyzing literature, and more specifically poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models in grasping the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's. Comparing the utilization of poetic meters also highlighted Rumi's superiority in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied to computer-based semantic studies where human interpretation is not required, thereby significantly reducing potential biases in the analysis.
Chinese Translation
近年来,人工智能(AI)的进步促使大型语言模型(LLMs)的发展,这些模型能够理解、分析和生成文本数据。这些语言模型为分析文学,尤其是诗歌,提供了重要的机会。在本研究中,我们采用了多种基于双向编码器表示的变换器(BERT)和生成预训练变换器(GPT)的语言模型,分析了两位杰出波斯诗人:贾拉尔·阿尔·丁·穆罕默德·鲁米(Rumi)和帕尔文·埃特萨米的作品。本研究的主要目的是探讨现代语言模型在理解波斯诗歌复杂性方面的能力,并探索诗歌情感与其韵律之间的潜在关联。我们的研究结果表明,GPT4o语言模型可以可靠地用于波斯诗歌的分析。此外,我们的情感分析结果显示,整体而言,鲁米的诗歌表达的情感比帕尔文·埃特萨米的诗歌更为快乐。此外,对诗歌韵律的利用比较突显了鲁米在使用韵律表达更广泛情感方面的优势。这些发现具有重要意义,因为它们确认了大型语言模型可以有效应用于进行计算机基础的语义研究,在这些研究中不需要人类的解释,从而显著减少分析中的潜在偏见。
cs.CL / 7 / 2603.11281
ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions
ThReadMed-QA:来自真实患者问题的多轮医学对话基准
Abstract
Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
Chinese Translation
医学问答基准主要评估单轮交流,未能捕捉到真实患者咨询中迭代、寻求澄清的特性。我们引入了ThReadMed-QA,这是一个包含2,437个完整回答的患者-医生对话线程的基准,这些对话线程提取自r/AskDocs,涵盖了多达9轮的8,204个问答对。与以往依赖模拟对话、对抗性提示或考试风格问题的研究不同,ThReadMed-QA捕捉了真实患者的后续问题和经过验证的医生回答,反映了患者如何自然地在线寻求医学信息。我们在238个对话(948个问答对)的分层测试集上评估了五种最先进的语言模型(LLMs)——GPT-5、GPT-4o、Claude Haiku、Gemini 2.5 Flash和Llama 3.3 70B,使用基于医生真实数据的校准LLM作为评判标准。即使是最强的模型GPT-5,其完全正确回答的比例也仅为41.2%。所有五个模型在第0轮到第2轮之间显著下降(p < 0.001),错误回答率在第三轮大约增加了三倍。我们发现单轮能力与多轮可靠性之间存在根本性的张力:初始表现最强的模型(GPT-5:75.2;Claude Haiku:72.3,满分100)在第2轮的下降幅度最大(分别下降16.2和25.0分),而较弱的模型则趋于平稳或略有改善。我们引入了两个指标来量化多轮失败模式:对话一致性评分(CCS)和错误传播率(EPR)。CCS显示,近三分之一的Claude Haiku对话在同一线程中在完全正确和完全错误的回答之间摇摆。EPR表明,单次错误的轮次使得后续错误轮次的概率在所有模型中增加了1.9-6.1倍。
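The Error Propagation Rate idea above admits a natural formalization: compare the probability that a turn is wrong given the previous turn was wrong against the probability that it is wrong given the previous turn was correct. The paper's exact definition may differ; the version below is one plausible reading with made-up data:

```python
# One plausible formalization of Error Propagation Rate (EPR): the ratio of
# P(wrong | previous wrong) to P(wrong | previous correct), pooled over all
# adjacent turn pairs. Conversation data here is synthetic.

def error_propagation_rate(conversations):
    """conversations: per-thread lists of booleans, True = wrong answer."""
    after_wrong, after_right = [], []
    for conv in conversations:
        for prev, cur in zip(conv, conv[1:]):
            (after_wrong if prev else after_right).append(cur)
    p_wrong = sum(after_wrong) / len(after_wrong)
    p_right = sum(after_right) / len(after_right)
    return p_wrong / p_right   # > 1: an error raises the odds of the next one

convs = [
    [False, True, True, True],
    [False, False, False, True],
    [False, True, True, False],
]
print(error_propagation_rate(convs))  # ≈ 1.25
```

An EPR in the 1.9-6.1x range reported in the abstract would correspond to much stronger error clustering than this toy data shows.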
cs.CL / 8 / 2603.11295
Temporal Text Classification with Large Language Models
基于大型语言模型的时间文本分类
Abstract
Languages change over time. Computational models can be trained to recognize such changes enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot and few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models but that they still fail to match the performance delivered by proprietary LLMs.
Chinese Translation
语言随着时间而变化。计算模型可以被训练以识别这些变化,从而使其能够估计文本的出版日期。尽管大型语言模型(LLMs)最近取得了进展,但它们在文本自动定年方面的表现,即时间文本分类(TTC),尚未得到探索。本研究首次对领先的专有模型(Claude 3.5、GPT-4o、Gemini 1.5)和开源模型(LLaMA 3.2、Gemma 2、Mistral、Nemotron 4)在TTC任务上的表现进行了系统评估,使用了三个历史语料库,其中两个为英语,一个为葡萄牙语。我们测试了零样本和少样本提示,以及微调设置。我们的结果表明,专有模型表现良好,尤其是在少样本提示下。结果还表明,微调显著改善了开源模型的表现,但它们仍未能达到专有LLMs所提供的性能水平。
cs.CL / 9 / 2603.11342
Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation
通过注意力引导的知识蒸馏评估神经机器翻译中的可解释人工智能归因方法
Abstract
The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.
Chinese Translation
输入特征对神经网络模型输出的归因研究是一个活跃的研究领域。尽管已经提出了许多可解释人工智能(XAI)技术来解释这些模型,但对这些方法在序列到序列(seq2seq)模型中的系统化和自动化评估仍然较少探索。本文介绍了一种在基于变换器的seq2seq模型中评估可解释性方法的新方法。我们使用教师导出的归因图作为结构化的辅助信号来指导学生模型,并通过学生模拟目标的能力量化不同归因方法的效用。利用Inseq库,我们提取源-目标序列对的归因分数,并将这些分数注入到学生变换器模型的注意力机制中,采用四种组合运算符(加法、乘法、平均和替换)。在三个语言对(德-英、法-英、阿-英)以及来自Marian-MT和mBART模型的归因中,Attention、Value Zeroing和Layer Gradient × Activation相较于基线始终产生了最大的BLEU增益(以及相应的chrF改进)。相比之下,其他基于梯度的方法(Saliency、Integrated Gradients、DeepLIFT、Input × Gradient、GradientShap)则导致较小且不太一致的改进。这些结果表明,不同的归因方法捕捉到不同的信号,而基于注意力的归因更好地捕捉到seq2seq模型中源和目标表示之间的对齐。最后,我们引入了一种Attributor变换器,该变换器在给定源-目标对的情况下,学习重建教师的归因图。我们的研究结果表明,Attributor越准确地重现归因图,这些图的注入对下游任务的帮助就越大。源代码可以在GitHub上找到。
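The four composition operators can be pictured as elementwise combinations of the student's attention weights with the teacher attribution map. The sketch below assumes row-normalized attention and uses plain-list matrices; the renormalization step and names are illustrative, not taken from the paper.

```python
def inject_attribution(attn, attrib, op):
    """Compose a teacher attribution map with the student's attention
    weights under one of the four operators named in the abstract
    (addition, multiplication, averaging, replacement), then
    renormalize each row so it still sums to one."""
    ops = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
        "avg": lambda a, b: (a + b) / 2,
        "replace": lambda a, b: b,
    }
    f = ops[op]
    out = [[f(a, b) for a, b in zip(row_a, row_b)]
           for row_a, row_b in zip(attn, attrib)]
    return [[v / max(sum(row), 1e-9) for v in row] for row in out]

attn = [[0.8, 0.2], [0.5, 0.5]]     # student attention (toy 2x2)
attrib = [[0.1, 0.9], [0.5, 0.5]]   # teacher attribution map
print(inject_attribution(attn, attrib, "replace")[0])  # [0.1, 0.9]
```

In a real transformer these would be per-head tensors combined before the softmax-weighted value mixing; the list form just makes the operator semantics explicit.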
cs.CL / 10 / 2603.11394
Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning
别再听我说了!多轮对话如何降低诊断推理能力
Abstract
Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.
Chinese Translation
患者和临床医生越来越多地使用由大型语言模型(LLMs)驱动的聊天机器人进行医疗咨询。尽管最先进的LLMs在静态诊断推理基准测试中表现出色,但它们在多轮对话中的有效性尚未得到充分研究,而多轮对话更能反映现实世界的使用情况。本文评估了17个LLMs在三个临床数据集上的表现,以探讨将决策空间划分为多个简单对话轮次如何影响它们的诊断推理。具体而言,我们开发了一种“坚持或切换”(stick-or-switch)评估框架,以测量模型的信念(即捍卫正确诊断或安全放弃对错误建议的反应)和灵活性(即在引入正确建议时的识别能力)。我们的实验揭示了对话成本(conversation tax),即与单次基线相比,多轮交互的表现持续下降。值得注意的是,模型经常放弃初始的正确诊断和安全放弃,以迎合错误的用户建议。此外,多个模型表现出盲目切换,未能区分信号和错误建议。
cs.CL / 11 / 2603.11412
Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
粒子滤波器在句子处理中的算法后果:放大花园小径效应和深入挖掘效应
Abstract
Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.
Chinese Translation
根据惊讶理论,语言表征仅通过惊讶的瓶颈影响处理难度。我们对惊讶的最佳估计来自大型语言模型(LLM),这些模型没有对结构歧义的明确表征。尽管LLM的惊讶在不同语言中稳健地预测了阅读时间,但当结构期望被违反时,它系统性地低估了难度——这表明歧义的表征在句子处理中的因果作用。粒子滤波器模型提供了一种替代方案,其中结构假设被明确表示为有限数量的粒子。我们证明了粒子滤波器模型的几个算法后果,包括花园小径效应的放大。最关键的是,我们展示了重采样这一常见实践固有地产生实时深入挖掘效应——在模糊区域的长度增加时,消歧难度也随之增加。深入挖掘的幅度与粒子数量成反比:完全并行的模型预测不会出现这种效应。
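The digging-in claim follows from a basic property of resampling: once the correct structural hypothesis drops out of the particle set it can never be reintroduced, and the chance of dropout grows with every ambiguous word. A toy simulation (two hypotheses, illustrative likelihoods, not the paper's actual model) makes this concrete.

```python
import random

def correct_parse_survival(n_particles, region_len, p_gp=0.7,
                           n_runs=300, seed=0):
    """Fraction of runs in which at least one particle still carries
    the (initially dispreferred) correct parse after the ambiguous
    region.  Each particle holds one of two structural hypotheses;
    every word in the region favors the garden-path reading with
    likelihood p_gp, after which particles are multinomially
    resampled.  Once the correct hypothesis leaves the particle set,
    resampling can never bring it back -- the digging-in effect."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_runs):
        particles = ["correct" if rng.random() > p_gp else "gp"
                     for _ in range(n_particles)]
        for _ in range(region_len):
            weights = [p_gp if p == "gp" else 1.0 - p_gp
                       for p in particles]
            particles = rng.choices(particles, weights=weights,
                                    k=n_particles)
        survived += "correct" in particles
    return survived / n_runs

for length in (1, 4, 8):
    print(length, correct_parse_survival(n_particles=10, region_len=length))
```

Survival falls as the ambiguous region lengthens, so disambiguation gets harder with region length; with many more particles the effect vanishes, consistent with the abstract's claim that fully parallel models predict no digging-in.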
cs.CL / 12 / 2603.11414
MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models
MaterialFigBENCH:用于评估多模态大型语言模型在大学级材料科学问题解决能力的基准数据集
Abstract
We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
Chinese Translation
我们提出了MaterialFigBench,这是一个旨在评估多模态大型语言模型(LLMs)解决需要准确解读图形的大学级材料科学问题能力的基准数据集。与现有的主要依赖文本表示的基准不同,MaterialFigBench专注于那些图形(如相图、应力-应变曲线、阿伦尼乌斯图、衍射图样和微观结构示意图)对于得出正确答案至关重要的问题。该数据集包含137个自由回答问题,改编自标准材料科学教科书,涵盖了包括晶体结构、机械性能、扩散、相图、相变和材料的电子特性等广泛主题。为了应对从图像中读取数值时不可避免的模糊性,在适当的地方提供了专家定义的答案范围。我们评估了几种最先进的多模态LLMs,包括通过OpenAI API访问的ChatGPT和GPT模型,并分析了它们在不同问题类别和模型版本中的表现。结果显示,尽管整体准确性随着模型更新而提高,但当前的LLMs在真正的视觉理解和材料科学图形的定量解读方面仍然存在困难。在许多情况下,正确答案是依赖于记忆的领域知识而非通过阅读提供的图像获得的。MaterialFigBench突显了在视觉推理、数值精度和有效数字处理方面的持续弱点,同时也识别出性能有所改善的问题类型。该基准为推动材料科学中的多模态推理能力提供了系统的领域特定基础,并为未来具有更强图形理解能力的LLMs的发展提供指导。
cs.CL / 13 / 2603.11415
BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion
BLooP:使用大语言模型进行零样本抽象摘要生成的双元组前瞻提升
Abstract
Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at https://github.com/varuniyer/BLooP
Chinese Translation
抽象摘要要求模型生成能够传达源文档信息的摘要。虽然大型语言模型可以在不进行微调的情况下生成摘要,但它们往往会遗漏关键信息并包含多余的信息。我们提出了BLooP(双元组前瞻提升),这是一种简单的无训练解码干预方法,旨在鼓励大型语言模型(LLMs)生成源文档中的双元组。BLooP通过在每个解码步骤中进行哈希表查找来操作,无需训练、微调或模型修改。我们在CNN/DM、CCSum、Multi-News和SciTLDR上展示了Llama-3.1-8B-Instruct、Mistral-Nemo-Instruct-2407和Gemma-2-9b-it在ROUGE和BARTScore上的改进。人类评估表明,BLooP显著提高了摘要的忠实度,而没有降低可读性。我们在https://github.com/varuniyer/BLooP上提供了代码。
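The mechanism is simple enough to sketch: precompute a hash table from each source token to its observed successors, then at each decoding step boost candidate tokens that would complete a source-document bigram with the previously generated token. The additive bonus and its magnitude are assumptions; the abstract specifies only a hash-table lookup per step.

```python
def build_bigram_table(source_tokens):
    """Hash table mapping each source token to the set of tokens that
    follow it somewhere in the source document."""
    table = {}
    for a, b in zip(source_tokens, source_tokens[1:]):
        table.setdefault(a, set()).add(b)
    return table

def promote_bigrams(logits, prev_token, table, bonus=2.0):
    """Add a fixed bonus to the logit of every candidate token that
    would complete a source-document bigram with the previously
    generated token.  The additive-bonus form and its magnitude are
    illustrative choices, not taken from the paper."""
    followers = table.get(prev_token, set())
    return {tok: score + (bonus if tok in followers else 0.0)
            for tok, score in logits.items()}

source = "the cat sat on the mat".split()
table = build_bigram_table(source)
logits = {"mat": 1.0, "rug": 1.2, "dog": 0.5}   # toy next-token logits
boosted = promote_bigrams(logits, prev_token="the", table=table)
print(max(boosted, key=boosted.get))  # mat
```

Because "the mat" occurs in the source, "mat" overtakes the otherwise-preferred "rug"; tokens outside the follower set are untouched, which is why the intervention needs no training or model modification.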
cs.CL / 14 / 2603.11446
LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction
基于大型语言模型的法律判决预测中的因果结构消歧和因素提取
Abstract
Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.
Chinese Translation
基于预训练语言模型的法律判决预测(LJP)主流方法严重依赖于案件事实与判决结果之间的统计相关性。这一范式缺乏对法律构成要素和潜在因果逻辑的明确建模,使得模型容易学习到虚假的相关性,并且在鲁棒性方面表现不佳。虽然引入因果推断可以缓解这一问题,但现有的因果LJP方法在现实法律文本中面临两个关键瓶颈:法律因素提取不准确且噪声严重,以及由于稀疏特征下的马尔可夫等价性导致的因果结构发现的不确定性。为了解决这些挑战,我们提出了一种增强的因果推断框架,该框架将大型语言模型(LLM)先验与统计因果发现相结合。首先,我们设计了一种粗到细的混合提取机制,结合统计采样和LLM语义推理,以准确识别和净化标准法律构成要素。其次,为了解决结构不确定性,我们引入了一种LLM辅助的因果结构消歧机制。通过将LLM作为受限的先验知识库,我们对模糊的因果方向进行概率评估和修剪,以生成符合法律要求的候选因果图。最后,通过显式约束文本注意力强度与生成的因果图,我们构建了一个因果感知的判决预测模型。在多个基准数据集(包括LEVEN、QA和CAIL)上的广泛实验表明,我们提出的方法在预测准确性和鲁棒性方面显著优于最先进的基线,特别是在区分混淆指控方面表现突出。
cs.CL / 15 / 2603.11495
Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs
尝试、检查与重试:一种提升大型语言模型长上下文工具调用性能的分治框架
Abstract
Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.
Chinese Translation
工具调用使大型语言模型(LLMs)能够与外部环境进行交互。然而,当前的方法在处理长上下文工具调用任务中的大量噪声候选工具时常常面临挑战,限制了其在现实世界中的应用。为此,我们提出了Tool-DC,一种提升LLMs工具调用性能的分治框架。Tool-DC的核心在于通过“尝试-检查-重试”(Try-Check-Retry)范式来降低推理难度,并充分利用LLMs的自我反思能力。具体而言,Tool-DC包含两个变体:1)无训练的Tool-DC(TF),具有即插即用和灵活性;2)基于训练的Tool-DC(TB),在推理效率上更具优势。大量实验表明,两种Tool-DC方法均显著优于其对手。Tool-DC(TF)在BFCL和ACEBench基准测试中相较于基线带来了高达+25.10%的平均增益,而Tool-DC(TB)使Qwen2.5-7B能够实现与专有LLMs(如OpenAI o3和Claude-Haiku-4.5)相当甚至更好的性能。
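The Try-Check-Retry paradigm can be sketched as a small control loop; `try_fn` and `check_fn` below stand in for the LLM's tool-call attempt and its self-reflection, and the retry budget is an illustrative choice (the real framework also divides the candidate tool set before this loop runs).

```python
def try_check_retry(candidate_tools, try_fn, check_fn, max_retries=3):
    """Schematic of the Try-Check-Retry paradigm: attempt a tool call,
    let the model self-check the result, and retry with feedback on
    failure.  try_fn(tools, feedback) returns a candidate tool call;
    check_fn(call) returns (ok, feedback)."""
    feedback = None
    for attempt in range(max_retries):
        call = try_fn(candidate_tools, feedback)   # Try
        ok, feedback = check_fn(call)              # Check
        if ok:
            return call, attempt + 1
    return None, max_retries                       # retries exhausted

# Toy run: the first attempt picks the wrong tool, the check rejects it,
# and the retry succeeds.
attempts = iter(["search_web", "get_weather"])
result = try_check_retry(
    ["get_weather", "search_web"],
    try_fn=lambda tools, fb: next(attempts),
    check_fn=lambda call: (call == "get_weather", "wrong tool"),
)
print(result)  # ('get_weather', 2)
```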
cs.CL / 16 / 2603.11510
Tiny Aya: Bridging Scale and Multilingual Depth
Tiny Aya:连接规模与多语言深度
Salamanca, Alejandro R., Abagyan, Diana, D'souza, Daniel, Khairi, Ammar, Mora, David, Dash, Saurabh, Aryabumi, Viraat, Rajaee, Sara, Mofakhami, Mehrnaz, Sahu, Ananya, Euyang, Thomas, Prince, Brittawnya, Smith, Madeline, Lin, Hangyu, Locatelli, Acyr, Hooker, Sara, Kocmi, Tom, Gomez, Aidan, Zhang, Ivan, Blunsom, Phil, Frosst, Nick, Pineau, Joelle, Ermis, Beyza, Üstün, Ahmet, Kreutzer, Julia, Fadaee, Marzieh
Abstract
Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware post-training, it delivers state-of-the-art translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.
Chinese Translation
Tiny Aya 重新定义了小型多语言语言模型的成就。该模型在70种语言上进行训练,并通过区域感知的后训练进行了优化,凭借仅33.5亿个参数,提供了最先进的翻译质量、强大的多语言理解能力和高质量的目标语言生成。发布内容包括一个预训练的基础模型、一个全球平衡的指令调优变体,以及三个针对非洲、南亚、欧洲、亚太地区和西亚语言的区域专用模型。本报告详细介绍了 Tiny Aya 的训练策略、数据组成和全面评估框架,并提出了一条多语言人工智能的替代扩展路径:以效率为中心,在各语言之间实现平衡性能,并便于实际部署。
cs.CL / 17 / 2603.11513
Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale
小型语言模型能否利用其检索到的信息?一个关于模型规模下检索利用的实证研究
Abstract
Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models of 7B parameters or less can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5-large-v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval-quality failure. We find three main results. First, even with oracle retrieval, models of 7B parameters or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2,588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net-negative trade-off under standard evaluation conditions.
Chinese Translation
检索增强生成(Retrieval Augmented Generation, RAG)被广泛应用于提高语言模型的事实准确性,但尚不清楚参数规模为70亿或更小的小型模型是否能够有效利用检索到的信息。为了解决这个问题,我们评估了五种模型规模,从3.6亿到80亿,涵盖了三种架构系列:SmolLM2、Qwen2.5和Llama 3.1,测试了四种检索条件,包括无检索、BM25、使用E5 large v2的稠密检索,以及保证检索到的段落包含答案的oracle检索。我们引入了一种参数化知识划分,将模型能够独立回答的问题与需要外部知识的问题分开,从而使我们能够将利用失败与检索质量失败区分开来。我们发现了三个主要结果。首先,即使在oracle检索的情况下,参数规模为70亿或更小的模型在无法独立回答的问题上也有85%到100%的时间无法提取正确答案,这表明存在根本的利用瓶颈。其次,添加检索上下文会导致模型之前已知的答案减少42%到100%,这表明上下文的存在而非其质量驱动了干扰效应。第三,对2588个oracle失败的错误分析显示,主要的失败模式是无关生成,即模型完全忽略了提供的上下文。这些模式在多种提示模板和检索方法中均保持一致。结果表明,对于参数规模低于70亿的模型,RAG的主要限制是上下文利用,而非检索质量,并且在标准评估条件下,在此规模下部署RAG可能导致净负面效应。
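The parametric knowledge split is a straightforward partition: query each question closed-book first and separate out the questions the model already answers correctly. A schematic version, with a dictionary standing in for the model:

```python
def parametric_knowledge_split(questions, answer_fn, is_correct):
    """Partition questions into those the model answers correctly with
    no retrieved context ('parametric') and those it cannot ('needs
    retrieval').  answer_fn maps a question string to the model's
    closed-book answer; is_correct compares it to the gold answer.
    A schematic of the evaluation protocol described in the abstract."""
    parametric, needs_retrieval = [], []
    for q in questions:
        bucket = (parametric
                  if is_correct(answer_fn(q["question"]), q["gold"])
                  else needs_retrieval)
        bucket.append(q)
    return parametric, needs_retrieval

qs = [
    {"question": "capital of France?", "gold": "paris"},
    {"question": "capital of Mali?", "gold": "bamako"},
]
# Hypothetical closed-book model: knows one answer, misses the other.
fake_model = {"capital of France?": "Paris", "capital of Mali?": "timbuktu"}
known, unknown = parametric_knowledge_split(
    qs, fake_model.get, lambda a, g: a.lower() == g)
print(len(known), len(unknown))  # 1 1
```

Utilization failure is then measured only on the `unknown` bucket, which is what lets the study separate "the retriever found nothing useful" from "the model ignored a passage that contained the answer."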
cs.CL / 18 / 2603.11545
One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries
一个监督者,多种模态:用于自主查询的自适应工具编排
Abstract
We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves 72% reduction in time-to-accurate-answer, 85% reduction in conversational rework, and 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.
Chinese Translation
我们提出了一种代理式人工智能框架,用于自主的多模态查询处理,该框架协调文本、图像、音频、视频和文档模态的专业工具。中央监督者动态地分解用户查询,将子任务委派给适合的模态工具(例如,物体检测、光学字符识别(OCR)、语音转录),并通过自适应路由策略而非预定决策树综合结果。对于仅包含文本的查询,该框架通过RouteLLM学习路由,而非文本路径则使用SLM辅助的模态分解。在对15个任务类别的2,847个查询进行评估后,我们的框架在准确答案的时间上减少了72%,在对话重做上减少了85%,在成本上减少了67%,同时保持了准确性的一致性。这些结果表明,智能集中编排从根本上改善了多模态人工智能的部署经济性。
cs.CL / 19 / 2603.11564
Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries
重要性在于位置而非内容:通过位置感知伪查询解码对齐的KV缓存压缩
Abstract
The Key-Value (KV) cache is crucial for efficient inference in Large Language Models (LLMs), but excessively long contexts drastically increase the KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., nearly lossless performance of 99.5% on NIAH with a 3% KV cache budget).
Chinese Translation
键值(KV)缓存对于高效的大型语言模型(LLMs)推理至关重要,但过长的上下文会显著增加KV缓存的内存占用。现有的KV缓存压缩方法通常依赖于输入侧提示观察窗口内的注意力模式,在预填充阶段估计标记的重要性。然而,由于这些评估并不是基于解码过程得出的,它们未能保留未来生成所需的关键标记。直观上,有效的观察窗口应当反映解码阶段的查询,以准确反映生成过程将关注哪些标记。然而,在推理过程中,真实的解码查询本质上是不可用的。为了构建伪查询以近似真实查询,我们发现位置信息在此过程中比语义内容更为重要。基于这一洞察,我们提出了一种通过位置感知伪查询进行解码对齐的KV缓存压缩方法(DapQ),这是一种新颖且轻量的驱逐框架,利用位置感知伪查询来模拟输出标记,从而建立一个有效的观察窗口以进行重要性评估。该方法与实际生成上下文紧密对齐,能够实现精确的标记驱逐。在多个基准和LLMs上的广泛评估表明,DapQ在严格的内存限制下(例如,在3%的KV缓存预算下,NIAH上达到了近乎无损的99.5%性能)表现优越。
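The eviction step reduces to scoring cached keys against a pseudo query and keeping the top-budget tokens. The sketch below uses plain dot-product attention scores; the abstract does not give the construction of the position-aware pseudo query, so the query vector here is a stand-in.

```python
import math

def keep_topk_tokens(keys, pseudo_query, budget):
    """Score each cached token by its scaled dot-product attention
    score under a pseudo query, and retain only the top `budget`
    entries.  Building the pseudo query from positional information
    rather than prompt semantics is the paper's key idea; the query
    construction itself is not specified in the abstract."""
    d = len(pseudo_query)
    scores = [sum(q * k for q, k in zip(pseudo_query, key)) / math.sqrt(d)
              for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # indices of retained KV entries

# Toy cache: five cached key vectors, keep a budget of two.
keys = [[0.1, 0.0], [0.9, 0.2], [0.0, 1.0], [0.4, 0.4], [0.05, 0.9]]
pseudo_query = [1.0, 0.0]   # hypothetical position-derived query
print(keep_topk_tokens(keys, pseudo_query, budget=2))  # [1, 3]
```

In a real model this scoring would run per head over the cached key tensors, and the evicted entries' values would simply be freed from the cache.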
cs.CL / 20 / 2603.11578
Streaming Translation and Transcription Through Speech-to-Text Causal Alignment
通过语音到文本因果对齐进行流式翻译和转录
Abstract
Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on English-to-Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
Chinese Translation
传统上,同时机器翻译(SiMT)依赖于离线机器翻译模型以及人工设计的启发式方法或学习策略。我们提出了Hikari,一个无策略的完全端到端模型,通过将读取/写入决策编码到概率WAIT令牌机制中,实现同时的语音到文本翻译和流式转录。我们还引入了解码器时间膨胀(Decoder Time Dilation)机制,该机制减少了自回归开销,并确保了平衡的训练分布。此外,我们提出了一种监督微调策略,训练模型从延迟中恢复,显著改善了质量与延迟的权衡。在英语到日语、德语和俄语的评估中,Hikari在低延迟和高延迟环境下都达到了新的最先进的BLEU分数,超越了最近的基准。
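The WAIT-token mechanism amounts to a policy-free decode loop: the model's own next token decides between READ and WRITE. In the real system the WAIT token is sampled probabilistically from the decoder's output distribution; `step_fn` below is a deterministic stand-in.

```python
WAIT = "<wait>"

def simultaneous_decode(source_chunks, step_fn, max_steps=100):
    """Policy-free streaming loop: the model itself emits a WAIT token
    when it needs more input.  step_fn(read_so_far, written_so_far)
    returns the next token; emitting WAIT triggers a READ action, and
    any other token is a WRITE action."""
    read, written = [], []
    src = iter(source_chunks)
    for _ in range(max_steps):
        tok = step_fn(read, written)
        if tok == WAIT:
            chunk = next(src, None)
            if chunk is None:        # source exhausted, nothing to read
                break
            read.append(chunk)       # READ
        elif tok == "<eos>":
            break
        else:
            written.append(tok)      # WRITE
    return written

# Toy stand-in model: wait until two chunks are read, then echo them.
def toy_model(read, written):
    if len(read) < 2:
        return WAIT
    if len(written) < len(read):
        return read[len(written)]
    return "<eos>"

print(simultaneous_decode(["hello", "world", "again"], toy_model))
```

The quality-latency trade-off in the abstract corresponds to how eagerly the trained model emits WAIT: more WAITs mean more context before each WRITE, at the cost of higher latency.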
cs.CL / 21 / 2603.11583
UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization
效用最大化提示:多目标大型语言模型优化的形式框架
Abstract
The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
Chinese Translation
大型语言模型(LLM)任务的成功在很大程度上依赖于其提示。大多数用例使用自然语言来指定提示,而当必须同时满足多个目标时,自然语言本质上是模糊的。在本文中,我们引入了效用最大化提示(UtilityMax Prompting),这是一个使用形式数学语言来指定任务的框架。我们将任务重构为一个影响图,其中LLM的答案是唯一的决策变量。在图中的条件概率分布上定义了一个效用函数,并指示LLM找到最大化期望效用的答案。这使得LLM必须明确地推理目标的每个组成部分,将其输出指向一个精确的优化目标,而不是主观的自然语言解释。我们在MovieLens 1M数据集上对三种前沿模型(Claude Sonnet 4.6、GPT-5.4和Gemini 2.5 Pro)验证了我们的方法,展示了在多目标电影推荐任务中,相较于自然语言基线在精确度和归一化折现累积增益(NDCG)方面的一致性提升。
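At its core the framework is ordinary expected-utility maximization with the answer as the sole decision variable. A minimal sketch follows, with illustrative states, probabilities, and objective weights (none taken from the paper):

```python
def expected_utility(answer, states, p_state, utility):
    """E[U | answer] = sum_s P(s) * U(answer, s): the quantity a
    UtilityMax prompt instructs the LLM to maximize."""
    return sum(p_state[s] * utility(answer, s) for s in states)

def utilitymax(answers, states, p_state, utility):
    """The answer is the sole decision variable in the influence
    diagram; choose the one with the highest expected utility."""
    return max(answers,
               key=lambda a: expected_utility(a, states, p_state, utility))

# Hypothetical two-objective recommendation: relevance and diversity,
# weighted 0.7 / 0.3.  All numbers are illustrative.
states = ["likes_action", "likes_drama"]
p_state = {"likes_action": 0.7, "likes_drama": 0.3}
scores = {  # (movie, state) -> (relevance, diversity)
    ("Heat", "likes_action"): (0.9, 0.2), ("Heat", "likes_drama"): (0.4, 0.2),
    ("Amelie", "likes_action"): (0.2, 0.8), ("Amelie", "likes_drama"): (0.8, 0.8),
}
def utility(movie, state):
    relevance, diversity = scores[(movie, state)]
    return 0.7 * relevance + 0.3 * diversity

print(utilitymax(["Heat", "Amelie"], states, p_state, utility))  # Heat
```

The point of the formal specification is that the trade-off between objectives is pinned down by the utility function rather than left to the model's interpretation of phrases like "prefer relevant but varied recommendations."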
cs.CL / 22 / 2603.11597
Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese
开源大型语言模型在辅助日语病理报告撰写中的性能评估
Abstract
The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.
Chinese Translation
大型语言模型(LLMs)在支持日语病理报告撰写方面的性能尚未得到探索。我们从三个方面评估了七个开源LLMs的表现:(A)按照预定义格式生成和提取病理诊断文本的信息,(B)纠正日语病理报告中的排版错误,以及(C)病理学家和临床医生对模型生成的解释性文本的主观评估。思维模型和医学专业模型在需要推理的结构化报告任务和排版错误纠正方面表现出优势。相反,评审者对解释性输出的偏好差异显著。尽管LLMs的实用性因任务而异,我们的研究结果表明,开源LLMs在有限但临床相关的场景中可以有效辅助日语病理报告的撰写。
cs.CL / 23 / 2603.11650
QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
QChunker:通过多智能体辩论学习领域RAG的问答感知文本块划分
Abstract
The effectiveness of retrieval-augmented generation (RAG) is fundamentally bounded by the semantic integrity and information granularity of the text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. First, QChunker models text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, a text segmenter, an integrity reviewer, and a knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to address the long evaluation chains and low efficiency of existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution using ChunkScore. Extensive experimental results across four heterogeneous domains show that QChunker effectively resolves the aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.
Chinese Translation
检索增强生成(RAG)的有效性上限在根本上受到其知识库中文本块的语义完整性和信息粒度的限制。为了解决这些挑战,本文提出了QChunker,它将RAG范式从检索增强重构为理解-检索-增强。首先,QChunker将文本块划分建模为文本分割和知识补全的复合任务,以确保文本块的逻辑一致性和完整性。受到Hal Gregersen的“问题即答案”理论的启发,我们设计了一个多智能体辩论框架,包含四个专业组件:问题大纲生成器、文本分割器、完整性审查器和知识补全器。该框架的运作原则是问题作为深刻见解的催化剂。通过这一流程,我们成功构建了一个包含45K条目的高质量数据集,并将这一能力转移到小型语言模型上。此外,为了处理现有文本块评估方法中长评估链和低效率的问题,这些方法过于依赖下游问答任务,我们引入了一种新颖的直接评估指标ChunkScore。理论和实验验证均表明,ChunkScore能够直接且高效地区分文本块的质量。此外,在文本分割阶段,我们利用文档大纲进行多路径采样,以生成多个候选块,并采用ChunkScore选择最佳方案。在四个异构领域的广泛实验结果表明,QChunker通过提供更具逻辑一致性和信息丰富的文本块,有效解决了上述问题。
cs.CL / 24 / 2603.11665
Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
增强多模态大语言模型作为评判者的多任务强化学习
Abstract
Multimodal Large Language Models (MLLMs) have been widely adopted as MLLM-as-a-Judges due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.
Chinese Translation
多模态大语言模型(MLLMs)因其在各种视觉任务中与人类判断的高度一致性而被广泛应用于作为评判者的角色。然而,现有的大多数评判模型主要针对单任务场景进行优化,难以在多样化的上下文中进行泛化,这对于可靠评估至关重要。为了解决这一局限性,我们提出了多任务强化学习框架,用于MLLM作为评判者(MT-RL-Judge),该框架通过联合优化多个任务的评判模型,利用强化学习的泛化能力。实验结果表明,MT-RL-Judge在判断一致性和与人类偏好的相关性方面均优于多个强基线。此外,我们的方法在分布外任务上表现出强大的泛化能力,进一步验证了其有效性。
cs.CL / 25 / 2603.11667
A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy
面向技术的语言与翻译行业映射:分析利益相关者的价值及其对翻译教育的潜在影响
Abstract
This paper examines how value is constructed and negotiated in today's increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman's framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.
Chinese Translation
本文探讨了在当今日益自动化的语言与翻译行业中,价值是如何构建和协商的。研究基于在LT-LiDER项目中收集的来自二十九位行业利益相关者的访谈数据,分析了人类价值、技术价值、效率和适应性在不同专业角色中的表达。利用切斯特曼(Chesterman)的翻译伦理及相关价值框架作为分析视角,本文显示,面向效率的技术价值与服务伦理相一致,已成为自动化生产环境中的基本期望,在这些环境中,速度、可扩展性和可交付性主导评估标准。同时,人类价值并未被取代,而是被重新定位,主要通过嵌入技术介导工作流程中的专业知识、监督、问责和情境判断而显现。一个核心发现是适应性作为连接人类与技术领域的中介价值的重要性。适应性被构建为核心专业要求,反映了翻译人员在应对不断发展的工具和组织需求时,持续调整其技能、角色和身份的期望。本文认为,自动化重塑而非取代翻译价值,创造了一种相互依赖的配置,其中技术效率促进了人类的沟通工作。
cs.CL / 26 / 2603.11686
In the LLM era, Word Sense Induction remains unsolved
在大型语言模型时代,词义归纳仍未解决
Abstract
In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and the number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. The latter surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.
Chinese Translation
在缺乏带有词义标注的数据的情况下,词义归纳(Word Sense Induction, WSI)成为词义消歧(Word Sense Disambiguation, WSD)的一个引人注目的替代方案,特别是在资源匮乏或特定领域的环境中。本文强调了当前WSI评估中的方法论问题。我们提出在一个基于SemCor的数据集上进行评估,尊重原始语料库的多义性和频率分布。我们评估了不同词性下的预训练嵌入和聚类算法,并提出并评估了一种基于大型语言模型(LLM)的英语WSI方法。我们评估了数据增强来源(LLM生成的、语料库和词典),以及使用维基词典进行数据增强的半监督场景,包括必须链接约束、每个词条的聚类数量。我们发现没有任何无监督方法(无论是我们的还是之前的方法)超过强大的“每个词条一个聚类”(one cluster per lemma, 1cpl)启发式方法。我们还表明(i)结果和最佳系统可能因词性而异,(ii)LLM在执行此任务时存在困难,(iii)数据增强是有益的,以及(iv)利用维基词典确实有帮助。它在我们的测试集上超越了之前的最先进系统3.3%。WSI尚未解决,呼吁更好地阐明词典和LLM的词汇语义能力。
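The evaluation setup this abstract describes (scoring induced clusterings against gold senses, with the one-cluster-per-lemma heuristic as the baseline to beat) can be sketched with B-Cubed F1, a standard WSI clustering metric. The occurrences and sense labels below are toy stand-ins, not data from the paper:

```python
def bcubed_f1(pred, gold):
    """B-Cubed F1: per-item precision/recall over cluster co-membership."""
    n = len(pred)
    p_sum = r_sum = 0.0
    for i in range(n):
        same_pred = [j for j in range(n) if pred[j] == pred[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        inter = len(set(same_pred) & set(same_gold))
        p_sum += inter / len(same_pred)
        r_sum += inter / len(same_gold)
    p, r = p_sum / n, r_sum / n
    return 2 * p * r / (p + r)

# Toy occurrences of the lemma "bank", labeled with their gold senses
gold = ["river", "river", "money", "money", "money"]
one_cluster_per_lemma = [0] * len(gold)   # the strong 1cpl heuristic
induced = [0, 0, 1, 1, 0]                 # an imperfect induced clustering

baseline_score = bcubed_f1(one_cluster_per_lemma, gold)
induced_score = bcubed_f1(induced, gold)
```

On real SemCor-style data, where most occurrences belong to a dominant sense, the 1cpl baseline is hard to beat precisely because its recall term is always perfect.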
cs.CL / 27 / 2603.11687
SemBench: A Universal Semantic Framework for LLM Evaluation
SemBench:一种通用的语言模型评估语义框架
Abstract
Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.
Chinese Translation
最近,自然语言处理(NLP)领域的进展得益于大型语言模型(LLMs)的出现,这些模型展现了显著的生成和推理能力。然而,尽管取得了成功,评估这些模型的真实语义理解能力仍然是一个持续的挑战。传统基准测试如上下文中的词(Word-in-Context, WiC)有效地探测了这一能力,但其创建过程资源密集,且通常仅限于高资源语言。本文介绍了SemBench,一个自动生成合成基准测试的框架,该框架仅使用词典意义定义和句子编码器来评估LLMs的语义能力。这种方法消除了对策划示例句子的需求,使其具有可扩展性和语言独立性。我们在三种语言(英语、西班牙语和巴斯克语)中评估了SemBench,这些语言涵盖了不同的语言资源水平,并且跨越了广泛的LLMs。我们的结果显示,SemBench得出的排名与标准WiC数据集获得的排名高度相关。此外,我们的分析表明,仅需少量示例即可实现稳定且有意义的排名。总体而言,SemBench提供了一种轻量级、可适应且数据高效的框架,用于跨语言评估LLMs的语义理解能力。
cs.CL / 28 / 2603.11743
Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
半合成平行数据用于翻译质量评估:一个关于资源匮乏语言对的数据集构建案例研究
Abstract
Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet, developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains largely unsolved, due mainly to limited parallel corpora and to diverse language-dependent factors, such as those arising in morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated professionally translated English-Hebrew segments from our own resources, which were assigned the highest quality score. Controlled translation errors were introduced to address linguistic challenges, particularly regarding gender and number agreement, and we trained neural QE models, including BERT and XLM-R, on this dataset to assess sentence-level MT quality. Our findings highlight the impact of dataset size, distributional balance, and error distribution on model performance. We will describe the challenges, methodology and results of our experiments, and specify future directions aimed at improving QE performance. This research contributes to advancing QE models for under-resourced language pairs, including morphology-rich languages.
Chinese Translation
质量评估(QE)在机器翻译(MT)工作流程中扮演着至关重要的角色,因为它用于评估没有参考翻译的生成输出,并确定是否需要人工后编辑或完全重新翻译。然而,为资源匮乏语言对开发高度准确、适应性强且可靠的QE系统仍然是一个未解决的问题,这主要是由于平行语料库的有限性以及多种语言依赖因素,例如形态句法复杂的语言。本研究提出了一个用于英语到希伯来语QE的半合成平行数据集,该数据集通过基于典型语言模式的使用示例创建英语句子,使用多个MT引擎将其翻译为希伯来语,并通过基于BLEU的选择过滤输出。每个翻译段落由语言学家手动评估和打分,我们还结合了来自我们自身资源的专业翻译的英语-希伯来语段落,这些段落被赋予了最高的质量评分。为了应对语言挑战,特别是在性别和数的一致性方面,引入了控制翻译错误,并在该数据集上训练了包括BERT和XLM-R在内的神经QE模型,以评估句子级MT质量。我们的研究结果突显了数据集大小、分布平衡和错误分布对模型性能的影响。我们将描述实验的挑战、方法论和结果,并明确未来旨在提高QE性能的方向。本研究有助于推进资源匮乏语言对的QE模型,包括形态丰富的语言。
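The BLEU-based selection step can be sketched with a minimal smoothed sentence-level BLEU (add-one smoothing, stdlib only; a real pipeline would use a library implementation such as sacreBLEU): given several engine outputs for one source sentence, keep the one closest to a reference.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hyp, ref, max_n=4):
    """Minimal sentence-level BLEU with add-one smoothing."""
    hyp_t, ref_t = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        h, r = Counter(ngrams(hyp_t, n)), Counter(ngrams(ref_t, n))
        match = sum((h & r).values())          # clipped n-gram matches
        total = max(len(hyp_t) - n + 1, 0)
        log_p += math.log((match + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref_t) / max(len(hyp_t), 1)))
    return brevity * math.exp(log_p)

reference = "the cat sat on the mat"
engine_outputs = ["the cat sat on the mat", "a cat is on a mat", "mat"]
best = max(engine_outputs, key=lambda hyp: sentence_bleu(hyp, reference))
```

Sentence-level BLEU is noisy, which is why the paper pairs the automatic filter with manual scoring by a linguist.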
cs.CL / 29 / 2603.11749
Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
压缩有利于一致性,而非真理:语言模型何时以及为何偏好正确的信息
Abstract
Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression--Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M--86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.
Chinese Translation
语言模型为何有时偏好正确的陈述,即使它们是在混合质量数据上训练的?我们提出了压缩-一致性原则:下一个词的预测偏向于那些能够对训练数据提供更短且内部一致的描述的假设。只有当错误的替代选项在结构上更难以压缩时,真理偏见才会出现。我们使用小型的 GPT-2 风格字符级变换器(参数量为 3.5M 至 86M)在合成数学语料库上进行测试,这些语料库包含了正确和错误规则的受控混合。在随机错误设置中,模型在配对评估中强烈偏好正确的补全:在平衡数据下准确率为 83.1%,即使在正确规则仅占语料库 10% 的情况下,准确率也达到了 67.0%。用一个连贯但数学上不正确的规则系统替代随机错误,基本消除了这种偏好(接近随机准确率)。在一个更类似自然语言的合成世界中,效果较弱但仍然存在(57.7%)。额外实验表明,嵌入验证步骤可以恢复对正确性的偏好,即使在小规模下,而增加一致规则的数量则会逐渐提高准确率。我们的结果表明,表面上看似的“真理偏见”在很大程度上是压缩压力和对内部一致性的偏好的副作用,而非对真理的内在驱动。完整代码和数据可在 https://github.com/Rai220/compression-drives-truth 获取。
cs.CL / 30 / 2603.11772
Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents
Legal-DC:法律文件检索增强生成的基准测试
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study advances three core contributions: First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at https://github.com/legal-dc/Legal-DC.
Chinese Translation
检索增强生成(RAG)作为一种有前景的法律文件咨询技术,然而其在中国法律场景中的应用面临两个主要限制:现有基准缺乏对联合检索器-生成器评估的专业支持,主流的RAG系统往往无法适应法律条款的结构化特性。为了解决这些问题,本研究提出了三个核心贡献:首先,我们构建了Legal-DC基准数据集,包含480份法律文件(涵盖市场监管和合同管理等领域)和2,475对经过精炼的问题-答案对,每对均附有条款级别的参考标注,填补了中国法律RAG领域专业评估资源的空白。其次,我们提出了LegRAG框架,该框架将法律自适应索引(条款边界分割)与双路径自反机制相结合,以确保条款的完整性,同时提高答案的准确性。第三,我们引入了大语言模型的自动评估方法,以满足法律检索场景对高可靠性的需求。LegRAG在关键评估指标上比现有的最先进方法提高了1.3%到5.6%。本研究提供了一个专业的基准、实用的框架和实证见解,以推动中国法律RAG系统的发展。我们的代码和数据可在 https://github.com/legal-dc/Legal-DC 获取。
cs.CL / 31 / 2603.11778
Trust Oriented Explainable AI for Fake News Detection
面向信任的可解释人工智能在假新闻检测中的应用
Abstract
This article examines the application of Explainable Artificial Intelligence (XAI) in NLP based fake news detection and compares selected interpretability methods. The work outlines key aspects of disinformation, neural network architectures, and XAI techniques, with a focus on SHAP, LIME, and Integrated Gradients. In the experimental study, classification models were implemented and interpreted using these methods. The results show that XAI enhances model transparency and interpretability while maintaining high detection accuracy. Each method provides distinct explanatory value: SHAP offers detailed local attributions, LIME provides simple and intuitive explanations, and Integrated Gradients performs efficiently with convolutional models. The study also highlights limitations such as computational cost and sensitivity to parameterization. Overall, the findings demonstrate that integrating XAI with NLP is an effective approach to improving the reliability and trustworthiness of fake news detection systems.
Chinese Translation
本文探讨了可解释人工智能(XAI)在基于自然语言处理(NLP)的假新闻检测中的应用,并比较了选定的可解释性方法。研究概述了虚假信息的关键方面、神经网络架构和XAI技术,重点关注SHAP、LIME和集成梯度(Integrated Gradients)。在实验研究中,实施了分类模型,并使用这些方法进行了解释。结果表明,XAI增强了模型的透明度和可解释性,同时保持了高检测准确率。每种方法提供了不同的解释价值:SHAP提供详细的局部归因,LIME提供简单直观的解释,而集成梯度在卷积模型中表现高效。研究还强调了计算成本和对参数化敏感等局限性。总体而言,研究结果表明,将XAI与NLP相结合是提高假新闻检测系统可靠性和可信度的有效方法。
cs.CL / 32 / 2603.11780
Large Language Models for Biomedical Article Classification
用于生物医学文章分类的大型语言模型
Abstract
This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open source models, as well as selected closed source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The average PR AUC obtained over 15 challenging datasets (above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting) comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning) and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.
Chinese Translation
本研究系统深入地探讨了大型语言模型作为生物医学文章分类文本分类器的实用性。研究使用了多个小型和中型的开源模型,以及一些精选的闭源模型,并且在评估配置的范围上比大多数先前的研究更为全面:包括不同类型的提示、用于生成类别和类别概率预测的输出处理方法,以及少量示例的计数和选择方法。最成功配置的性能与传统分类算法进行了比较。在15个具有挑战性的数据集上,零样本提示的平均 PR AUC 超过 0.4,少量示例提示接近 0.5,接近于朴素贝叶斯分类器(0.5)、随机森林算法(默认设置下为 0.5 或经过超参数调优后为 0.55)和微调的变换器模型(0.5)。这些结果确认了大型语言模型作为非平凡领域文本分类器的实用性,并提供了最有前景的设置的实际建议,特别是使用输出标记概率进行类别概率预测。
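The recommendation to use output token probabilities for class probability prediction can be sketched as a softmax restricted to the tokens that name each class. The logits and token ids below are hypothetical stand-ins for a model's next-token scores, not values from the paper:

```python
import math

def class_probs_from_logits(logits, class_tokens):
    """Restrict the softmax to the tokens naming each class, yielding
    normalized class probabilities from raw next-token logits."""
    selected = {label: logits[tok] for label, tok in class_tokens.items()}
    z = max(selected.values())                      # for numerical stability
    exps = {label: math.exp(v - z) for label, v in selected.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical logits keyed by token id; say 17 = "yes", 42 = "no"
logits = {17: 2.1, 42: 0.3, 99: -1.0}
probs = class_probs_from_logits(logits, {"relevant": 17, "irrelevant": 42})
```

Unlike parsing a generated class name, this yields graded scores, which is what PR AUC evaluation needs.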
cs.CL / 33 / 2603.11838
DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining
DatedGPT:通过时间感知预训练防止大型语言模型中的前瞻性偏差
Abstract
In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.
Chinese Translation
在金融回测中,预训练于互联网规模数据的大型语言模型存在引入前瞻性偏差的风险,这会削弱其预测有效性,因为它们在训练期间可能已经看到了真实结果。为了解决这一问题,我们提出了DatedGPT,一个由十二个13亿参数的语言模型组成的系列,每个模型均从头开始训练,使用约1000亿个时间分区数据的标记,严格按照2013年至2024年的年度截止日期进行划分。我们进一步通过在通用领域和金融特定数据集上进行指令微调来增强每个模型,这些数据集经过精心策划,以遵循相同的时间边界。基于困惑度的探测确认每个模型的知识有效地受到其数据截止年份的限制,而在标准基准上的评估显示出与现有相似规模模型的竞争性能。我们提供了一个交互式网络演示,允许用户查询和比较来自不同截止年份模型的响应。
cs.CL / 34 / 2603.11881
Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language
Bielik-Minitron-7B:通过结构化剪枝和知识蒸馏压缩大型语言模型以适应波兰语
Abstract
This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
Chinese Translation
本报告详细介绍了Bielik-Minitron-7B的创建,这是一个压缩后的73.5亿参数版本的Bielik-11B-v3.0模型,特别针对欧洲语言进行了优化。我们借鉴了NVIDIA Minitron方法,采用了两阶段的压缩方法,结合了结构化混合剪枝和知识蒸馏,将模型的参数数量减少了33.4%,从110.4亿降至73.5亿。我们使用NVIDIA模型优化器进行结构剪枝,并利用NVIDIA NeMo框架进行基于logit的蒸馏以恢复质量。在蒸馏之后,模型经历了一套严格的对齐流程,包括监督微调(SFT)、直接偏好优化(DPO-P)和强化学习(GRPO)。我们的最终模型成功恢复了基线模型性能的约90%,同时提供了高达50%的推理速度提升。这种方法展示了为代表性较少的语言创建语言模型的有效途径,在降低推理部署成本的同时保持了原始模型的质量。
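The logit-based distillation objective is, in its standard form, a forward KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. A stdlib sketch of that standard loss (the exact objective in the NeMo pipeline may differ):

```python
import math

def softmax(logits, temp=1.0):
    z = max(logits)                                   # for numerical stability
    exps = [math.exp((l - z) / temp) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temp=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by temp**2 as in standard logit distillation."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temp * temp * kl

teacher = [3.0, 1.0, -2.0]
perfect = distillation_loss(teacher, teacher)   # student matches teacher
uniform = distillation_loss([0.0, 0.0, 0.0], teacher)
```

Minimizing this over the pruned student's logits is what recovers quality after the 33.4% parameter cut.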
cs.CL / 35 / 2603.11915
CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
CoMMET:大型语言模型在心智理论任务中的表现程度如何?
Abstract
Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
Chinese Translation
心智理论(Theory of Mind, ToM)——推理自己和他人心理状态的能力——是人类社会智能的基石。随着大型语言模型(Large Language Models, LLMs)在现实应用中的普及,验证它们在这一层次的社会推理能力对于实现有效和自然的互动至关重要。然而,现有的评估LLMs心智理论能力的基准有限;大多数仅依赖文本输入,并且狭窄地集中于与信念相关的任务。在本文中,我们提出了一个新的多模态基准数据集CoMMET(综合心理状态与道德评估任务),该数据集受到心智理论手册任务的启发。CoMMET通过涵盖更广泛的心理状态并引入多轮测试,扩展了评估的范围。据我们所知,这是第一个在多轮对话环境中评估心智理论的多模态数据集。通过对不同家族和规模的LLMs进行全面评估,我们分析了当前模型的优缺点,并确定了未来改进的方向。我们的工作提供了对现代LLMs社会认知能力的更深入理解。
cs.CL / 36 / 2603.11955
PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents
PersonaTrace:利用大型语言模型代理合成真实的数字足迹
Abstract
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, reminders, etc. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
Chinese Translation
数字足迹(个人与数字系统交互的记录)对于研究行为、开发个性化应用和训练机器学习模型至关重要。然而,该领域的研究常常受到多样性和可获取数据稀缺的限制。为了解决这一问题,我们提出了一种利用大型语言模型(LLM)代理合成真实数字足迹的新方法。我们的方案从一个结构化的用户档案出发,生成多样且合理的用户事件序列,最终产生相应的数字文档,如电子邮件、消息、日历条目、提醒等。内在评估结果表明,生成的数据集在多样性和真实性上优于现有基准。此外,在真实世界的分布外任务评估中,基于我们的合成数据微调的模型表现优于那些在其他合成数据集上训练的模型。
cs.CL / 37 / 2603.11957
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
CHiL(L)Grader:校准的人机协作短答案评分系统
Abstract
Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
Chinese Translation
利用大型语言模型扩展教育评估不仅需要准确性,还需要能够识别预测何时可信。经过指令调优的模型往往过于自信,且随着课程的演变,其可靠性下降,使得在高风险环境中完全自主的部署变得不安全。我们提出了CHiL(L)Grader,这是第一个将校准的置信度估计纳入人机协作工作流程的自动评分框架。通过后期温度缩放、基于置信度的选择性预测和持续学习,CHiL(L)Grader仅自动评分高置信度的预测,而将不确定的案例转交给人工评分,并适应不断演变的评分标准和未见过的问题。在三个短答案评分数据集上,CHiL(L)Grader自动对35-65%的回答进行专家级质量评分(QWK >= 0.80)。接受和拒绝预测之间的QWK差距为0.347,确认了基于置信度的路由的有效性。每个纠正周期都增强了模型的评分能力,因为它从教师反馈中学习。这些结果表明,不确定性量化是可靠的AI辅助评分的关键。
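The routing logic described above can be sketched in a few lines: apply post-hoc temperature scaling to the grader's logits, take the max softmax probability as confidence, and auto-accept only above a threshold. The temperature and threshold values below are illustrative, not the paper's fitted ones:

```python
import math

def scaled_confidence(logits, temp):
    """Max softmax probability after post-hoc temperature scaling."""
    z = max(logits)
    exps = [math.exp((l - z) / temp) for l in logits]
    return max(exps) / sum(exps)

def route(logits, temp=2.0, threshold=0.9):
    """Auto-accept only well-calibrated, high-confidence grades;
    everything else is routed to a human grader."""
    conf = scaled_confidence(logits, temp)
    return ("auto", conf) if conf >= threshold else ("human", conf)

decision_easy, _ = route([8.0, 0.5, 0.2])    # peaked logits over grade levels
decision_hard, _ = route([1.2, 1.0, 0.9])    # nearly flat logits
```

In the full system, the temperature is fitted on held-out graded data and the threshold trades automation rate against the QWK of accepted predictions.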
cs.CL / 38 / 2603.11991
BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs
BTZSC:跨交叉编码器、嵌入模型、重排序器和大型语言模型的零样本文本分类基准
Abstract
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families (NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs), encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
Chinese Translation
零样本文本分类(ZSC)有望通过直接将文本与人类可读的标签描述匹配,从而消除昂贵的任务特定注释。尽管早期的方法主要依赖于为自然语言推理(NLI)微调的交叉编码器模型,但最近在文本嵌入模型、重排序器和指令调优的大型语言模型(LLMs)方面的进展挑战了基于NLI架构的主导地位。然而,系统地比较这些不同的方法仍然困难。现有的评估,如MTEB,通常通过监督探测或微调纳入标记示例,导致真正的零样本能力未得到充分探索。为了解决这一问题,我们引入了BTZSC,这是一个涵盖情感、主题、意图和情感分类的22个公共数据集的综合基准,捕捉了多样的领域、类别基数和文档长度。利用BTZSC,我们对四大模型家族进行了系统比较,包括NLI交叉编码器、嵌入模型、重排序器和指令调优的LLMs,涵盖了38个公共和自定义检查点。我们的结果显示:(i)现代重排序器,如Qwen3-Reranker-8B,设定了新的最先进水平,宏观F1达到0.72;(ii)强大的嵌入模型,如GTE-large-en-v1.5,显著缩小了准确性差距,同时在准确性和延迟之间提供了最佳权衡;(iii)参数在4到12B之间的指令调优LLMs实现了竞争力的表现(宏观F1高达0.67),在主题分类上表现尤为出色,但落后于专业重排序器;(iv)NLI交叉编码器即使在主干模型尺寸增加时也达到饱和;(v)扩展主要使重排序器和LLMs受益,而非嵌入模型。BTZSC及其评估代码已公开发布,以支持零样本文本理解的公平和可重复进展。
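Embedding-based zero-shot classification, as benchmarked here, reduces to nearest-label retrieval: embed the document and each label description with the same encoder, then pick the label with the highest cosine similarity. The 3-d vectors below are toy stand-ins for real encoder outputs:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_classify(text_vec, label_vecs):
    """Return the label whose description embedding is closest to the text."""
    return max(label_vecs, key=lambda label: cosine(text_vec, label_vecs[label]))

label_vecs = {  # stand-ins for encoded label descriptions
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.1],
}
prediction = zero_shot_classify([0.8, 0.2, 0.1], label_vecs)
```

Rerankers and cross-encoders instead score each (text, label description) pair jointly, which is more accurate but requires one forward pass per label, explaining the accuracy/latency trade-off the benchmark measures.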
cs.CL / 39 / 2603.12021
Just Use XML: Revisiting Joint Translation and Label Projection
仅使用 XML:重新审视联合翻译与标签投影
Abstract
Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +39.9 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.
Chinese Translation
标签投影是一种有效的跨语言迁移技术,它将跨度注释的数据集从高资源语言扩展到低资源语言。大多数方法将标签投影作为机器翻译之后的一个独立步骤进行,而之前的研究表明将两者结合会导致翻译质量下降。我们通过 LabelPigeon 这一新颖框架重新评估了这一说法,该框架通过 XML 标签联合执行翻译和标签投影。我们设计了一种直接评估标签投影的方案,发现 LabelPigeon 在 11 种语言中优于基线,并积极提高翻译质量。我们进一步评估了 203 种语言的翻译质量及其不同的注释复杂性,发现由于额外的微调而导致的一致性改进。最后,在 27 种语言和三个下游任务中,我们报告了相较于可比工作的跨语言迁移显著提升,NER 上的 F1 值提高了高达 +39.9。总体而言,我们的结果表明,带有 XML 标签的标签投影在不影响翻译质量的情况下提供了有效且高效的标签迁移。
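The XML-tag mechanism can be sketched end to end: wrap each labeled span in a tag before translation, then parse the tags back out of the translated output to project the labels. The translated sentence below is hand-written to stand in for the joint model's output:

```python
import re

def tag_spans(text, spans):
    """Wrap labeled character spans in XML tags (spans must not overlap)."""
    out, prev = [], 0
    for start, end, label in sorted(spans):
        out += [text[prev:start], f"<{label}>", text[start:end], f"</{label}>"]
        prev = end
    out.append(text[prev:])
    return "".join(out)

def extract_spans(tagged):
    """Recover (label, surface string) pairs from tagged output."""
    return [(m.group(1), m.group(2))
            for m in re.finditer(r"<(\w+)>(.*?)</\1>", tagged)]

source = "Alice visited Paris"
tagged = tag_spans(source, [(0, 5, "PER"), (14, 19, "LOC")])
# A joint model translates `tagged` while keeping the tags in place;
# the hand-written output below stands in for that translation step.
translated = "<PER>Alice</PER> a visité <LOC>Paris</LOC>"
projected = extract_spans(translated)
```

The framework's contribution is fine-tuning the translator so the tags survive translation intact, which is where naive tag insertion usually degrades output quality.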
cs.CL / 40 / 2603.12050
Translationese as a Rational Response to Translation Task Difficulty
翻译语言作为对翻译任务难度的理性回应
Abstract
Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.
Chinese Translation
翻译文本系统性地偏离了原本在目标语言中产生的文本,这一现象被广泛称为翻译语言(translationese)。翻译语言被归因于生产倾向(例如干扰、简化)、社会文化变量和语言对效应,但仍缺乏统一的解释性说明。我们提出翻译语言反映了翻译任务本身固有的认知负荷。我们测试可观察的翻译语言是否可以通过可量化的翻译任务难度指标进行预测。翻译语言被操作化为由自动分类器生成的段落级翻译度评分。翻译任务难度被概念化为包括源文本和跨语言转移成分,主要通过基于大语言模型(LLM)惊讶度的信息论指标进行操作,同时辅以已建立的句法和语义替代方案。我们使用了一个双向英语-德语语料库,包括书面和口语子语料库。结果表明,翻译语言在一定程度上可以通过翻译任务难度来解释,尤其是在英语到德语的翻译中。在大多数实验中,跨语言转移难度的贡献大于源文本复杂性。信息论指标在书面模式下与传统特征相匹配或表现更优,但在口语模式下没有优势。源文本的句法复杂性和翻译解决方案的熵成为跨语言对和模式中翻译语言的最强预测因素。
cs.CL / 41 / 2603.12105
To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times
超越文字:探究大型语言模型在句子层面上对可记忆性和阅读时间的心理语言学规范的评估
Abstract
Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.
Chinese Translation
大型语言模型(LLMs)最近被证明能够为单词和多词表达生成心理语言学规范的估计,例如情感、唤醒度或具体性,这些估计与人类判断相关。这些估计是通过以零样本方式提示LLM,提出与人类研究中使用的问题相似的问题而获得的。同时,对于其他规范,如词汇决策时间或习得年龄,LLMs需要监督微调才能获得与真实值一致的结果。在本文中,我们将这种方法扩展到之前未研究的句子可记忆性和阅读时间特征,这涉及句子层面上下文中多个单词之间的关系。我们的结果表明,通过微调,模型能够提供与人类派生规范相关的估计,并超越可解释基线预测器的预测能力,证明LLMs包含关于句子层面特征的有用信息。同时,我们的结果显示出非常混合的零样本和少样本表现,进一步证明在使用LLM提示作为人类认知测量的代理时需要谨慎。
cs.CL / 42 / 2603.12117
SommBench: Assessing Sommelier Expertise of Language Models
SommBench:评估语言模型的侍酒师专业知识
Abstract
With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory question-answering questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at https://github.com/sommify/sommbench.
Chinese Translation
随着大型语言模型的快速发展,系统性评估其多语言和多文化能力变得越来越重要。以往的文化评估基准主要关注可以用语言形式编码的基本文化知识。在此,我们提出了SommBench,一个多语言基准,用于评估侍酒师专业知识,这一领域深深植根于嗅觉和味觉的感知。虽然语言模型仅通过文本描述来学习感官属性,但SommBench测试这种文本基础是否足以模拟专家级的感官判断。SommBench包括三个主要任务:葡萄酒理论问答(Wine Theory Question Answering,WTQA)、葡萄酒特征补全(Wine Feature Completion,WFC)和食物与葡萄酒配对(Food-Wine Pairing,FWP)。SommBench支持多种语言:英语、斯洛伐克语、瑞典语、芬兰语、德语、丹麦语、意大利语和西班牙语。这有助于将语言模型的葡萄酒专业知识与其语言能力区分开来。基准数据集是在与专业侍酒师及相关语言的母语者密切合作下开发的,结果包括1,024个葡萄酒理论问答问题、1,000个葡萄酒特征补全示例和1,000个食物与葡萄酒配对示例。我们提供了最流行语言模型的结果,包括封闭权重模型如Gemini 2.5,以及开放权重模型如GPT-OSS和Qwen 3。我们的结果显示,最强大的模型在葡萄酒理论问答上表现良好(封闭权重模型的正确率高达97%),然而在特征补全(最高达到65%)和食物与葡萄酒配对(MCC范围在0到0.39之间)上则显得更具挑战性。这些结果使SommBench成为评估语言模型侍酒师专业知识的有趣且具有挑战性的基准。该基准公开可用,网址为https://github.com/sommify/sommbench。
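The food-wine pairing results above are reported as Matthews correlation coefficients (MCC). As a reference for interpreting those 0 to 0.39 scores, here is a minimal binary MCC sketch; the benchmark's pairing task may be multi-class, and the confusion counts below are purely illustrative, not from SommBench:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for binary classification.

    Returns 0.0 when any marginal count is empty (the common fallback).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Perfect agreement gives 1.0; a balanced random guesser lands at 0.0.
print(round(mcc(tp=50, tn=50, fp=0, fn=0), 2))    # 1.0
print(round(mcc(tp=25, tn=25, fp=25, fn=25), 2))  # 0.0
```

Unlike raw accuracy, MCC stays near zero for chance-level pairing even on imbalanced label distributions, which is presumably why it is the reported metric for FWP.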
cs.CL / 43 / 2603.12123
Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions
跨上下文审查:通过分离生成和审查会话来提高大型语言模型输出质量
Abstract
Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.
Chinese Translation
大型语言模型在同一会话中审查其自身输出时,难以捕捉到错误。本文提出了跨上下文审查(Cross-Context Review, CCR),这是一种简单的方法,其中审查在一个新的会话中进行,且无法访问生成对话的历史记录。我们进行了一个对照实验:测试了30个工件(代码、技术文档、演示稿),注入了150个错误,分别在四种审查条件下进行测试——同会话自我审查(Self-Review, SR)、重复自我审查(Self-Review, SR2)、上下文感知子代理审查(Subagent Review, SA)和跨上下文审查(CCR)。在360次审查中,CCR达到了28.6%的F1值,优于SR(24.6%,p=0.008,d=0.52)、SR2(21.7%,p<0.001,d=0.72)和SA(23.8%,p=0.004,d=0.57)。SR2的结果在解释上尤为重要:在同一会话中审查两次并未优于审查一次(p=0.11),这排除了重复作为CCR优势的解释。其益处来自于上下文的分离。CCR适用于任何模型,无需基础设施,仅需额外一个会话的成本。
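The CCR comparison above rests on F1 scores and Cohen's d effect sizes. A minimal sketch of both statistics using their standard definitions; the paper's exact per-review scoring protocol is not given in the abstract, and the example counts here are illustrative:

```python
import math
import statistics

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall over detected errors."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cohens_d(a, b):
    """Cohen's d between two groups of per-review scores (pooled-variance form)."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * statistics.variance(a)
                        + (nb - 1) * statistics.variance(b)) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled
```

By the usual rule of thumb, the reported d values around 0.5 to 0.7 correspond to medium effects, which is consistent with the modest absolute F1 gap between CCR and the baselines.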
cs.CL / 44 / 2603.12152
LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation
LifeSim:用于个性化助手评估的长时间用户生活模拟器
Abstract
The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven interactive user behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.
Chinese Translation
大型语言模型(LLMs)的快速发展加速了通用人工智能助手的进步。然而,现有的个性化助手基准与现实世界的用户-助手交互存在不匹配,未能捕捉外部环境和用户认知状态的复杂性。为了解决这一问题,我们提出了LifeSim,一个通过信念-欲望-意图(Belief-Desire-Intention, BDI)模型在物理环境中模拟用户认知的用户模拟器,以生成连贯的生活轨迹,并模拟以意图驱动的用户交互行为。基于LifeSim,我们引入了LifeSim-Eval,一个用于多场景、长时间个性化辅助的综合基准。LifeSim-Eval涵盖了8个生活领域和1200个多样化场景,并采用多轮交互方法来评估模型完成显性和隐性意图、恢复用户档案以及生成高质量响应的能力。在单场景和长时间设置下,我们的实验表明,当前的LLMs在处理隐性意图和长期用户偏好建模方面面临重大限制。
cs.CL / 45 / 2603.12165
QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions
QAQ:用于选择高质量合成代码指令的双向语义一致性
Abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
Chinese Translation
合成数据已成为训练代码生成模型的重要组成部分,但它引入了显著的噪声和幻觉,这些问题在当前的度量标准下难以检测。现有的数据选择方法,如指令遵循难度(Instruction-Following Difficulty, IFD),通常评估模型在给定查询($A|Q$)时生成答案的难度。然而,在噪声合成数据中,这一度量标准存在模糊性,低概率可能无法区分内在任务复杂性与模型生成的幻觉。在此,我们提出了QAQ,一种新颖的数据选择框架,从反向的角度评估数据质量:答案能多好地预测查询($Q|A$)?我们定义了反向互信息(Reverse Mutual Information, RMI)来量化在给定答案的条件下对查询的信息增益。我们的分析表明,RMI的两个极端都暗示了质量问题:低RMI表示语义不一致,而过高的RMI可能包含LLMs容易识别的缺陷模式。此外,我们引入了一种基于强模型与弱模型之间不一致的选择策略,以识别有效但具有挑战性的样本。在WarriorCoder数据集上的实验表明,仅使用分层RMI选择25%的数据即可实现与全数据训练相当的性能,显著优于现有的数据选择方法。我们的方法强调了合成数据策划中双向语义一致性的重要性,为在不牺牲模型能力的情况下降低计算成本提供了一条可扩展的途径。
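One plausible reading of the RMI definition above (an assumption on our part, since the abstract does not give the exact estimator) is the average per-token gain in query log-likelihood from conditioning on the answer, scored by an external language model, followed by a band filter that drops both extremes:

```python
def reverse_mutual_info(logp_q_given_a, logp_q):
    """Average per-token information gain about query Q from answer A.

    Both inputs are per-token log-probabilities of the query tokens, scored
    with and without the answer in context. Near-zero values suggest semantic
    misalignment; very large values may flag trivially recognizable defects.
    """
    assert len(logp_q_given_a) == len(logp_q)
    return (sum(logp_q_given_a) - sum(logp_q)) / len(logp_q)

def middle_band(samples, low, high):
    """Keep only samples whose RMI falls between the two extremes.

    samples: list of (example, rmi_score) pairs; thresholds are illustrative.
    """
    return [s for s, r in samples if low <= r <= high]
```

The threshold choice here is a stand-in; the paper's stratified selection additionally uses strong-weak model disagreement to keep valid-but-challenging examples.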
cs.CL / 46 / 2603.12180
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections
战略导航还是随机搜索?代理与人类如何推理文档集合
Abstract
Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
Chinese Translation
多模态代理为自动化复杂的文档密集型工作流程提供了一条有前景的路径。然而,一个关键问题仍然存在:这些代理是否展现出真正的战略推理,还是仅仅依赖随机的试错搜索?为了解决这个问题,我们引入了MADQA,这是一个基于800个异构PDF文档的2250个由人类撰写的问题基准。我们根据经典测验理论进行设计,以最大化在不同代理能力水平下的区分能力。为了评估代理行为,我们引入了一种新颖的评估协议,测量准确性与努力之间的权衡。利用这一框架,我们展示了尽管最佳代理在原始准确性上可以与人类搜索者相匹配,但它们成功解答的问题在很大程度上与人类不同,并依赖于蛮力搜索来弥补战略规划的不足。它们未能缩小与理想表现之间近20%的差距,仍然在无效循环中徘徊。我们发布了数据集和评估工具,以帮助促进从蛮力检索到经过校准的高效推理的过渡。
cs.CL / 47 / 2603.12191
Long-Context Encoder Models for Polish Language Understanding
用于波兰语理解的长上下文编码器模型
Abstract
While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
Chinese Translation
尽管仅解码的大型语言模型(LLMs)最近在自然语言处理(NLP)领域占据主导地位,但仅编码的架构仍然是用于区分任务的经济高效和参数高效的标准。然而,经典编码器如BERT受到短上下文窗口的限制,这不足以处理长文档。在本文中,我们通过引入一种高质量的波兰语模型来解决这一限制,该模型能够处理长达8192个标记的序列。该模型的开发采用了两阶段的训练程序,包括位置嵌入适配和全参数连续预训练。此外,我们提出了通过知识蒸馏训练的压缩模型变体。这些模型在25个任务上进行了评估,包括KLEJ基准、新引入的金融任务套件(FinBench)以及其他分类和回归任务,特别是那些需要长文档理解的任务。结果表明,我们的模型在波兰语和多语言模型中实现了最佳的平均性能,在长上下文任务中显著优于竞争解决方案,同时在短文本上的质量保持了可比性。
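The abstract names positional-embedding adaptation as the first training stage. One common technique for this (a sketch under that assumption; the paper's exact scheme is not described in the abstract) is to linearly interpolate the learned absolute position embeddings from the old context length to the new one before continued pre-training:

```python
def interpolate_positions(old_emb, new_len):
    """Linearly interpolate learned absolute position embeddings to a longer
    context window.

    old_emb: list of d-dimensional vectors (list of lists), one per position.
    Returns new_len vectors; each new position maps proportionally into the
    old index range and blends its two nearest old embeddings.
    """
    old_len = len(old_emb)
    assert old_len >= 2 and new_len >= 2
    dim = len(old_emb[0])
    out = []
    for i in range(new_len):
        x = i * (old_len - 1) / (new_len - 1)  # position in old index space
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        frac = x - lo
        out.append([(1 - frac) * old_emb[lo][d] + frac * old_emb[hi][d]
                    for d in range(dim)])
    return out
```

For a 512-to-8192 extension this stretches the position table sixteen-fold; continued pre-training then lets the model adjust to the denser spacing.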
cs.CL / 48 / 2603.12201
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
IndexCache:通过跨层索引重用加速稀疏注意力
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
Chinese Translation
长上下文的代理工作流已成为大型语言模型的一个重要应用场景,使得注意力效率对推理速度和服务成本至关重要。稀疏注意力有效地解决了这一挑战,而DeepSeek稀疏注意力(DeepSeek Sparse Attention, DSA)是一个具有代表性的生产级解决方案:一个轻量级的闪电索引器为每个查询选择前k个最相关的标记,将核心注意力从 $O(L^2)$ 降低到 $O(Lk)$。然而,索引器本身仍然保持 $O(L^2)$ 的复杂度,并且必须在每一层独立运行,尽管结果的前k个选择在连续层之间高度相似。我们提出了IndexCache,它通过将层划分为一小组运行自身索引器的完整层和大多数简单重用最近完整层的前k个索引的共享层,利用了这种跨层冗余。我们提出了两种互补的方法来确定和优化这一配置。无训练的IndexCache应用了一种贪心搜索算法,通过直接最小化校准集上的语言建模损失来选择保留索引器的层,不需要权重更新。考虑训练的IndexCache引入了一种多层蒸馏损失,训练每个保留的索引器与其服务的所有层的平均注意力分布进行对比,使得即使是简单的交错模式也能匹配完整索引器的准确性。在一个30B DSA模型上的实验结果表明,IndexCache可以去除75%的索引器计算,且质量下降微乎其微,与标准DSA相比,实现了高达1.82$\times$的预填充加速和1.48$\times$的解码加速。这些积极的结果在我们的生产规模GLM-5模型的初步实验中得到了进一步验证(见图1)。
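The Full/Shared partition described above can be sketched as follows: each Shared layer reuses the top-k indices of its nearest Full layer, so only Full layers pay the indexer's O(L^2) cost. Layer counts and the score lists here are illustrative stand-ins; the real system works on lightning-indexer logits, not plain Python lists:

```python
def assign_indexers(num_layers, full_layers):
    """Map every layer to the Full layer whose top-k indices it reuses.

    full_layers: sorted indices of layers that keep their own indexer.
    A Shared layer picks the nearest Full layer (lower index on ties).
    """
    return {layer: min(full_layers, key=lambda f: (abs(f - layer), f))
            for layer in range(num_layers)}

def sparse_attention_step(layer, indexer_scores, k, mapping, cache):
    """Return the top-k token indices for one layer, reusing cached results.

    indexer_scores: per-Full-layer relevance scores over the L context tokens.
    Only the first visit to a Full layer runs its indexer; Shared layers hit
    the cache, which is where the claimed 75% indexer savings come from.
    """
    src = mapping[layer]
    if src not in cache:
        scores = indexer_scores[src]
        cache[src] = sorted(range(len(scores)),
                            key=scores.__getitem__, reverse=True)[:k]
    return cache[src]
```

With one Full layer per four layers (the 75% setting), three quarters of the per-layer indexer passes collapse into dictionary lookups.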
cs.CL / 49 / 2603.12206
CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks
CLASP:防御混合大型语言模型对隐藏状态中毒攻击的威胁
Abstract
State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
Chinese Translation
状态空间模型(SSMs)如Mamba作为高效的Transformer替代方案,已获得显著关注,能够实现线性复杂度并保持竞争力的性能。然而,隐藏状态中毒攻击(HiSPAs)是一种最近发现的漏洞,通过对抗字符串破坏SSM的内存,对这些架构及其混合变体构成了严重威胁。我们将HiSPA缓解任务框定为一个基于令牌的二元分类问题,提出了CLASP模型以应对这一威胁。CLASP利用Mamba的块输出嵌入(BOEs)中的独特模式,并使用XGBoost分类器以最小的计算开销识别恶意令牌。我们考虑了一个现实场景,其中SSMs和HiSPAs可能同时被使用:一个大型语言模型(LLM)筛选简历以识别最佳候选人。在对2483份简历(总计9.5M令牌)进行控制注入的语料库上进行评估时,CLASP在恶意令牌检测上实现了95.9%的令牌级F1分数和99.3%的文档级F1分数。重要的是,该模型能够推广到未见过的攻击模式:在留一交叉验证下,性能保持高水平(96.9%文档级F1),而在具有结构新颖触发器的聚类交叉验证下,仍保持有用的检测能力(91.6%平均文档级F1)。CLASP独立于任何下游模型,处理速度为每秒1032个令牌,且VRAM消耗低于4GB,这使其有可能作为轻量级前线防御适用于基于SSM和混合架构的实际部署。所有代码和详细结果可在https://anonymous.4open.science/r/hispikes-91C0获取。
cs.CL / 50 / 2603.12226
Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration
通过大型语言模型驱动的跨学科启发激发科学创造力
Abstract
Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
Chinese Translation
尽管跨学科研究能够带来更大且长期的影响,但大多数工作仍局限于单一领域的学术孤岛。最近基于人工智能的科学发现方法在跨学科研究中展现出潜力,但许多方法优先考虑快速设计实验和解决方案,忽视了推动创造性跨学科突破的探索性、协作性推理过程。因此,之前的努力主要集中在自动化科学发现,而非增强科学颠覆背后的推理过程。我们提出了Idea-Catalyst,一个新颖的框架,系统性地识别跨学科见解,以支持人类和大型语言模型的创造性推理。Idea-Catalyst从一个抽象的研究目标出发,旨在辅助头脑风暴阶段,明确避免对特定解决方案的过早固定。该框架体现了跨学科推理的关键元认知特征:(a) 定义和评估研究目标,(b) 了解某一领域的机会和未解决的挑战,以及(c) 基于影响潜力的跨学科思想的战略探索。具体而言,Idea-Catalyst将一个抽象目标(例如,改善人类与人工智能的协作)分解为核心目标领域的研究问题,从而指导对该领域内进展和开放挑战的分析。这些挑战被重新表述为领域无关的概念问题,使得能够从外部学科(例如,心理学、社会学)中检索到解决类似问题的见解。通过将这些领域的见解综合并重新语境化回目标领域,Idea-Catalyst根据其跨学科潜力对源领域进行排名。从实证上看,这种有针对性的整合使得平均新颖性提高了21%,洞察力提高了16%,同时仍然扎根于原始研究问题。
cs.CL / 51 / 2603.12249
SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning
SciMDR:科学多模态文档推理的基准测试与进展
Abstract
Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.
Chinese Translation
构建用于基础模型训练的科学多模态文档推理数据集涉及规模、真实性和现实性之间的固有权衡。为了解决这一挑战,我们提出了合成与再定位框架,这是一个由两个阶段组成的流程: (1) 以主张为中心的问答合成(Claim-Centric QA Synthesis),生成真实、独立的问答对并在聚焦段落上进行推理; (2) 文档级再定位(Document-Scale Regrounding),程序性地将这些对重新嵌入到完整文档任务中,以确保现实的复杂性。利用该框架,我们构建了SciMDR,这是一个用于跨模态理解的大规模训练数据集,包含30万个具有明确推理链的问答对,覆盖2万篇科学论文。此外,我们还构建了SciMDR-Eval,这是一个专家注释的基准,用于评估完整科学工作流程中的多模态理解。实验表明,在SciMDR上微调的模型在多个科学问答基准上取得了显著改善,特别是在那些需要复杂文档级推理的任务中。