cs.RO / 1 / 2602.15060
CLOT: Closed-Loop Global Motion Tracking for Whole-Body Humanoid Teleoperation
CLOT:用于全身类人机器人遥操作的闭环全局运动跟踪
Abstract
Long-horizon whole-body humanoid teleoperation remains challenging due to accumulated global pose drift, particularly on full-sized humanoids. Although recent learning-based tracking methods enable agile and coordinated motions, they typically operate in the robot's local frame and neglect global pose feedback, leading to drift and instability during extended execution. In this work, we present CLOT, a real-time whole-body humanoid teleoperation system that achieves closed-loop global motion tracking via high-frequency localization feedback. CLOT synchronizes operator and robot poses in a closed loop, enabling drift-free human-to-humanoid mimicry over long time horizons. However, directly imposing global tracking rewards in reinforcement learning often results in aggressive and brittle corrections. To address this, we propose a data-driven randomization strategy that decouples observation trajectories from reward evaluation, enabling smooth and stable global corrections. We further regularize the policy with an adversarial motion prior to suppress unnatural behaviors. To support CLOT, we collect 20 hours of carefully curated human motion data for training the humanoid teleoperation policy. We design a transformer-based policy and train it for over 1300 GPU hours. The policy is deployed on a full-sized humanoid with 31 DoF (excluding hands). Both simulation and real-world experiments verify high-dynamic motion, high-precision tracking, and strong robustness in sim-to-real humanoid teleoperation. Motion data, demos, and code can be found on our website.
Chinese Translation
由于全局姿态漂移的累积,长时程全身类人机器人遥操作仍然面临挑战,尤其是在全尺寸类人机器人上。尽管最近的基于学习的跟踪方法能够实现灵活和协调的运动,但它们通常在机器人的局部坐标系中运行,忽视了全局姿态反馈,导致在长时间执行过程中出现漂移和不稳定。在本研究中,我们提出了CLOT,一个实时的全身类人机器人遥操作系统,通过高频率的定位反馈实现闭环全局运动跟踪。CLOT在闭环中同步操作员和机器人姿态,使得人类与类人机器人之间的模仿在长时间范围内无漂移。然而,直接在强化学习中施加全局跟踪奖励,往往会导致激进和脆弱的修正。为了解决这个问题,我们提出了一种数据驱动的随机化策略,将观察轨迹与奖励评估解耦,从而实现平滑和稳定的全局修正。我们进一步通过对抗性运动先验来正则化策略,以抑制不自然的行为。为了支持CLOT,我们收集了20小时精心策划的人类运动数据,用于训练类人机器人遥操作策略。我们设计了一种基于变换器的策略,并对其进行了超过1300 GPU小时的训练。该策略在一个具有31个自由度(不包括手部)的全尺寸类人机器人上部署。仿真和现实世界实验验证了在类人机器人遥操作中的高动态运动、高精度跟踪和强鲁棒性。运动数据、演示和代码可以在我们的网站上找到。
cs.RO / 2 / 2602.15061
Safe-SDL: Establishing Safety Boundaries and Control Mechanisms for AI-Driven Self-Driving Laboratories
安全自驾实验室(Safe-SDL):为人工智能驱动的自驾实验室建立安全边界和控制机制
Abstract
The emergence of Self-Driving Laboratories (SDLs) transforms scientific discovery methodology by integrating AI with robotic automation to create closed-loop experimental systems capable of autonomous hypothesis generation, experimentation, and analysis. While promising to compress research timelines from years to weeks, their deployment introduces unprecedented safety challenges differing from traditional laboratories or purely digital AI. This paper presents Safe-SDL, a comprehensive framework for establishing robust safety boundaries and control mechanisms in AI-driven autonomous laboratories. We identify and analyze the critical "Syntax-to-Safety Gap" -- the disconnect between AI-generated syntactically correct commands and their physical safety implications -- as the central challenge in SDL deployment. Our framework addresses this gap through three synergistic components: (1) formally defined Operational Design Domains (ODDs) that constrain system behavior within mathematically verified boundaries, (2) Control Barrier Functions (CBFs) that provide real-time safety guarantees through continuous state-space monitoring, and (3) a novel Transactional Safety Protocol (CRUTD) that ensures atomic consistency between digital planning and physical execution. We ground our theoretical contributions through analysis of existing implementations including UniLabOS and the Osprey architecture, demonstrating how these systems instantiate key safety principles. Evaluation against the LabSafety Bench reveals that current foundation models exhibit significant safety failures, demonstrating that architectural safety mechanisms are essential rather than optional. Our framework provides both theoretical foundations and practical implementation guidance for safe deployment of autonomous scientific systems, establishing the groundwork for responsible acceleration of AI-driven discovery.
Chinese Translation
自驾实验室(Self-Driving Laboratories, SDLs)的出现通过将人工智能与机器人自动化相结合,改变了科学发现的方法论,创造了能够自主生成假设、进行实验和分析的闭环实验系统。尽管其承诺将研究时间从数年压缩至数周,但其部署引入了与传统实验室或纯数字人工智能不同的前所未有的安全挑战。本文提出了安全自驾实验室(Safe-SDL),这是一个为人工智能驱动的自主实验室建立稳健安全边界和控制机制的综合框架。我们识别并分析了关键的“语法到安全的差距”(Syntax-to-Safety Gap)——即人工智能生成的语法正确命令与其物理安全含义之间的脱节——作为SDL部署中的核心挑战。我们的框架通过三个协同组件来解决这一差距:(1) 正式定义的操作设计领域(Operational Design Domains, ODDs),在数学验证的边界内约束系统行为;(2) 控制障碍函数(Control Barrier Functions, CBFs),通过持续的状态空间监控提供实时安全保障;(3) 一种新颖的事务安全协议(Transactional Safety Protocol, CRUTD),确保数字规划与物理执行之间的原子一致性。我们通过对现有实现的分析,包括UniLabOS和Osprey架构,来支撑我们的理论贡献,展示这些系统如何体现关键的安全原则。针对LabSafety Bench的评估表明,当前的基础模型表现出显著的安全失败,证明了架构安全机制是必不可少的,而非可选的。我们的框架为安全部署自主科学系统提供了理论基础和实践实施指导,为负责任地加速人工智能驱动的发现奠定了基础。
cs.RO / 3 / 2602.15063
How Do We Research Human-Robot Interaction in the Age of Large Language Models? A Systematic Review
在大型语言模型时代,我们如何研究人机交互?一项系统性综述
Abstract
Advances in large language models (LLMs) are profoundly reshaping the field of human-robot interaction (HRI). While prior work has highlighted the technical potential of LLMs, few studies have systematically examined their human-centered impact (e.g., human-oriented understanding, user modeling, and levels of autonomy), making it difficult to consolidate emerging challenges in LLM-driven HRI systems. Therefore, we conducted a systematic literature search following the PRISMA guideline, identifying 86 articles that met our inclusion criteria. Our findings reveal that: (1) LLMs are transforming the fundamentals of HRI by reshaping how robots sense context, generate socially grounded interactions, and maintain continuous alignment with human needs in embodied settings; and (2) current research is largely exploratory, with different studies focusing on different facets of LLM-driven HRI, resulting in wide-ranging choices of experimental setups, study methods, and evaluation metrics. Finally, we identify key design considerations and challenges, offering a coherent overview and guidelines for future research at the intersection of LLMs and HRI.
Chinese Translation
大型语言模型(LLMs)的进步正在深刻重塑人机交互(HRI)领域。尽管先前的研究强调了LLMs的技术潜力,但很少有研究系统性地考察其以人为中心的影响(例如,以人为本的理解、用户建模和自主性水平),这使得整合LLM驱动的HRI系统中出现的新挑战变得困难。因此,我们遵循PRISMA指南进行了系统的文献检索,识别出86篇符合我们纳入标准的文章。我们的研究结果显示:(1)LLMs正在通过重塑机器人感知上下文、生成社会化互动以及在具身环境中与人类需求保持持续一致性,转变HRI的基础;(2)当前的研究主要是探索性的,不同的研究集中在LLM驱动的HRI的不同方面,导致实验设置、研究方法和评估指标的选择差异很大。最后,我们识别出关键的设计考虑因素和挑战,为未来在LLMs与HRI交叉领域的研究提供了连贯的概述和指导。
cs.RO / 4 / 2602.15092
Augmenting Human Balance with Generic Supernumerary Robotic Limbs
通过通用超数机器人肢体增强人类平衡
Abstract
Supernumerary robotic limbs (SLs) have the potential to transform a wide range of human activities, yet their usability remains limited by key technical challenges, particularly in ensuring safety and achieving versatile control. Here, we address the critical problem of maintaining balance in the human-SLs system, a prerequisite for safe and comfortable augmentation tasks. Unlike previous approaches that developed SLs specifically for stability support, we propose a general framework for preserving balance with SLs designed for generic use. Our hierarchical three-layer architecture consists of: (i) a prediction layer that estimates human trunk and center of mass (CoM) dynamics, (ii) a planning layer that generates optimal CoM trajectories to counteract trunk movements and computes the corresponding SL control inputs, and (iii) a control layer that executes these inputs on the SL hardware. We evaluated the framework with ten participants performing forward and lateral bending tasks. The results show a clear reduction in stance instability, demonstrating the framework's effectiveness in enhancing balance. This work paves the way toward safe and versatile human-SL interaction. [This paper has been submitted for publication to IEEE.]
Chinese Translation
超数机器人肢体(SLs)有潜力改变广泛的人类活动,但其可用性仍受到关键技术挑战的限制,特别是在确保安全性和实现多功能控制方面。在此,我们解决了人类与SLs系统中维持平衡的关键问题,这是安全和舒适增强任务的前提。与以往专门为稳定性支持开发SLs的方法不同,我们提出了一个通用框架,用于利用设计为通用用途的SLs保持平衡。我们的分层三层架构包括:(i)预测层,用于估计人类躯干和质心(CoM)动态;(ii)规划层,生成最佳的CoM轨迹以抵消躯干运动,并计算相应的SL控制输入;(iii)控制层,在SL硬件上执行这些输入。我们通过十名参与者进行前屈和侧屈任务评估了该框架。结果显示站立不稳定性明显降低,证明了该框架在增强平衡方面的有效性。这项工作为安全和多功能的人类与SLs交互铺平了道路。[本文已提交至IEEE发表。]
cs.RO / 5 / 2602.15162
A ROS2 Benchmarking Framework for Hierarchical Control Strategies in Mobile Robots for Mediterranean Greenhouses
用于地中海温室移动机器人分层控制策略的ROS2基准测试框架
Abstract
Mobile robots operating in agroindustrial environments, such as Mediterranean greenhouses, are subject to challenging conditions, including uneven terrain, variable friction, payload changes, and terrain slopes, all of which significantly affect control performance and stability. Despite the increasing adoption of robotic platforms in agriculture, the lack of standardized, reproducible benchmarks impedes fair comparisons and systematic evaluations of control strategies under realistic operating conditions. This paper presents a comprehensive benchmarking framework for evaluating mobile robot controllers in greenhouse environments. The proposed framework integrates an accurate three-dimensional model of the environment, a physics-based simulator, and a hierarchical control architecture comprising low-, mid-, and high-level control layers. Three benchmark categories are defined to enable modular assessment, ranging from actuator-level control to full autonomous navigation. Additionally, three disturbance scenarios (payload variation, terrain type, and slope) are explicitly modeled to replicate real-world agricultural conditions. To ensure objective and reproducible evaluation, standardized performance metrics are introduced, including the Squared Absolute Error (SAE), the Squared Control Input (SCI), and composite performance indices. Statistical analysis based on repeated trials is employed to mitigate the influence of sensor noise and environmental variability. The framework is further enhanced by a plugin-based architecture that facilitates seamless integration of user-defined controllers and planners. The proposed benchmark provides a robust and extensible tool for the quantitative comparison of classical, predictive, and planning-based control strategies in realistic conditions, bridging the gap between simulation-based analysis and real-world agroindustrial applications.
Chinese Translation
在农业工业环境中运行的移动机器人,如地中海温室,面临着不平坦的地形、可变摩擦、载荷变化和地形坡度等挑战性条件,这些因素显著影响控制性能和稳定性。尽管农业中对机器人平台的采用日益增加,但缺乏标准化和可重复的基准测试妨碍了在现实操作条件下对控制策略的公平比较和系统评估。本文提出了一个全面的基准测试框架,用于评估温室环境中的移动机器人控制器。该框架集成了环境的精确三维模型、基于物理的模拟器以及包含低、中和高层控制层的分层控制架构。定义了三个基准类别,以实现模块化评估,从执行器级控制到完全自主导航。此外,明确建模了三种干扰场景:载荷变化、地形类型和坡度,以复制现实世界的农业条件。为了确保客观和可重复的评估,引入了标准化的性能指标,包括平方绝对误差(Squared Absolute Error, SAE)、平方控制输入(Squared Control Input, SCI)和复合性能指数。采用基于重复试验的统计分析来减轻传感器噪声和环境变异性的影响。该框架通过插件架构进一步增强,便于无缝集成用户定义的控制器和规划器。所提出的基准测试为在现实条件下对经典、预测和基于规划的控制策略进行定量比较提供了一个强大且可扩展的工具,弥合了基于模拟的分析与现实世界农业工业应用之间的差距。
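The standardized metrics named above admit a simple discrete-time reading. The definitions below are one plausible interpretation (time-integrals of squared tracking error and squared control effort, plus an illustrative weighted composite index), not the paper's exact formulas:

```python
# One plausible discrete-time reading of the benchmark metrics named in
# the abstract; the exact definitions and weights are the paper's, and
# the weights w_e, w_u below are illustrative assumptions.

def sae(errors, dt):
    """Squared Absolute Error: integral of the squared tracking error."""
    return sum(e * e for e in errors) * dt

def sci(inputs, dt):
    """Squared Control Input: integral of the squared control effort."""
    return sum(u * u for u in inputs) * dt

def composite_index(errors, inputs, dt, w_e=1.0, w_u=0.1):
    """Illustrative weighted composite performance index."""
    return w_e * sae(errors, dt) + w_u * sci(inputs, dt)

# Example trial: error and input samples logged at dt = 0.1 s.
errs = [0.2, 0.1, 0.05, 0.0]
us = [1.0, 0.5, 0.25, 0.0]
score = composite_index(errs, us, 0.1)
```

Averaging such scores over repeated trials, as the abstract describes, then gives the statistical comparison across controllers.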
cs.RO / 6 / 2602.15201
DexEvolve: Evolutionary Optimization for Robust and Diverse Dexterous Grasp Synthesis
DexEvolve:用于鲁棒且多样化灵巧抓取合成的进化优化
Abstract
Dexterous grasping is fundamental to robotics, yet data-driven grasp prediction heavily relies on large, diverse datasets that are costly to generate and typically limited to a narrow set of gripper morphologies. Analytical grasp synthesis can be used to scale data collection, but necessary simplifying assumptions often yield physically infeasible grasps that need to be filtered in high-fidelity simulators, significantly reducing the total number of grasps and their diversity. We propose a scalable generate-and-refine pipeline for synthesizing large-scale, diverse, and physically feasible grasps. Instead of using high-fidelity simulators solely for verification and filtering, we leverage them as an optimization stage that continuously improves grasp quality without discarding precomputed candidates. More specifically, we initialize an evolutionary search with a seed set of analytically generated, potentially suboptimal grasps. We then refine these proposals directly in a high-fidelity simulator (Isaac Sim) using an asynchronous, gradient-free evolutionary algorithm, improving stability while maintaining diversity. In addition, this refinement stage can be guided toward human preferences and/or domain-specific quality metrics without requiring a differentiable objective. We further distill the refined grasp distribution into a diffusion model for robust real-world deployment, and highlight the role of diversity for both effective training and during deployment. Experiments on a newly introduced Handles dataset and a DexGraspNet subset demonstrate that our approach achieves over 120 distinct stable grasps per object (a 1.7-6x improvement over unrefined analytical methods) while outperforming diffusion-based alternatives by 46-60% in unique grasp coverage.
Chinese Translation
灵巧抓取是机器人技术的基础,但基于数据的抓取预测在很大程度上依赖于大规模、多样化的数据集,而这些数据集生成成本高昂,且通常仅限于少数几种抓取器形态。分析性抓取合成可以用于扩展数据收集,但必要的简化假设往往会导致物理上不可行的抓取,这需要在高保真模拟器中进行过滤,从而显著减少抓取的总数及其多样性。我们提出了一种可扩展的生成与精炼管道,用于合成大规模、多样化和物理上可行的抓取。我们不仅将高保真模拟器用于验证和过滤,还将其作为一个优化阶段,持续改进抓取质量,而不丢弃预计算的候选抓取。更具体地说,我们以一组分析生成的、可能次优的抓取作为种子集初始化进化搜索。然后,我们使用异步、无梯度的进化算法在高保真模拟器(Isaac Sim)中直接精炼这些提案,提高稳定性,同时保持多样性。此外,这一精炼阶段可以在不需要可微分目标的情况下,朝向人类偏好和/或特定领域的质量指标进行引导。我们进一步将精炼后的抓取分布提炼为一个扩散模型,以便于在现实世界中的鲁棒部署,并强调多样性在有效训练和部署过程中的重要性。在新引入的Handles数据集和DexGraspNet子集上的实验表明,我们的方法每个物体实现了超过120个独特的稳定抓取(相比未精炼的分析方法提高了1.7-6倍),同时在独特抓取覆盖率上超越了基于扩散的替代方法46-60%。
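The refinement stage described above can be sketched as an elitist, gradient-free evolutionary loop over grasp parameters. Here a toy quadratic `fitness` stands in for a black-box Isaac Sim rollout score, and the asynchronous evaluation is omitted; all names and constants are illustrative, not DexEvolve's actual implementation:

```python
import random

def refine_grasp(seed, fitness, sigma=0.1, lam=8, iters=50, rng=None):
    """Elitist (1+lambda) evolutionary refinement of one grasp vector.

    `fitness` stands in for a black-box simulator rollout score, so no
    gradient of the objective is ever required (matching the abstract's
    gradient-free setting). Everything else is a toy assumption."""
    rng = rng or random.Random(0)
    best, best_f = list(seed), fitness(seed)
    for _ in range(iters):
        for _ in range(lam):
            cand = [g + rng.gauss(0.0, sigma) for g in best]
            f = fitness(cand)
            if f > best_f:          # elitist: never discard the best grasp
                best, best_f = cand, f
    return best, best_f

# Toy stand-in fitness: stability peaks at grasp parameters (0.3, -0.2).
target = (0.3, -0.2)
fit = lambda g: -sum((a - b) ** 2 for a, b in zip(g, target))
g, f = refine_grasp([0.0, 0.0], fit)
```

Because the loop only ever improves the seed, it mirrors the paper's point that precomputed analytical candidates are refined rather than filtered away.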
cs.RO / 7 / 2602.15258
SEG-JPEG: Simple Visual Semantic Communications for Remote Operation of Automated Vehicles over Unreliable Wireless Networks
SEG-JPEG:用于不可靠无线网络下自动驾驶车辆远程操作的简单视觉语义通信
Abstract
Remote Operation is touted as being key to the rapid deployment of automated vehicles. Streaming imagery to control connected vehicles remotely currently requires a reliable, high-throughput network connection, which can be limited in real-world remote operation deployments relying on public network infrastructure. This paper investigates how the application of computer-vision-assisted semantic communication can be used to circumvent data loss and corruption associated with traditional image compression techniques. By encoding the segmentations of detected road users into colour-coded highlights within low-resolution greyscale imagery, the required data rate can be reduced by 50% compared with conventional techniques, while maintaining visual clarity. This enables a median glass-to-glass latency of below 200 ms even when the network data rate is below 500 kbit/s, while clearly outlining salient road users to enhance situational awareness of the remote operator. The approach is demonstrated in an area of variable 4G mobile connectivity using an automated last-mile delivery vehicle. With this technique, the results indicate that large-scale deployment of remotely operated automated vehicles could be possible even on the often constrained public 4G/5G mobile network, providing the potential to expedite the nationwide roll-out of automated vehicles.
Chinese Translation
远程操作被认为是自动驾驶车辆快速部署的关键。目前,远程控制连接车辆所需的图像流传输需要可靠的高吞吐量网络连接,而在依赖公共网络基础设施的实际远程操作部署中,这种连接可能受到限制。本文研究了如何应用计算机视觉辅助的语义通信来规避传统图像压缩技术所带来的数据丢失和损坏。通过将检测到的道路使用者的分割信息编码为低分辨率灰度图像中的彩色高亮,所需的数据传输速率可以比传统技术降低50%,同时保持视觉清晰度。这使得即使在网络数据速率低于500kbit/s的情况下,玻璃到玻璃的中位延迟也可以低于200毫秒,同时清晰地勾勒出显著的道路使用者,以增强远程操作员的情境意识。该方法在4G移动连接不稳定的区域中使用自动化最后一公里配送车辆进行了演示。通过这种技术,结果表明,即使在常常受限的公共4G/5G移动网络上,远程操作的自动驾驶车辆的大规模部署也是可能的,从而有潜力加速自动驾驶车辆的全国推广。
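The core encoding idea (colour-coded road-user highlights painted onto a low-resolution greyscale frame) can be sketched as a simple mask overlay. The class palette, frame size, and mask format below are illustrative assumptions, not the SEG-JPEG wire format:

```python
# Sketch of the overlay step: segmentation masks of detected road users
# are painted as colour-coded highlights on a greyscale frame, so only
# one intensity channel plus a few masks needs to be transmitted.
# Palette, frame size, and mask encoding are illustrative assumptions.

PALETTE = {"pedestrian": (255, 0, 0), "vehicle": (0, 255, 0)}

def overlay(grey, masks):
    """grey: HxW intensities 0-255; masks: {class: set of (row, col)}.
    Returns an HxW RGB image: grey replicated to 3 channels, with
    highlighted pixels replaced by their class colour."""
    h, w = len(grey), len(grey[0])
    out = [[(grey[r][c],) * 3 for c in range(w)] for r in range(h)]
    for cls, pixels in masks.items():
        for r, c in pixels:
            out[r][c] = PALETTE[cls]
    return out

# A tiny 3x4 frame with one highlighted pedestrian pixel.
frame = [[100] * 4 for _ in range(3)]
img = overlay(frame, {"pedestrian": {(1, 2)}})
```

In deployment the greyscale base would come from an aggressively compressed stream, with the masks adding only a small, semantically targeted overhead.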
cs.RO / 8 / 2602.15309
OSCAR: An Ovipositor-Inspired Self-Propelling Capsule Robot for Colonoscopy
OSCAR:一种用于结肠镜检查的受产卵器启发自推进胶囊机器人
Abstract
Self-propelling robotic capsules eliminate shaft looping of conventional colonoscopy, reducing patient discomfort. However, reliably moving within the slippery, viscoelastic environment of the colon remains a significant challenge. We present OSCAR, an ovipositor-inspired self-propelling capsule robot that translates the transport strategy of parasitic wasps into a propulsion mechanism for colonoscopy. OSCAR mechanically encodes the ovipositor-inspired motion pattern through a spring-loaded cam system that drives twelve circumferential sliders in a coordinated, phase-shifted sequence. By tuning the motion profile to maximize the retract phase relative to the advance phase, the capsule creates a controlled friction anisotropy at the interface that generates net forward thrust. We developed an analytical model incorporating a Kelvin-Voigt formulation to capture the viscoelastic stick-slip interactions between the sliders and the tissue, linking the asymmetry between advance and retract phase durations to mean thrust, and slider-reversal synchronization to thrust stability. Comprehensive force characterization experiments in ex-vivo porcine colon revealed a mean steady-state traction force of 0.85 N, closely matching the model. Furthermore, experiments confirmed that thrust generation is speed-independent and scales linearly with the phase asymmetry, in agreement with theoretical predictions, underscoring the capsule's predictable performance and scalability. In locomotion validation experiments, OSCAR demonstrated robust performance, achieving an average speed of 3.08 mm/s, a velocity sufficient to match the cecal intubation times of conventional colonoscopy. By coupling phase-encoded friction anisotropy with a predictive model, OSCAR delivers controllable thrust generation at low normal loads, enabling safer and more robust self-propelling locomotion for robotic capsule colonoscopy.
Chinese Translation
自推进机器人胶囊消除了传统结肠镜检查中的轴环绕现象,从而减少了患者的不适。然而,在结肠湿滑的粘弹性环境中可靠地移动仍然是一个重大挑战。我们提出了OSCAR,一种受产卵器启发的自推进胶囊机器人,将寄生蜂的运输策略转化为结肠镜检查的推进机制。OSCAR通过一个弹簧加载凸轮系统机械编码了受产卵器启发的运动模式,该系统以协调的相位偏移序列驱动十二个周向滑块。通过调整运动轮廓以最大化收回阶段相对于推进阶段的时间,胶囊在界面上产生可控的摩擦各向异性,从而产生净向前推力。我们开发了一个包含Kelvin-Voigt模型的分析模型,以捕捉滑块与组织之间的粘弹性粘滑相互作用,将推进阶段和收回阶段持续时间之间的不对称性与平均推力联系起来,并将滑块反转同步与推力稳定性关联。对离体猪结肠的全面力学特性实验显示,平均稳态牵引力为0.85 N,接近模型预测。此外,实验确认推力生成与速度无关,并与相位不对称性线性相关,这与理论预测一致,强调了胶囊的可预测性能和可扩展性。在运动验证实验中,OSCAR表现出强大的性能,平均速度达到3.08 mm/s,足以匹配传统结肠镜检查的盲肠插管时间。通过将相位编码的摩擦各向异性与预测模型相结合,OSCAR在低法向载荷下实现了可控的推力生成,从而为机器人胶囊结肠镜检查提供了更安全、更稳健的自推进运动。
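The reported scaling (mean thrust linear in the advance/retract phase asymmetry and independent of drive speed) can be captured by a toy anisotropic-friction model. The paper itself uses a Kelvin-Voigt viscoelastic formulation; the single friction coefficient and contact force below are illustrative simplifications:

```python
def mean_thrust(t_advance, t_retract, f_contact, mu=0.35):
    """Toy anisotropic-friction reading of the reported scaling: net
    thrust grows linearly with the normalized asymmetry between retract
    and advance phase durations, with no dependence on drive speed.
    The paper's Kelvin-Voigt viscoelastic contact model is replaced
    here by a single dry-friction coefficient `mu`; both `mu` and
    `f_contact` are illustrative values."""
    asym = (t_retract - t_advance) / (t_retract + t_advance)
    return mu * f_contact * asym

# Symmetric phases produce no net thrust; biasing toward retract does.
symmetric = mean_thrust(1.0, 1.0, 2.0)
biased = mean_thrust(1.0, 3.0, 2.0)
```

The speed-independence falls out of the model because only the phase-duration ratio, not the absolute cam speed, enters the expression.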
cs.RO / 9 / 2602.15351
Feasibility-aware Imitation Learning from Observation with Multimodal Feedback
考虑可行性的基于观察的模仿学习与多模态反馈
Abstract
Imitation learning frameworks that learn robot control policies from demonstrators' motions via hand-mounted demonstration interfaces have attracted increasing attention. However, due to differences in physical characteristics between demonstrators and robots, this approach faces two limitations: i) the demonstration data do not include robot actions, and ii) the demonstrated motions may be infeasible for robots. These limitations make policy learning difficult. To address them, we propose Feasibility-Aware Behavior Cloning from Observation (FABCO). FABCO integrates behavior cloning from observation, which complements robot actions using robot dynamics models, with feasibility estimation. In feasibility estimation, the demonstrated motions are evaluated using a robot-dynamics model, learned from the robot's execution data, to assess reproducibility under the robot's dynamics. The estimated feasibility is used for multimodal feedback and feasibility-aware policy learning to improve the demonstrator's motions and learn robust policies. Multimodal feedback provides feasibility through the demonstrator's visual and haptic senses to promote feasible demonstrated motions. Feasibility-aware policy learning reduces the influence of demonstrated motions that are infeasible for robots, enabling the learning of policies that robots can execute stably. We conducted experiments with 15 participants on two tasks and confirmed that FABCO improves imitation learning performance by more than 3.2 times compared to the case without feasibility feedback.
Chinese Translation
通过手持演示接口从示范者的动作中学习机器人控制策略的模仿学习框架引起了越来越多的关注。然而,由于示范者与机器人之间在物理特性上的差异,这种方法面临两个限制:i) 演示数据不包括机器人的动作,ii) 演示的动作对于机器人而言可能不可行。这些限制使得策略学习变得困难。为了解决这些问题,我们提出了考虑可行性的基于观察的行为克隆(Feasibility-Aware Behavior Cloning from Observation,FABCO)。FABCO将基于观察的行为克隆与可行性评估相结合,利用机器人动力学模型补充机器人的动作。在可行性评估中,使用从机器人执行数据中学习的机器人动力学模型对演示的动作进行评估,以判断在机器人动力学下的可重现性。估计的可行性用于多模态反馈和考虑可行性的策略学习,以改善示范者的动作并学习稳健的策略。多模态反馈通过示范者的视觉和触觉感知提供可行性,以促进可行的演示动作。考虑可行性的策略学习减少了对机器人而言不可行的演示动作的影响,从而使得学习能够稳定执行的策略成为可能。我们对15名参与者进行了两项任务的实验,确认FABCO在模仿学习性能上比没有可行性反馈的情况提高了超过3.2倍。
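The feasibility-aware policy learning described above amounts to down-weighting infeasible demonstrations in the imitation loss. Below is a minimal sketch with a scalar linear policy and hand-picked feasibility weights; these are toy assumptions, not FABCO's actual networks, dynamics model, or feasibility estimator:

```python
def weighted_bc_step(theta, batch, lr=0.1):
    """One gradient step of feasibility-weighted behavior cloning for a
    scalar linear policy a = theta * s. Each sample carries a
    feasibility weight w in [0, 1]; infeasible demonstrations (w ~ 0)
    contribute nothing to the squared-error loss, mirroring the idea in
    the abstract. Policy class, gains, and data are toy assumptions."""
    grad = 0.0
    for s, a, w in batch:   # (state, demonstrated action, feasibility)
        grad += w * 2.0 * (theta * s - a) * s
    return theta - lr * grad / len(batch)

# Feasible demos follow a = 2*s; one infeasible demo (w = 0) would pull
# the fit toward a = -5*s if it were not down-weighted.
batch = [(1.0, 2.0, 1.0), (2.0, 4.0, 1.0), (1.0, -5.0, 0.0)]
theta = 0.0
for _ in range(200):
    theta = weighted_bc_step(theta, batch)
```

With the weight at zero the contaminated sample drops out entirely, and the fit recovers the feasible demonstrations' policy.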
cs.RO / 10 / 2602.15354
A Comparison of Bayesian Prediction Techniques for Mobile Robot Trajectory Tracking
移动机器人轨迹跟踪的贝叶斯预测技术比较
Abstract
This paper presents a performance comparison of different estimation and prediction techniques applied to the problem of tracking multiple robots. The main performance criteria are the magnitude of the estimation or prediction error, the computational effort and the robustness of each method to non-Gaussian noise. Among the different techniques compared are the well known Kalman filters and their different variants (e.g. extended and unscented), and the more recent techniques relying on Sequential Monte Carlo Sampling methods, such as particle filters and Gaussian Mixture Sigma Point Particle Filter.
Chinese Translation
本文对应用于多机器人跟踪问题的不同估计和预测技术进行了性能比较。主要的性能标准包括估计或预测误差的大小、计算工作量以及每种方法对非高斯噪声的鲁棒性。比较的不同技术包括众所周知的卡尔曼滤波器及其不同变体(如扩展卡尔曼滤波器和无迹卡尔曼滤波器),以及依赖于序列蒙特卡洛采样方法的较新技术,如粒子滤波器和高斯混合西格玛点粒子滤波器。
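As a concrete anchor for the comparison above, the simplest member of the Kalman family is the scalar filter below, here tracking a constant position from noisy measurements; the random-walk state model and noise tuning are illustrative, not the paper's multi-robot setup:

```python
import random

def kalman_1d(zs, q=1e-3, r=0.25):
    """Scalar Kalman filter with a random-walk state model, the
    simplest member of the estimator family the paper compares.
    q: process-noise variance, r: measurement-noise variance
    (both illustrative tuning values)."""
    x, p = zs[0], 1.0
    est = []
    for z in zs:
        p += q                  # predict: state uncertainty grows
        k = p / (p + r)         # Kalman gain
        x += k * (z - x)        # update with the new measurement
        p *= (1.0 - k)
        est.append(x)
    return est

# Noisy measurements of a stationary target at position 2.0.
rng = random.Random(0)
truth = 2.0
zs = [truth + rng.gauss(0.0, 0.5) for _ in range(500)]
est = kalman_1d(zs)
```

The extended/unscented variants and the particle filters the paper compares generalize exactly this predict/update loop to nonlinear dynamics and non-Gaussian noise.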
cs.RO / 11 / 2602.15357
Fluoroscopy-Constrained Magnetic Robot Control via Zernike-Based Field Modeling and Nonlinear MPC
基于Zernike多项式场建模和非线性模型预测控制的荧光透视约束磁性机器人控制
Abstract
Magnetic actuation enables surgical robots to navigate complex anatomical pathways while reducing tissue trauma and improving surgical precision. However, clinical deployment is limited by the challenges of controlling such systems under fluoroscopic imaging, which provides pose feedback only at a low frame rate and with substantial noise. This paper presents a control framework that remains accurate and stable under such conditions by combining a nonlinear model predictive control (NMPC) framework that directly outputs coil currents, an analytically differentiable magnetic field model based on Zernike polynomials, and a Kalman filter to estimate the robot state. Experimental validation is conducted with two magnetic robots in a 3D-printed fluid workspace and a spine phantom replicating drug delivery in the epidural space. Results show the proposed control method remains highly accurate when feedback is downsampled to 3 Hz with added Gaussian noise (sigma = 2 mm), mimicking clinical fluoroscopy. In the spine phantom experiments, the proposed method successfully executed a drug delivery trajectory with a root mean square (RMS) position error of 1.18 mm while maintaining safe clearance from critical anatomical boundaries.
Chinese Translation
磁性驱动使外科机器人能够在复杂的解剖路径中导航,同时减少组织创伤并提高外科精度。然而,由于荧光透视成像提供的低帧率和噪声姿态反馈,临床应用受到限制。本文提出了一种控制框架,通过结合直接输出线圈电流的非线性模型预测控制(NMPC)框架、基于Zernike多项式的解析可微磁场模型以及用于估计机器人状态的卡尔曼滤波器,在这种条件下保持准确性和稳定性。实验验证在一个3D打印的流体工作空间和一个脊柱模型中进行,模拟药物在硬膜外空间的输送。结果表明,所提出的控制方法在反馈降采样至3 Hz并添加高斯噪声(σ = 2 mm)时仍保持高度准确,模拟临床荧光透视。在脊柱模型实验中,所提出的方法成功执行了药物输送轨迹,均方根(RMS)位置误差为1.18 mm,同时保持与关键解剖边界的安全间距。
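The Zernike basis underlying the field model is standard; its building block, the radial polynomial R_n^m, can be evaluated directly from the classical closed form (how the paper assembles these into a differentiable 3-D field model is its own contribution and not reproduced here):

```python
from math import factorial

def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m(rho) from the classical closed
    form: R_n^m(rho) = sum_k (-1)^k (n-k)! /
    (k! ((n+m)/2 - k)! ((n-m)/2 - k)!) * rho^(n-2k),
    which is zero when n - |m| is odd."""
    m = abs(m)
    if (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k)
                * factorial((n + m) // 2 - k)
                * factorial((n - m) // 2 - k)))
        total += c * rho ** (n - 2 * k)
    return total
```

Because each term is a plain power of rho, the basis is analytically differentiable, which is the property the NMPC formulation above exploits.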
cs.RO / 12 / 2602.15397
ActionCodec: What Makes for Good Action Tokenizers
ActionCodec:优秀动作标记器的设计要素
Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.
Chinese Translation
利用视觉-语言模型(VLMs)原生自回归范式的视觉-语言-动作(VLA)模型在指令跟随和训练效率方面表现出色。该范式的核心是动作标记化,但其设计主要集中在重建保真度上,未能解决其对VLA优化的直接影响。因此,"优秀动作标记器的设计要素是什么"这一基本问题仍未得到解答。本文通过从VLA优化的角度建立设计原则来填补这一空白。我们基于信息论的见解识别出一系列最佳实践,包括最大化时间标记重叠、最小化词汇冗余、增强多模态互信息和标记独立性。在这些原则的指导下,我们提出了ActionCodec,一种高性能的动作标记器,显著提高了训练效率和VLA在各种模拟和现实世界基准测试中的表现。值得注意的是,在LIBERO上,使用ActionCodec微调的SmolVLM2-2.2B模型在没有任何机器人预训练的情况下达到了95.5%的成功率。通过先进的架构增强,这一成功率提升至97.4%,代表了无机器人预训练的VLA模型的新最先进水平(SOTA)。我们相信,所建立的设计原则以及发布的模型将为社区提供清晰的路线图,以开发更有效的动作标记器。
cs.RO / 13 / 2602.15398
Hybrid F' and ROS2 Architecture for Vision-Based Autonomous Flight: Design and Experimental Validation
基于视觉的自主飞行的混合 F' 和 ROS2 架构:设计与实验验证
Abstract
Autonomous aerospace systems require architectures that balance deterministic real-time control with advanced perception capabilities. This paper presents an integrated system combining NASA's F' flight software framework with ROS2 middleware via Protocol Buffers bridging. We evaluate the architecture through a 32.25-minute indoor quadrotor flight test using vision-based navigation. The vision system achieved 87.19 Hz position estimation with 99.90% data continuity and 11.47 ms mean latency, validating real-time performance requirements. All 15 ground commands executed successfully with a 100% success rate, demonstrating robust F'-PX4 integration. System resource utilization remained low (15.19% CPU, 1,244 MB RAM) with zero stale telemetry messages, confirming efficient operation on embedded platforms. Results validate the feasibility of hybrid flight-software architectures combining certification-grade determinism with flexible autonomy for autonomous aerial vehicles.
Chinese Translation
自主航空系统需要能够在确定性实时控制与先进感知能力之间取得平衡的架构。本文提出了一种集成系统,通过协议缓冲区(Protocol Buffers)桥接,将 NASA 的 F' 飞行软件框架与 ROS2 中间件相结合。我们通过一次 32.25 分钟的室内四旋翼飞行测试评估该架构,该测试采用基于视觉的导航。视觉系统实现了 87.19 Hz 的位置估计,数据连续性达到 99.90%,平均延迟为 11.47 毫秒,验证了实时性能要求。所有 15 个地面指令均成功执行,成功率为 100%,展示了 F' 与 PX4 的强健集成。系统资源利用率保持在低水平(CPU 15.19%,RAM 1,244 MB),且没有过时的遥测消息,确认了在嵌入式平台上的高效运行。结果验证了将认证级确定性与灵活自主相结合的混合飞行软件架构在自主飞行器中的可行性。
cs.RO / 14 / 2602.15400
One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation
一个代理引导他们所有人:通过明确的世界表示赋能多模态大型语言模型进行视觉与语言导航
Abstract
A navigable agent needs to understand both high-level semantic instructions and precise spatial perceptions. Building navigation agents centered on Multimodal Large Language Models (MLLMs) demonstrates a promising solution due to their powerful generalization ability. However, the current tightly coupled design dramatically limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with and reason on it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit MLLMs' capacity, while the metric world representation ensures the physical validity of the produced actions. We conduct comprehensive experiments in both simulated and real-world environments. Our method establishes a new zero-shot state-of-the-art, achieving 48.8% Success Rate (SR) in R2R-CE and 42.2% in RxR-CE benchmarks. Furthermore, to validate the versatility of our metric representation, we demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language navigation.
Chinese Translation
一个可导航的代理需要理解高层次的语义指令和精确的空间感知。基于多模态大型语言模型(MLLMs)构建导航代理展示了一个有前景的解决方案,因为它们具有强大的泛化能力。然而,当前紧密耦合的设计显著限制了系统性能。在本研究中,我们提出了一种解耦设计,将低层次的空间状态估计与高层次的语义规划分开。与依赖于预定义的、过于简化的文本地图的先前方法不同,我们引入了一种交互式度量世界表示,保持丰富且一致的信息,使得MLLMs能够与其互动并进行推理以支持决策。此外,引入反事实推理进一步激发了MLLMs的能力,而度量世界表示确保了所产生动作的物理有效性。我们在模拟和现实环境中进行了全面实验。我们的方法建立了新的零样本最先进水平,在R2R-CE基准中实现了48.8%的成功率(SR),在RxR-CE基准中实现了42.2%。此外,为了验证我们度量表示的通用性,我们展示了在多种具身形态之间的零样本模拟到现实迁移,包括轮式TurtleBot 4和定制的空中无人机。这些现实世界的部署验证了我们的解耦框架作为一个强健的、领域不变的接口,适用于具身视觉与语言导航。
cs.RO / 15 / 2602.15424
Lyapunov-Based $\mathcal{L}_2$-Stable PI-Like Control of a Four-Wheel Independently Driven and Steered Robot
基于李雅普诺夫的 $\mathcal{L}_2$ 稳定 PI 类控制的四轮独立驱动和转向机器人
Abstract
In this letter, Lyapunov-based synthesis of a PI-like controller is proposed for $\mathcal{L}_2$-stable motion control of an independently driven and steered four-wheel mobile robot. An explicit, structurally verified model is used to enable systematic controller design with stability and performance guarantees suitable for real-time operation. A Lyapunov function is constructed to yield explicit bounds and $\mathcal{L}_2$ stability results, supporting feedback synthesis that reduces configuration dependent effects. The resulting control law maintains a PI-like form suitable for standard embedded implementation while preserving rigorous stability properties. Effectiveness and robustness are demonstrated experimentally on a real four-wheel mobile robot platform.
Chinese Translation
在本文中,提出了一种基于李雅普诺夫的 PI 类控制器合成方法,用于独立驱动和转向的四轮移动机器人进行 $\mathcal{L}_2$ 稳定运动控制。采用一种显式且结构上经过验证的模型,以实现系统化的控制器设计,并确保适合实时操作的稳定性和性能保证。构建了一个李雅普诺夫函数,以提供显式界限和 $\mathcal{L}_2$ 稳定性结果,支持反馈合成,减少配置依赖效应。所得控制律保持了适合标准嵌入式实现的 PI 类形式,同时保留了严格的稳定性特性。通过在真实的四轮移动机器人平台上的实验,验证了该方法的有效性和鲁棒性。
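The PI-like form suitable for embedded implementation can be sketched as a discrete update with anti-windup clamping. The gains, saturation level, and first-order test plant below are illustrative assumptions, not the synthesized values from the letter:

```python
def make_pi(kp, ki, dt, u_max=1.0):
    """Discrete PI-like control law with simple anti-windup clamping.
    Gains and saturation are illustrative, not the letter's
    Lyapunov-synthesized values."""
    state = {"i": 0.0}
    def ctrl(err):
        state["i"] += ki * err * dt
        state["i"] = max(-u_max, min(u_max, state["i"]))  # anti-windup
        u = kp * err + state["i"]
        return max(-u_max, min(u_max, u))                 # actuator limit
    return ctrl

# Drive a toy first-order plant x' = -x + u toward the setpoint 0.5.
pi, x, dt = make_pi(2.0, 4.0, 0.01), 0.0, 0.01
for _ in range(3000):
    x += dt * (-x + pi(0.5 - x))
```

The integrator removes steady-state error against the plant's drift term, which is exactly the role the PI-like structure plays while the Lyapunov analysis supplies the stability bounds.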
cs.RO / 16 / 2602.15513
Improving MLLMs in Embodied Exploration and Question Answering with Human-Inspired Memory Modeling
通过人类启发的记忆建模提升具身探索和问答中的多模态大语言模型
Abstract
Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM MatchXSPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
Chinese Translation
将多模态大语言模型作为具身智能体的大脑仍然面临挑战,尤其是在长时间观察和有限上下文预算的情况下。现有的记忆辅助方法通常依赖于文本摘要,这会丢失丰富的视觉和空间细节,并且在非平稳环境中表现脆弱。在本研究中,我们提出了一种非参数记忆框架,明确区分具身探索和问答中的情节记忆与语义记忆。我们的检索优先、推理辅助的范式通过语义相似性回忆情节经验,并通过视觉推理进行验证,从而实现过去观察的稳健重用,而无需严格的几何对齐。同时,我们引入了一种程序式规则提取机制,将经验转化为结构化、可重用的语义记忆,促进跨环境的泛化。大量实验表明,在具身问答和探索基准测试中,我们的方法实现了最先进的性能,在 A-EQA 上 LLM-Match 提升了 7.3%,LLM MatchXSPL 提升了 11.4%,在 GOAT-Bench 上成功率提升了 7.7%,SPL 提升了 6.8%。分析表明,我们的情节记忆主要提高了探索效率,而语义记忆则增强了具身智能体的复杂推理能力。
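The retrieval-first step above reduces to ranking stored episodes by embedding similarity before any reasoning happens. The sketch below uses cosine similarity over toy 2-D embeddings and omits the paper's visual-reasoning verification stage; all names and vectors are illustrative:

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def recall(query_emb, episodes, k=2):
    """Retrieval-first step: rank stored episodic experiences by
    semantic similarity of their embeddings. The visual-reasoning
    verification stage from the paper is omitted; the 2-D embeddings
    below are toy values."""
    ranked = sorted(episodes, key=lambda e: cosine(query_emb, e["emb"]),
                    reverse=True)
    return ranked[:k]

episodes = [{"id": "kitchen", "emb": [1.0, 0.1]},
            {"id": "garage",  "emb": [0.1, 1.0]},
            {"id": "hallway", "emb": [0.7, 0.7]}]
top = recall([0.9, 0.2], episodes, k=1)
```

Because matching happens in embedding space rather than metric space, past observations can be reused without the rigid geometric alignment the abstract calls out.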
cs.RO / 17 / 2602.15533
Efficient Knowledge Transfer for Jump-Starting Control Policy Learning of Multirotors through Physics-Aware Neural Architectures
通过物理感知神经架构实现高效知识迁移以加速多旋翼控制策略学习
Abstract
Efficiently training control policies for robots is a major challenge that can greatly benefit from utilizing knowledge gained from training similar systems through cross-embodiment knowledge transfer. In this work, we focus on accelerating policy training using a library-based initialization scheme that enables effective knowledge transfer across multirotor configurations. By leveraging a physics-aware neural control architecture that combines a reinforcement learning-based controller and a supervised control allocation network, we enable the reuse of previously trained policies. To this end, we utilize a policy evaluation-based similarity measure that identifies suitable policies for initialization from a library. We demonstrate that this measure correlates with the reduction in environment interactions needed to reach target performance and is therefore suited for initialization. Extensive simulation and real-world experiments confirm that our control architecture achieves state-of-the-art control performance, and that our initialization scheme saves on average up to $73.5\%$ of environment interactions (compared to training a policy from scratch) across diverse quadrotor and hexarotor designs, paving the way for efficient cross-embodiment transfer in reinforcement learning.
Chinese Translation
高效训练机器人控制策略是一个主要挑战,而利用从训练相似系统中获得的知识进行跨具身知识迁移可以带来显著的好处。在本研究中,我们专注于使用基于库的初始化方案加速策略训练,该方案能够实现多旋翼配置之间的有效知识迁移。通过利用一种物理感知的神经控制架构,该架构结合了基于强化学习的控制器和监督式控制分配网络,我们实现了对先前训练策略的重用。为此,我们利用基于策略评估的相似性度量,从库中识别适合初始化的策略。我们证明了该度量与达到目标性能所需的环境交互量的减少相关,因此适合用于初始化。广泛的仿真和真实世界实验证实,我们的控制架构实现了最先进的控制性能,并且我们的初始化方案在不同的四旋翼和六旋翼设计中(与从头训练策略相比)平均节省了高达73.5%的环境交互,为强化学习中高效的跨具身迁移铺平了道路。
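The library-based initialization idea, scoring each pretrained policy by briefly evaluating it on the target embodiment and warm-starting from the best, can be sketched as follows; the toy environment and policies are stand-ins, not the paper's multirotor models:

```python
def evaluate(policy, env, episodes=5):
    """Average return of a policy on the target embodiment (stand-in rollout)."""
    returns = []
    for _ in range(episodes):
        total, state = 0.0, env.reset()
        for _ in range(100):
            state, reward, done = env.step(policy(state))
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)

def select_initialization(library, env):
    """Pick the library policy with the highest evaluation score; this score is
    the similarity proxy used to jump-start training (illustrative)."""
    scores = {name: evaluate(pi, env) for name, pi in library.items()}
    return max(scores, key=scores.get), scores

class ToyEnv:
    """1-D setpoint task standing in for a new multirotor configuration."""
    def __init__(self, target):
        self.target = target
    def reset(self):
        self.x = 0.0
        return self.x
    def step(self, action):
        self.x += action
        reward = -abs(self.target - self.x)
        return self.x, reward, abs(self.target - self.x) < 1e-3

library = {
    "quad_x": lambda s: 0.5,   # pretrained on a similar platform
    "hexa_h": lambda s: -0.5,  # pretrained on a dissimilar platform
}
best, scores = select_initialization(library, ToyEnv(target=5.0))
print(best)
```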
cs.RO / 18 / 2602.15543
Selective Perception for Robot: Task-Aware Attention in Multimodal VLA
机器人选择性感知:多模态视觉-语言-动作中的任务感知注意力
Abstract
In robotics, Vision-Language-Action (VLA) models that integrate diverse multimodal signals from multi-view inputs have emerged as an effective approach. However, most prior work adopts static fusion that processes all visual inputs uniformly, which incurs unnecessary computational overhead and allows task-irrelevant background information to act as noise. Inspired by the principles of human active perception, we propose a dynamic information fusion framework designed to maximize the efficiency and robustness of VLA models. Our approach introduces a lightweight adaptive routing architecture that analyzes the current text prompt and observations from a wrist-mounted camera in real time to predict the task-relevance of multiple camera views. By conditionally attenuating computations for views with low informational utility and selectively providing only essential visual features to the policy network, our framework achieves computation efficiency proportional to task relevance. Furthermore, to efficiently secure large-scale annotation data for router training, we established an automated labeling pipeline utilizing Vision-Language Models (VLMs) to minimize data collection and annotation costs. Experimental results in real-world robotic manipulation scenarios demonstrate that the proposed approach achieves significant improvements in both inference efficiency and control performance compared to existing VLA models, validating the effectiveness and practicality of dynamic information fusion in resource-constrained, real-time robot control environments.
Chinese Translation
在机器人技术中,整合来自多视角输入的多样化多模态信号的视觉-语言-动作(VLA)模型已成为一种有效的方法。然而,大多数先前的研究采用静态融合,统一处理所有视觉输入,这不仅带来了不必要的计算开销,还使与任务无关的背景信息成为噪声。受人类主动感知原则的启发,我们提出了一种动态信息融合框架,旨在最大化VLA模型的效率和鲁棒性。我们的方法引入了一种轻量级自适应路由架构,实时分析当前文本提示和来自腕部安装相机的观察,以预测多个相机视角的任务相关性。通过有条件地减弱低信息效用视角的计算,并选择性地仅向策略网络提供必要的视觉特征,我们的框架实现了与任务相关性成比例的计算效率。此外,为了高效获取用于路由器训练的大规模标注数据,我们建立了一个利用视觉-语言模型(VLM)的自动标注流程,以最小化数据收集和标注成本。在真实世界机器人操作场景中的实验结果表明,与现有的VLA模型相比,所提出的方法在推理效率和控制性能上均取得了显著改善,验证了动态信息融合在资源受限的实时机器人控制环境中的有效性和实用性。
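A minimal sketch of the adaptive view routing: predicted relevance scores gate which camera views are processed, with low-utility views attenuated to zeros so downstream shapes stay fixed. The gating function and threshold are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

def route_views(view_features, relevance_logits, threshold=0.5):
    """Keep only camera views whose predicted task-relevance clears a threshold;
    skipped views contribute zeros so the fused feature shape stays fixed."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(relevance_logits)))  # sigmoid
    fused = []
    for feat, p in zip(view_features, probs):
        if p >= threshold:
            fused.append(np.asarray(feat) * p)   # weight retained views
        else:
            fused.append(np.zeros_like(feat))    # attenuate irrelevant views
    return np.concatenate(fused), probs >= threshold

features = [np.ones(4), np.ones(4) * 2.0, np.ones(4) * 3.0]
fused, kept = route_views(features, relevance_logits=[2.0, -3.0, 0.5])
print(kept.tolist())
```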
cs.RO / 19 / 2602.15549
VLM-DEWM: Dynamic External World Model for Verifiable and Resilient Vision-Language Planning in Manufacturing
VLM-DEWM:用于制造业中可验证且具韧性的视觉-语言规划的动态外部世界模型
Abstract
Vision-language models (VLMs) show promise for high-level planning in smart manufacturing, yet their deployment in dynamic workcells faces two critical challenges: (1) stateless operation: they cannot persistently track out-of-view states, causing world-state drift; and (2) opaque reasoning: failures are difficult to diagnose, leading to costly blind retries. This paper presents VLM-DEWM, a cognitive architecture that decouples VLM reasoning from world-state management through a persistent, queryable Dynamic External World Model (DEWM). Each VLM decision is structured into an Externalizable Reasoning Trace (ERT), comprising an action proposal, a world belief, and a causal assumption, which is validated against the DEWM before execution. When failures occur, discrepancy analysis between predicted and observed states enables targeted recovery instead of global replanning. We evaluate VLM-DEWM on multi-station assembly, large-scale facility exploration, and real-robot recovery under induced failures. Compared to baseline memory-augmented VLM systems, VLM-DEWM improves state-tracking accuracy from 56% to 93%, increases recovery success rate from below 5% to 95%, and significantly reduces computational overhead through structured memory. These results establish VLM-DEWM as a verifiable and resilient solution for long-horizon robotic operations in dynamic manufacturing environments.
Chinese Translation
视觉-语言模型(VLM)在智能制造的高层规划中展现出良好前景,但其在动态工作单元中的部署面临两个关键挑战:(1)无状态操作:无法持续跟踪视野外的状态,导致世界状态漂移;(2)推理不透明:故障难以诊断,导致代价高昂的盲目重试。本文提出了VLM-DEWM,一种认知架构,通过持久的、可查询的动态外部世界模型(DEWM)将VLM推理与世界状态管理解耦。每个VLM决策被结构化为一个可外部化的推理轨迹(ERT),包括行动提案、世界信念和因果假设,并在执行前与DEWM进行验证。当发生故障时,通过对预测状态与观察状态之间的差异进行分析,能够进行有针对性的恢复,而不是全局重新规划。我们在多工位装配、大规模设施探索和诱发故障下的真实机器人恢复任务中评估了VLM-DEWM。与基线的记忆增强VLM系统相比,VLM-DEWM将状态跟踪准确率从56%提高到93%,将恢复成功率从低于5%提升至95%,并通过结构化记忆显著降低了计算开销。这些结果确立了VLM-DEWM作为动态制造环境中长时程机器人操作的可验证且具韧性的解决方案。
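Validating a reasoning trace against the external world model before execution reduces, at its simplest, to a field-by-field belief check. The ERT fields below follow the abstract's description (action proposal, world belief, causal assumption), but the concrete schema is hypothetical:

```python
def validate_ert(ert, world_model):
    """Check an Externalizable Reasoning Trace against the external world model
    before execution; returns (ok, discrepancies). Field names are illustrative."""
    discrepancies = {}
    for key, believed in ert["world_belief"].items():
        actual = world_model.get(key)
        if actual != believed:
            discrepancies[key] = {"believed": believed, "actual": actual}
    return len(discrepancies) == 0, discrepancies

# queryable world state maintained outside the VLM
dewm = {"gripper": "empty", "part_A": "station_2"}
ert = {
    "action": "pick(part_A, station_1)",
    "world_belief": {"gripper": "empty", "part_A": "station_1"},
    "causal_assumption": "part_A has not moved since last observation",
}
ok, diff = validate_ert(ert, dewm)
print(ok, diff)
```

The returned discrepancy set is what would drive targeted recovery (replan only the stale belief) instead of a global replan.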
cs.RO / 20 / 2602.15567
Constraining Streaming Flow Models for Adapting Learned Robot Trajectory Distributions
约束流式流动模型以适应学习到的机器人轨迹分布
Abstract
Robot motion distributions often exhibit multi-modality and require flexible generative models for accurate representation. Streaming Flow Policies (SFPs) have recently emerged as a powerful paradigm for generating robot trajectories by integrating learned velocity fields directly in action space, enabling smooth and reactive control. However, existing formulations lack mechanisms for adapting trajectories post-training to enforce safety and task-specific constraints. We propose Constraint-Aware Streaming Flow (CASF), a framework that augments streaming flow policies with constraint-dependent metrics that reshape the learned velocity field during execution. CASF models each constraint, defined in either the robot's workspace or configuration space, as a differentiable distance function that is converted into a local metric and pulled back into the robot's control space. Far from restricted regions, the resulting metric reduces to the identity; near constraint boundaries, it smoothly attenuates or redirects motion, effectively deforming the underlying flow to maintain safety. This allows trajectories to be adapted in real time, ensuring that robot actions respect joint limits, avoid collisions, and remain within feasible workspaces, while preserving the multi-modal and reactive properties of streaming flow policies. We demonstrate CASF in simulated and real-world manipulation tasks, showing that it produces constraint-satisfying trajectories that remain smooth, feasible, and dynamically consistent, outperforming standard post-hoc projection baselines.
Chinese Translation
机器人运动分布通常表现出多模态特性,需要灵活的生成模型以实现准确表示。流式流动策略(Streaming Flow Policies, SFPs)最近作为一种强大的范式出现,通过直接在动作空间中积分学习到的速度场来生成机器人轨迹,从而实现平滑和反应式控制。然而,现有的公式缺乏在训练后调整轨迹的机制,以强制执行安全性和任务特定的约束。我们提出了约束感知流(Constraint-Aware Streaming Flow, CASF),这是一个通过约束相关度量增强流式流动策略的框架,在执行过程中重塑学习到的速度场。CASF将每个约束(无论是在机器人的工作空间还是配置空间中定义)建模为可微分的距离函数,该函数被转换为局部度量并拉回到机器人的控制空间。在远离受限区域时,所得度量退化为单位矩阵;在约束边界附近,它平滑地衰减或重定向运动,有效地变形底层流以保持安全。这使得轨迹能够实时调整,确保机器人动作遵守关节限制、避免碰撞并保持在可行的工作空间内,同时保留流式流动策略的多模态和反应特性。我们在仿真和现实世界的操作任务中演示了CASF,结果显示其产生满足约束的轨迹,这些轨迹保持平滑、可行且动态一致,优于标准的事后投影基线。
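The core mechanism, a constraint-dependent metric that is the identity far from the boundary and attenuates the component of motion along the constraint normal near it, can be sketched from a differentiable distance function. This is a minimal sketch assuming a half-space constraint and a linear attenuation schedule; both are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def constraint_metric(x, dist_fn, margin=0.5):
    """Local metric from a differentiable distance-to-constraint function:
    identity far away, attenuating the normal component near the boundary."""
    d = dist_fn(x)
    eps = 1e-5
    # numerical gradient of the distance field gives the constraint normal
    grad = np.array([(dist_fn(x + eps * e) - d) / eps for e in np.eye(len(x))])
    n = grad / (np.linalg.norm(grad) + 1e-9)
    alpha = min(1.0, max(0.0, d / margin))  # 1 far from boundary, 0 at it
    # M = I - (1 - alpha) n n^T : shrinks motion along the constraint normal
    return np.eye(len(x)) - (1.0 - alpha) * np.outer(n, n)

def shaped_velocity(x, v, dist_fn):
    """Deform the learned flow's velocity v through the local metric at x."""
    return constraint_metric(x, dist_fn) @ v

# half-space constraint: stay in x[0] <= 1 (signed distance to the boundary)
dist = lambda x: 1.0 - x[0]
v = np.array([1.0, 1.0])
far = shaped_velocity(np.array([-2.0, 0.0]), v, dist)   # metric ~ identity
near = shaped_velocity(np.array([0.99, 0.0]), v, dist)  # normal motion attenuated
print(far.round(2), near.round(2))
```

Far from the boundary the velocity passes through unchanged; near it, only the component pushing into the constraint is suppressed, so tangential motion (and hence the policy's multi-modality) is preserved.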
cs.RO / 21 / 2602.15608
Grip as Needed, Glide on Demand: Ultrasonic Lubrication for Robotic Locomotion
根据需要抓握,按需滑行:用于机器人运动的超声波润滑
Abstract
Friction is the essential mediator of terrestrial locomotion, yet in robotic systems it is almost always treated as a passive property fixed by surface materials and conditions. Here, we introduce ultrasonic lubrication as a method to actively control friction in robotic locomotion. By exciting resonant structures at ultrasonic frequencies, contact interfaces can dynamically switch between "grip" and "slip" states, enabling locomotion. We developed two friction control modules, a cylindrical design for lumen-like environments and a flat-plate design for external surfaces, and integrated them into bio-inspired systems modeled after inchworm and wasp ovipositor locomotion. Both systems achieved bidirectional locomotion with nearly perfect locomotion efficiencies that exceeded 90%. Friction characterization experiments further demonstrated substantial friction reduction across various surfaces, including rigid, soft, granular, and biological tissue interfaces, under dry and wet conditions, and on surfaces with different levels of roughness, confirming the broad applicability of ultrasonic lubrication to locomotion tasks. These findings establish ultrasonic lubrication as a viable active friction control mechanism for robotic locomotion, with the potential to reduce design complexity and improve efficiency of robotic locomotion systems.
Chinese Translation
摩擦是地面运动的基本介质,但在机器人系统中,它几乎总是被视为由表面材料和条件固定的被动属性。在此,我们引入超声波润滑作为一种主动控制机器人运动中摩擦的方法。通过在超声波频率下激励共振结构,接触界面可以动态切换“抓握”和“滑动”状态,从而实现运动。我们开发了两种摩擦控制模块,一种为适用于管腔环境的圆柱形设计,另一种为适用于外部表面的平板设计,并将它们集成到模仿尺蠖和黄蜂产卵器运动的仿生系统中。这两种系统实现了双向运动,运动效率几乎完美,超过90%。摩擦特性实验进一步证明了在干燥和湿润条件下,以及在不同粗糙度的表面上,包括刚性、柔软、颗粒状和生物组织界面,摩擦显著降低,确认了超声波润滑在运动任务中的广泛适用性。这些发现确立了超声波润滑作为一种可行的主动摩擦控制机制,用于机器人运动,具有降低设计复杂性和提高机器人运动系统效率的潜力。
cs.RO / 22 / 2602.15633
SpecFuse: A Spectral-Temporal Fusion Predictive Control Framework for UAV Landing on Oscillating Marine Platforms
SpecFuse:一种用于无人机在振荡海洋平台上着陆的频谱-时间融合预测控制框架
Abstract
Autonomous landing of Uncrewed Aerial Vehicles (UAVs) on oscillating marine platforms is severely constrained by wave-induced multi-frequency oscillations, wind disturbances, and phase lags in motion prediction. Existing methods either treat platform motion as a general random process or lack explicit modeling of wave spectral characteristics, leading to suboptimal performance under dynamic sea conditions. To address these limitations, we propose SpecFuse: a novel spectral-temporal fusion predictive control framework that integrates frequency-domain wave decomposition with time-domain recursive state estimation for high-precision 6-DoF motion forecasting of Uncrewed Surface Vehicles (USVs). The framework explicitly models dominant wave harmonics to mitigate phase lags, refining predictions in real time via IMU data without relying on complex calibration. Additionally, we design a hierarchical control architecture featuring a sampling-based HPO-RRT* algorithm for dynamic trajectory planning under non-convex constraints and a learning-augmented predictive controller that fuses data-driven disturbance compensation with optimization-based execution. Extensive validations (2,000 simulations + 8 lake experiments) show our approach achieves a 3.2 cm prediction error, 4.46 cm landing deviation, 98.7% / 87.5% success rates (simulation / real-world), and 82 ms latency on embedded hardware, outperforming state-of-the-art methods by 44%-48% in accuracy. Its robustness to wave-wind coupling disturbances supports critical maritime missions such as search and rescue and environmental monitoring. All code, experimental configurations, and datasets will be released as open-source to facilitate reproducibility.
Chinese Translation
无人驾驶飞行器(UAV)在振荡海洋平台上的自主着陆受到波浪引起的多频振荡、风干扰和运动预测中的相位滞后等因素的严重制约。现有方法要么将平台运动视为一般随机过程,要么缺乏对波浪频谱特征的明确建模,导致在动态海洋条件下的性能不佳。为了解决这些局限性,我们提出了SpecFuse:一种新颖的频谱-时间融合预测控制框架,该框架将频域波浪分解与时域递归状态估计相结合,以实现无人水面载具(USV)的高精度6自由度运动预测。该框架明确建模主要波浪谐波,以减轻相位滞后,通过IMU数据实时优化预测,而无需依赖复杂的校准。此外,我们设计了一个分层控制架构,采用基于采样的HPO-RRT*算法进行非凸约束下的动态轨迹规划,并结合数据驱动的干扰补偿与基于优化的执行的学习增强型预测控制器。大量验证(2000次仿真 + 8次湖泊实验)表明,我们的方法在嵌入式硬件上实现了3.2厘米的预测误差、4.46厘米的着陆偏差、98.7% / 87.5%的成功率(仿真 / 现实世界)以及82毫秒的延迟,准确性比最先进的方法提高了44%-48%。其对波浪-风耦合干扰的鲁棒性支持了关键的海洋任务,如搜救和环境监测。所有代码、实验配置和数据集将作为开源发布,以促进可重复性。
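The frequency-domain half of such a scheme, extracting dominant wave harmonics and extrapolating them forward to avoid phase lag, can be sketched with an FFT. This is a minimal sketch for a single heave axis; the paper additionally fuses recursive time-domain estimation from IMU data:

```python
import numpy as np

def predict_heave(history, dt, horizon, n_harmonics=2):
    """Fit the dominant FFT harmonics of a heave record and extrapolate them
    forward (illustrative frequency-domain predictor, single axis)."""
    n = len(history)
    spec = np.fft.rfft(history)
    freqs = np.fft.rfftfreq(n, d=dt)
    # pick the strongest non-DC bins as dominant wave harmonics
    idx = np.argsort(np.abs(spec[1:]))[::-1][:n_harmonics] + 1
    t_future = np.arange(1, horizon + 1) * dt + (n - 1) * dt
    pred = np.zeros(horizon)
    for k in idx:
        amp = 2.0 * np.abs(spec[k]) / n
        phase = np.angle(spec[k])
        pred += amp * np.cos(2 * np.pi * freqs[k] * t_future + phase)
    return pred

dt = 0.05
t = np.arange(0, 10, dt)
heave = 0.3 * np.sin(2 * np.pi * 0.4 * t) + 0.1 * np.sin(2 * np.pi * 1.1 * t)
pred = predict_heave(heave, dt, horizon=20)
t_next = t[-1] + np.arange(1, 21) * dt
truth = 0.3 * np.sin(2 * np.pi * 0.4 * t_next) + 0.1 * np.sin(2 * np.pi * 1.1 * t_next)
max_err = float(np.max(np.abs(pred - truth)))
print(max_err < 0.05)
```

Because the predictor extrapolates the harmonics rather than filtering past samples, its forecast carries no phase lag for stationary wave components.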
cs.RO / 23 / 2602.15642
Spatially-Aware Adaptive Trajectory Optimization with Controller-Guided Feedback for Autonomous Racing
具有控制器引导反馈的空间感知自适应轨迹优化用于自主赛车
Abstract
We present a closed-loop framework for autonomous raceline optimization that combines NURBS-based trajectory representation, CMA-ES global trajectory optimization, and controller-guided spatial feedback. Instead of treating tracking errors as transient disturbances, our method exploits them as informative signals of local track characteristics via a Kalman-inspired spatial update. This enables the construction of an adaptive, acceleration-based constraint map that iteratively refines trajectories toward near-optimal performance under spatially varying track and vehicle behavior. In simulation, our approach achieves a 17.38% lap time reduction compared to a controller parametrized with maximum static acceleration. On real hardware, tested with different tire compounds ranging from high to low friction, we obtain a 7.60% lap time improvement without explicitly parametrizing friction. This demonstrates robustness to changing grip conditions in real-world scenarios.
Chinese Translation
我们提出了一种用于自主赛车线优化的闭环框架,该框架结合了基于NURBS的轨迹表示、CMA-ES全局轨迹优化和控制器引导的空间反馈。我们的方法并不将跟踪误差视为瞬态干扰,而是通过受卡尔曼滤波启发的空间更新,将其作为局部赛道特征的信息信号加以利用。这使得构建一个自适应的、基于加速度的约束图成为可能,该约束图能够在空间变化的赛道和车辆行为下,迭代地将轨迹优化至接近最优的性能。在仿真中,与以最大静态加速度参数化的控制器相比,我们的方法使圈速缩短了17.38%。在真实硬件上,使用从高摩擦到低摩擦的不同轮胎配方进行测试,我们在没有显式参数化摩擦的情况下使圈速缩短了7.60%。这证明了该方法在现实场景中对抓地力条件变化的鲁棒性。
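The Kalman-inspired spatial update can be illustrated as a per-segment scalar filter: each track segment's acceleration limit is corrected toward what the controller actually sustained there, weighted by the relative uncertainties. All numbers below are made up for illustration:

```python
def update_limit(prior_acc, prior_var, observed_acc, obs_var):
    """One scalar Kalman-style update of a segment's acceleration limit from
    the acceleration the controller actually achieved there (illustrative)."""
    gain = prior_var / (prior_var + obs_var)
    post_acc = prior_acc + gain * (observed_acc - prior_acc)
    post_var = (1.0 - gain) * prior_var
    return post_acc, post_var

# constraint map: per-track-segment (limit, variance), refined lap after lap
segments = [(9.81, 4.0)] * 3       # start from the static-friction guess
observations = [11.2, 8.5, 9.9]    # accelerations sustained on each segment
segments = [update_limit(a, v, obs, obs_var=1.0)
            for (a, v), obs in zip(segments, observations)]
print([round(a, 2) for a, _ in segments])
```

Segments where the car gripped harder than assumed get a raised limit, and vice versa; the shrinking variance makes later laps trust the map more than new noisy observations.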
cs.RO / 24 / 2602.15684
Estimating Human Muscular Fatigue in Dynamic Collaborative Robotic Tasks with Learning-Based Models
基于学习模型的动态协作机器人任务中人类肌肉疲劳的估计
Abstract
Assessing human muscle fatigue is critical for optimizing performance and safety in physical human-robot interaction (pHRI). This work presents a data-driven framework to estimate fatigue in dynamic, cyclic pHRI using arm-mounted surface electromyography (sEMG). Subject-specific machine-learning regression models (Random Forest, XGBoost, and Linear Regression) predict the fraction of cycles to fatigue (FCF) from three frequency-domain and one time-domain EMG features, and are benchmarked against a convolutional neural network (CNN) that ingests spectrograms of filtered EMG. Framing fatigue estimation as regression (rather than classification) captures continuous progression toward fatigue, supporting earlier detection, timely intervention, and adaptive robot control. In experiments with ten participants, a collaborative robot under admittance control guided repetitive lateral (left-right) end-effector motions until muscular fatigue. Average FCF RMSE across participants was 20.8±4.3% for the CNN, 23.3±3.8% for Random Forest, 24.8±4.5% for XGBoost, and 26.9±6.1% for Linear Regression. To probe cross-task generalization, one participant additionally performed unseen vertical (up-down) and circular repetitions; models trained only on lateral data were tested directly and largely retained accuracy, indicating robustness to changes in movement direction, arm kinematics, and muscle recruitment, while Linear Regression deteriorated. Overall, the study shows that both feature-based ML and spectrogram-based DL can estimate remaining work capacity during repetitive pHRI, with the CNN delivering the lowest error and the tree-based models close behind. The reported transfer to new motion patterns suggests potential for practical fatigue monitoring without retraining for every task, improving operator protection and enabling fatigue-aware shared autonomy for safer fatigue-adaptive pHRI control.
Chinese Translation
评估人类肌肉疲劳对于优化物理人机交互(pHRI)的性能和安全至关重要。本研究提出了一种数据驱动框架,通过臂部表面肌电图(sEMG)估计动态循环pHRI中的疲劳。针对特定受试者的机器学习回归模型(随机森林、XGBoost和线性回归)利用三个频域和一个时域的肌电特征预测疲劳周期占比(FCF),并与输入滤波肌电图谱的卷积神经网络(CNN)进行了基准比较。将疲劳估计建模为回归(而非分类)能够捕捉疲劳的连续进展,支持更早的检测、及时的干预和自适应机器人控制。在十名参与者的实验中,一台导纳控制下的协作机器人引导重复的横向(左右)末端执行器运动,直到肌肉疲劳。参与者的平均FCF均方根误差(RMSE)为:CNN 20.8±4.3%、随机森林23.3±3.8%、XGBoost 24.8±4.5%、线性回归26.9±6.1%。为了探讨跨任务的泛化能力,一名参与者还进行了未见过的垂直(上下)和圆形重复运动;仅在横向数据上训练的模型被直接测试,且大体保持了准确性,表明其对运动方向、手臂运动学和肌肉募集变化具有鲁棒性,而线性回归则表现下降。总体而言,研究表明基于特征的机器学习和基于谱图的深度学习均可以在重复的pHRI中估计剩余工作能力,其中CNN误差最低,树模型紧随其后。向新运动模式的迁移结果表明,有望在无需针对每个任务重新训练的情况下实现实用的疲劳监测,从而加强操作员保护,并实现疲劳感知的共享自主,以获得更安全的疲劳自适应pHRI控制。
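A standard sEMG fatigue marker is the downward drift of the signal's median frequency, the kind of frequency-domain feature the paper's regressors build on. The sketch below generates synthetic windows with a drifting dominant frequency, extracts median frequency, and fits a plain least-squares regressor to FCF; the synthetic data and the OLS model are stand-ins for real sEMG and the paper's RF/XGBoost/LR models:

```python
import numpy as np

def median_frequency(emg_window, fs):
    """Median frequency of an sEMG window; it shifts downward as fatigue builds."""
    spec = np.abs(np.fft.rfft(emg_window)) ** 2
    freqs = np.fft.rfftfreq(len(emg_window), d=1.0 / fs)
    cum = np.cumsum(spec)
    return freqs[np.searchsorted(cum, cum[-1] / 2.0)]

rng = np.random.default_rng(0)
fs, n_cycles = 1000, 40
X, y = [], []
for c in range(n_cycles):
    # synthetic fatigue: dominant frequency drifts from ~120 Hz toward ~60 Hz
    f0 = 120.0 - 60.0 * c / (n_cycles - 1)
    t = np.arange(0, 0.5, 1.0 / fs)
    emg = np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(t.size)
    X.append([median_frequency(emg, fs)])
    y.append(c / (n_cycles - 1))  # fraction of cycles to fatigue (FCF)

# ordinary least squares stands in for the paper's regression models
A = np.hstack([np.asarray(X), np.ones((n_cycles, 1))])
coef, *_ = np.linalg.lstsq(A, np.asarray(y), rcond=None)
rmse = float(np.sqrt(np.mean((A @ coef - np.asarray(y)) ** 2)))
print(rmse < 0.1)
```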
cs.RO / 25 / 2602.15721
Lifelong Scalable Multi-Agent Realistic Testbed and A Comprehensive Study on Design Choices in Lifelong AGV Fleet Management Systems
可扩展的终身多智能体真实测试平台及终身AGV车队管理系统设计选择的综合研究
Abstract
We present Lifelong Scalable Multi-Agent Realistic Testbed (LSMART), an open-source simulator to evaluate any Multi-Agent Path Finding (MAPF) algorithm in a Fleet Management System (FMS) with Automated Guided Vehicles (AGVs). MAPF aims to move a group of agents from their corresponding starting locations to their goals. Lifelong MAPF (LMAPF) is a variant of MAPF that continuously assigns new goals for agents to reach. LMAPF applications, such as autonomous warehouses, often require a centralized, lifelong system to coordinate the movement of a fleet of robots, typically AGVs. However, existing works on MAPF and LMAPF often assume simplified kinodynamic models, such as pebble motion, as well as perfect execution and communication for AGVs. Prior work has presented SMART, a simulator capable of evaluating any MAPF algorithm while considering agent kinodynamics, communication delays, and execution uncertainties. However, SMART is designed for MAPF, not LMAPF. Generalizing SMART to an FMS requires many more design choices. First, an FMS parallelizes planning and execution, raising the question of when to plan. Second, given planners with varying optimality and differing agent-model assumptions, one must decide how to plan. Third, when the planner fails to return valid solutions, the system must determine how to recover. In this paper, we first present LSMART, an open-source simulator that incorporates all these considerations to evaluate any MAPF algorithm in an FMS. We then provide experiment results based on state-of-the-art methods for each design choice, offering guidance on how to effectively design centralized lifelong AGV Fleet Management Systems. LSMART is available at https://smart-mapf.github.io/lifelong-smart.
Chinese Translation
我们提出了可扩展的终身多智能体真实测试平台(Lifelong Scalable Multi-Agent Realistic Testbed, LSMART),这是一个开源模拟器,用于评估任何多智能体路径寻找(Multi-Agent Path Finding, MAPF)算法在自动导引车(Automated Guided Vehicles, AGVs)车队管理系统(Fleet Management System, FMS)中的表现。MAPF旨在将一组智能体从其对应的起始位置移动到目标位置。终身MAPF(Lifelong MAPF, LMAPF)是MAPF的一种变体,它持续为智能体分配新的目标。LMAPF的应用,如自主仓库,通常需要一个集中式的终身系统来协调一组机器人的移动,通常是AGVs。然而,现有的MAPF和LMAPF研究往往假设简化的运动动力学模型,如卵石运动(pebble motion),以及AGVs的完美执行和通信。之前的研究提出了SMART,一种能够评估任何MAPF算法的软件,同时考虑智能体的运动动力学、通信延迟和执行不确定性。然而,SMART是为MAPF设计的,而不是LMAPF。将SMART推广到FMS需要更多的设计选择。首先,FMS实现了规划与执行的并行化,这引发了何时规划的问题。其次,考虑到具有不同最优性和不同智能体模型假设的规划者,必须决定如何进行规划。第三,当规划者未能返回有效解决方案时,系统必须确定如何恢复。在本文中,我们首先介绍了LSMART,一个开源模拟器,结合了所有这些考虑因素,以评估任何MAPF算法在FMS中的表现。然后,我们提供了基于每个设计选择的最先进方法的实验结果,为如何有效设计集中式终身AGV车队管理系统提供指导。LSMART可在 https://smart-mapf.github.io/lifelong-smart 获取。
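Two of the design choices named above, how to plan (fall through a hierarchy of planners of decreasing optimality) and how to recover (emit a safe hold-in-place plan when every planner fails), can be sketched as a dispatch function. The planners below are trivial stand-ins, not real MAPF solvers:

```python
def plan_window(planners, state, goals):
    """'How to plan': try planners from most to least optimal, falling through
    on failure; 'how to recover': an all-stop plan if every planner fails."""
    for plan_fn in planners:
        plan = plan_fn(state, goals)
        if plan is not None:
            return plan
    return {agent: [pos] for agent, pos in state.items()}  # hold in place

def optimal_planner(state, goals):
    return None  # stand-in for an optimal solver that timed out

def greedy_planner(state, goals):
    # stand-in: step each agent one cell toward its goal along x, then y
    plan = {}
    for agent, (x, y) in state.items():
        gx, gy = goals[agent]
        step = (x + (gx > x) - (gx < x), y) if x != gx else (x, y + (gy > y) - (gy < y))
        plan[agent] = [(x, y), step]
    return plan

state = {"agv1": (0, 0), "agv2": (3, 1)}
goals = {"agv1": (2, 0), "agv2": (3, 3)}
plan = plan_window([optimal_planner, greedy_planner], state, goals)
print(plan["agv1"][-1], plan["agv2"][-1])
```

The remaining design choice, when to plan, amounts to deciding how often this function is called relative to execution progress.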
cs.RO / 26 / 2602.15733
MeshMimic: Geometry-Aware Humanoid Motion Learning through 3D Scene Reconstruction
MeshMimic:通过3D场景重建实现几何感知的人形运动学习
Zhang, Qiang, Ma, Jiahao, Liu, Peiran, Shi, Shuai, Su, Zeran, Wang, Zifan, Sun, Jingkai, Cui, Wei, Yu, Jialin, Han, Gang, Zhao, Wen, Sun, Pihai, Yin, Kangning, Wang, Jiaxu, Cao, Jiahang, Zhang, Lingfeng, Cheng, Hao, Hao, Xiaoshuai, Ji, Yiding, Liang, Junwei, Tang, Jian, Xu, Renjing, Guo, Yijie
Abstract
Humanoid motion control has witnessed significant breakthroughs in recent years, with deep reinforcement learning (RL) emerging as a primary catalyst for achieving complex, human-like behaviors. However, the high dimensionality and intricate dynamics of humanoid robots make manual motion design impractical, leading to a heavy reliance on expensive motion capture (MoCap) data. These datasets are not only costly to acquire but also frequently lack the necessary geometric context of the surrounding physical environment. Consequently, existing motion synthesis frameworks often suffer from a decoupling of motion and scene, resulting in physical inconsistencies such as contact slippage or mesh penetration during terrain-aware tasks. In this work, we present MeshMimic, an innovative framework that bridges 3D scene reconstruction and embodied intelligence to enable humanoid robots to learn coupled "motion-terrain" interactions directly from video. By leveraging state-of-the-art 3D vision models, our framework precisely segments and reconstructs both human trajectories and the underlying 3D geometry of terrains and objects. We introduce an optimization algorithm based on kinematic consistency to extract high-quality motion data from noisy visual reconstructions, alongside a contact-invariant retargeting method that transfers human-environment interaction features to the humanoid agent. Experimental results demonstrate that MeshMimic achieves robust, highly dynamic performance across diverse and challenging terrains. Our approach proves that a low-cost pipeline utilizing only consumer-grade monocular sensors can facilitate the training of complex physical interactions, offering a scalable path toward the autonomous evolution of humanoid robots in unstructured environments.
Chinese Translation
近年来,人形运动控制取得了显著突破,深度强化学习(RL)成为实现复杂人类行为的主要催化剂。然而,人形机器人的高维度和复杂动态使得手动运动设计变得不切实际,导致对昂贵的运动捕捉(MoCap)数据的高度依赖。这些数据集不仅获取成本高,而且常常缺乏周围物理环境所需的几何上下文。因此,现有的运动合成框架往往面临运动与场景的解耦,导致在地形感知任务中出现接触滑移或网格穿透等物理不一致性。在本研究中,我们提出了MeshMimic,这一创新框架将3D场景重建与具身智能相结合,使人形机器人能够直接从视频中学习耦合的“运动-地形”交互。通过利用最先进的3D视觉模型,我们的框架精确地分割和重建人类轨迹以及地形和物体的基础3D几何结构。我们引入了一种基于运动学一致性的优化算法,从噪声视觉重建中提取高质量的运动数据,并提出了一种接触不变的重定向方法,将人类与环境交互特征转移到人形代理上。实验结果表明,MeshMimic在多样且具有挑战性的地形上实现了稳健且高度动态的性能。我们的方法证明,仅利用消费级单目传感器的低成本管道可以促进复杂物理交互的训练,为人形机器人在非结构化环境中的自主演化提供了一条可扩展的路径。
cs.RO / 27 / 2602.15767
Robot-Assisted Social Dining as a White Glove Service
机器人辅助社交用餐作为一种白手套服务
Abstract
Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight suggests that such systems should: embody the principles of a white glove service where the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
Chinese Translation
机器人辅助喂食使需要帮助进食的残疾人士能够独立且有尊严地享用餐食。然而,现有系统仅在实验室或家庭环境中进行测试,尚未在真实社交用餐场景(例如餐厅)中进行探索。在此类环境中设计机器人面临独特挑战,例如动态和无监督的用餐环境,机器人需要对其加以考虑并做出响应。通过与残疾人士进行思辨性参与式设计,辅以半结构化访谈和定制的基于人工智能的视觉故事板工具,我们揭示了真实社交用餐的理想场景。我们的关键见解表明,这类系统应当体现白手套服务的原则,其中机器人(1)支持多模态输入和不干扰的输出;(2)具有上下文敏感的社交行为并优先考虑用户;(3)扩展角色超越喂食;(4)适应餐桌上的其他关系。我们的研究对机器人辅助喂食的真实和群体环境具有重要意义。
cs.RO / 28 / 2602.15813
FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy
FAST-EQA:具有全局和局部区域相关性的高效具身问答
Abstract
Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference time during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent's attention, improve scene coverage, and improve answer reliability while running substantially faster than prior approaches. On HMEQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, while performing competitively on OpenEQA and MT-HM3D.
Chinese Translation
具身问答(EQA)结合了视觉场景理解、目标导向探索以及部分可观测条件下的空间和时间推理。一个核心挑战是将物理搜索限制在与问题相关的子空间内,同时保持对观察结果的紧凑、可操作的记忆。此外,在实际部署中,探索过程中的快速推理时间至关重要。我们提出了FAST-EQA,一个基于问题条件的框架,它(i)识别可能的视觉目标,(ii)对全局感兴趣区域进行评分以指导导航,以及(iii)在视觉记忆上采用思维链(Chain-of-Thought, CoT)推理以自信地作答。FAST-EQA维护一个有界的场景记忆,存储固定容量的区域-目标假设集并在线更新,从而能够稳健地处理单目标和多目标问题,而不会出现无限增长。为了高效扩展覆盖范围,全局探索策略将狭窄开口和门视为高价值前沿,以最小的计算量补充局部目标搜索。这些组件共同聚焦智能体的注意力,提高场景覆盖率,并提升答案的可靠性,同时运行速度显著快于先前的方法。在HMEQA和EXPRESS-Bench上,FAST-EQA达到了最先进的性能,同时在OpenEQA和MT-HM3D上表现出竞争力。
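The bounded scene memory can be illustrated with a fixed-capacity min-heap of region-target hypotheses: online updates evict the weakest hypothesis instead of growing without bound. Class and field names are illustrative, not the paper's data structure:

```python
import heapq

class BoundedSceneMemory:
    """Fixed-capacity set of (score, region, target) hypotheses: inserting past
    capacity evicts the weakest, so memory never grows unboundedly."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # min-heap keyed by relevance score

    def update(self, score, region, target):
        item = (score, region, target)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict the weakest hypothesis

    def best(self):
        return max(self.heap)

mem = BoundedSceneMemory(capacity=2)
mem.update(0.4, "kitchen", "mug")
mem.update(0.9, "office", "laptop")
mem.update(0.7, "hallway", "plant")   # evicts the 0.4 hypothesis
print(sorted(r for _, r, _ in mem.heap), mem.best()[1])
```

Both insertion paths are O(log k) for capacity k, which is what keeps memory maintenance cheap during long explorations.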
cs.RO / 29 / 2602.15827
Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
感知型类人跑酷:通过运动匹配链式组合动态人类技能
Abstract
While recent advances in humanoid locomotion have achieved stable walking on varied terrains, capturing the agility and adaptivity of highly dynamic human motions remains an open challenge. In particular, agile parkour in complex environments demands not only low-level robustness, but also human-like motion expressiveness, long-horizon skill composition, and perception-driven decision-making. In this paper, we present Perceptive Humanoid Parkour (PHP), a modular framework that enables humanoid robots to autonomously perform long-horizon, vision-based parkour across challenging obstacle courses. Our approach first leverages motion matching, formulated as nearest-neighbor search in a feature space, to compose retargeted atomic human skills into long-horizon kinematic trajectories. This framework enables the flexible composition and smooth transition of complex skill chains while preserving the elegance and fluidity of dynamic human motions. Next, we train motion-tracking reinforcement learning (RL) expert policies for these composed motions, and distill them into a single depth-based, multi-skill student policy, using a combination of DAgger and RL. Crucially, the combination of perception and skill composition enables autonomous, context-aware decision-making: using only onboard depth sensing and a discrete 2D velocity command, the robot selects and executes whether to step over, climb onto, vault or roll off obstacles of varying geometries and heights. We validate our framework with extensive real-world experiments on a Unitree G1 humanoid robot, demonstrating highly dynamic parkour skills such as climbing tall obstacles up to 1.25m (96% robot height), as well as long-horizon multi-obstacle traversal with closed-loop adaptation to real-time obstacle perturbations.
Chinese Translation
尽管近年来类人机器人运动的进展已实现了在多种地形上的稳定行走,但捕捉高度动态人类动作的灵活性和适应性仍然是一个开放的挑战。尤其是在复杂环境中的灵巧跑酷,不仅需要低层次的鲁棒性,还需要类人的运动表现力、长时程技能组合和基于感知的决策能力。本文提出了感知型类人跑酷(Perceptive Humanoid Parkour,PHP),这是一个模块化框架,使类人机器人能够自主地在具有挑战性的障碍赛道上执行基于视觉的长时程跑酷。我们的方法首先利用运动匹配(表述为特征空间中的最近邻搜索),将重定向后的原子人类技能组合成长时程的运动学轨迹。该框架能够灵活组合复杂的技能链并实现平滑过渡,同时保持动态人类动作的优雅和流畅性。接下来,我们为这些组合动作训练运动跟踪强化学习(RL)专家策略,并使用DAgger和RL的结合将其蒸馏为单一的基于深度图像的多技能学生策略。关键在于,感知与技能组合的结合使自主的、上下文感知的决策成为可能:机器人仅使用机载深度传感器和离散的二维速度命令,即可选择并执行跨过、攀上、撑越或翻滚下不同几何形状和高度的障碍物。我们通过在Unitree G1类人机器人上进行广泛的真实世界实验验证了我们的框架,展示了高度动态的跑酷技能,例如攀爬高达1.25米(机器人身高的96%)的障碍物,以及对实时障碍扰动进行闭环适应的长时程多障碍物穿越。
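Motion matching formulated as nearest-neighbor search in a feature space admits a very small sketch: given query features built from the robot's state and the terrain ahead, pick the closest atomic skill clip. The two-dimensional features here are toy stand-ins for the richer pose, velocity, and obstacle cues a real system would use:

```python
import numpy as np

def motion_match(query_features, clip_features, clip_names):
    """Motion matching as nearest-neighbor search: pick the atomic skill clip
    whose feature vector is closest to the query (illustrative)."""
    q = np.asarray(query_features)
    dists = np.linalg.norm(np.asarray(clip_features) - q, axis=1)
    i = int(np.argmin(dists))
    return clip_names[i], float(dists[i])

# toy features: [forward speed, obstacle height ahead] -- illustrative only
clips = {
    "walk":  [1.2, 0.0],
    "vault": [2.0, 0.8],
    "climb": [0.8, 1.2],
}
names, feats = list(clips), list(clips.values())
skill, d = motion_match([1.9, 0.7], feats, names)
print(skill)
```

Calling this every transition point chains clips into a long-horizon kinematic trajectory, which is then handed to the tracking policies for physical execution.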
cs.RO / 30 / 2602.15828
Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
Dex4D:任务无关的点轨迹策略用于仿真到现实的灵巧操控
Abstract
Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
Chinese Translation
学习能够完成众多日常任务的通用策略仍然是灵巧操控中的一个开放挑战。特别是,通过真实世界的遥操作收集大规模操控数据既昂贵又难以扩展。虽然在仿真中学习提供了一种可行的替代方案,但为训练设计多个特定任务的环境和奖励同样具有挑战性。我们提出了Dex4D,一个转而利用仿真来学习任务无关灵巧技能的框架,这些技能可以灵活地重新组合以执行多样的现实世界操控任务。具体而言,Dex4D学习了一种领域无关的、以3D点轨迹为条件的策略,能够将任何物体操控到任何所需姿态。我们在仿真中于数千个具有不同姿态配置的物体上训练这种"任意姿态到任意姿态"的策略,覆盖了可在测试时进行组合的广泛机器人-物体交互空间。在部署时,该策略无需微调即可零样本迁移到现实世界任务,只需用从生成视频中提取的期望物体中心点轨迹对其进行提示。在执行过程中,Dex4D使用在线点跟踪进行闭环感知和控制。在仿真和真实机器人上的大量实验表明,我们的方法使多样的灵巧操控任务的零样本部署成为可能,并相较于先前的基线方法取得了一致的改进。此外,我们展示了对新物体、场景布局、背景和轨迹的强泛化能力,突显了所提框架的鲁棒性和可扩展性。