Kim, Dongyoung; Jang, Huiwon; Koo, Myungkyu; Jang, Suhyeok; Kim, Taeyoung; Kim, Beomjun; Yoon, Byungjun; Jang, Changsung; Choi, Daewon; Han, Dongsu; Lee, Donguk; Kwon, Heeseung; Jeon, Hojin; Kang, Jaehyun; Bae, Jaekyoung; Lee, Jihyuk; Lee, Jimin; Won, John; Ahn, Joonwoo; Park, Junhyeong; Sung, Junyoung; Lee, Kyungmin; Han, Minseong; Yoon, Minsung; Joo, Sejune; Son, Seonil; Park, Seungcheol; Cho, Seunggeun; Moon, Seungjun; Kim, Seungku; Dong, Yonghoon; Cho, Yongjin; Kim, Youngchan; Kim, Chang Hwan; Kim, Dohyeon; Lee, Hazel; Kim, Heecheol; Ahn, Hensen; Ryu, Hyungkyu; Choi, Hyunsoo; Shin, Hyunsoo; Jung, Jaeheon; Kim, Jaewoo; Kim, Jinwook; Chang, Joochul; Kim, Joonsoo; Park, Junghun; Park, Jungwoo; Cho, Junho; Park, Junhyeok; Lee, Junwon; Lee, Kangwook; Kim, Kwanghoon; Choe, Kyoungwhan; Bhadu, Manoj; Oh, Nayoung; Kim, Sangjun; Kim, Sangwoo; Shim, Seunghoon; Kim, Seunghyun; Lee, Seungjun; Ka, Seungyup; Yang, Sungryol; Jung, Wook; Shukla, Yashu; Lee, Yeonjae; Bae, Yeonwoo; Shin, Jinwoo
Abstract
Vision-Language-Action models (VLAs) have made remarkable progress toward human-like generalist robotic policies by inheriting versatile intelligence (i.e., broad scene understanding and language-conditioned generalization) from pre-trained Vision-Language Models. However, they still struggle with complex real-world tasks that require broader functional capabilities (e.g., motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g., $\pi_{0.5}$ and GR00T N1.6) on both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, on ALLEX humanoid tasks RLDX-1 achieves a success rate of 86.8%, whereas $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting its ability to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
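The phrase "modality-specific streams with cross-modal joint self-attention" can be made concrete with a small sketch. The code below is not the MSAT implementation, whose details the abstract does not give. It is a minimal single-head illustration, assuming each stream keeps its own query, key, value, and output projections while attention runs over the concatenation of all streams' tokens. The stream names (`vision_language`, `tactile`, `action`), the dimensions, and all helper functions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_stream(dim, rng):
    # Hypothetical per-stream parameters: each modality owns its projections.
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v", "o")}


def joint_attention(tokens_by_stream, params_by_stream):
    """Project each stream with its own weights, attend over all tokens jointly."""
    q, k, v, spans = [], [], [], {}
    for name, x in tokens_by_stream.items():
        p = params_by_stream[name]
        start = len(q)
        q += matmul(x, p["q"])
        k += matmul(x, p["k"])
        v += matmul(x, p["v"])
        spans[name] = (start, len(q))  # remember where this stream's tokens sit

    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # every token scores every token, across streams
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the jointly attended tokens back into their streams and apply
    # each stream's own output projection.
    out = {}
    for name, (start, end) in spans.items():
        out[name] = matmul(mixed[start:end], params_by_stream[name]["o"])
    return out, weights


rng = random.Random(0)
dim = 4
tokens = {
    "vision_language": random_matrix(3, dim, rng),  # assumed stream
    "tactile": random_matrix(2, dim, rng),          # assumed stream
    "action": random_matrix(5, dim, rng),           # assumed stream
}
params = {name: init_stream(dim, rng) for name in tokens}
out, weights = joint_attention(tokens, params)
```

In this sketch the only point where modalities interact is the shared attention matrix. Everything else stays stream-specific, which is one common way to combine heterogeneous inputs without forcing them through shared projection weights.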