Zhang, Tianle; Yuan, Zhihao; Chi, Dafeng; Liu, Peidong; Li, Dongwei; Hu, Kejun; Zhang, Likui; Nie, Junnan; Wei, Ziming; Chen, Zengjue; Tang, Yili; Li, Jiayi; Xiang, Zhiyuan; Li, Mingyang; Luo, Tianci; Wan, Hanwen; Li, Ao; Zhai, Linbo; Zhan, Zhihao; Zhuang, Yuzheng; Lin, Liang; Bai, Xiaodong; Cai, Jiakun; Cao, Peng; Chen, Kangliang; Chen, Siang; Dai, Yixiang; Di, Shuai; Duan, Nan; Gong, Yicheng; Gui, Chenguang; Guo, Yucheng; Hao, Peng; He, Qingrong; Huang, Haoyang; Huang, Kunrui; Huang, Zhixuan; Jin, Shibo; Jin, Yixiang; Li, Anson; Li, Dongjiang; Li, Jiawei; Li, Ruodai; Li, Yihang; Li, Yuzhen; Liang, Jiaming; Liu, Fangsheng; Long, Jing; Luo, Mingxi; Pan, Xing; Shen, Hui; Tian, Xiaomeng; Wang, Daming; Wang, Song; Xiong, Junwu; Xu, Hang; Xu, Wanting; Yu, Zhengcheng; Zhang, He; Zhang, Jiyao; Zhao, Lin; Zhou, Chen
Abstract
Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, and the large differences between robot embodiments impede the transfer of behavioral knowledge. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA introduces a multi-source, multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. By training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods on both simulation and real-world benchmarks, especially on diverse tasks that demand generalization.