Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang
Abstract
State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This yields a more than 2x improvement in generalization to new tasks and environments over state-of-the-art VLAs in real-robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7 Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen-task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.