Jia, Yufei; Zhang, Heng; Zhang, Ziheng; Wu, Junzhe; Yu, Mingrui; Wang, Zifan; Jiang, Dixuan; Li, Zheng; Cao, Chenyu; Yu, Zhuoyuan; Yang, Xun; Ge, Haizhou; Zhang, Yuchi; Zhang, Jiayuan; Huang, Zhenbiao; Liu, Tianle; Chen, Shenyu; Wang, Jiacheng; Xie, Bin; Yao, Xuran; Deng, Xiwa; Wang, Guangyu; Zhang, Jinzhi; Hao, Lei; Chen, Zhixing; Chen, Yuxiang; Wang, Anqi; Tian, Hongyun; Yan, Yiyi; Cao, Zhanxiang; Jiang, Yizhou; Shao, Hanyang; Li, Yue; Shi, Lu; Chen, Bokui; Sui, Wei; Cui, Hanqing; Qin, Yusen; Huang, Ruqi; Han, Lei; Wang, Tiancai; Zhou, Guyue
Abstract
Embodied AI research is shifting toward vision-centric perception. Massively parallel simulators have catalyzed breakthroughs in proprioception-based locomotion, but their potential remains largely untapped for vision-informed tasks because large-scale photorealistic rendering is computationally prohibitive. In addition, creating simulation-ready 3D assets relies heavily on labor-intensive manual modeling, and the significant sim-to-real physical gap hinders the transfer of contact-rich manipulation policies. To address these bottlenecks, we propose GS-Playground, a multi-modal simulation framework designed to accelerate end-to-end perceptual learning. We develop a high-performance parallel physics engine built to integrate with a batch 3D Gaussian Splatting (3DGS) rendering pipeline and to ensure high-fidelity synchronization. Our system achieves a throughput of 10^4 FPS at 640x480 resolution, significantly lowering the barrier to large-scale visual reinforcement learning (RL). We also introduce an automated Real2Sim workflow that reconstructs photorealistic, physically consistent, and memory-efficient environments, streamlining the generation of complex simulation-ready scenes. Extensive experiments on locomotion, navigation, and manipulation demonstrate that GS-Playground effectively bridges the perceptual and physical gaps across diverse embodied tasks. Project homepage: https://gsplayground.github.io.