Chu, Meng; Zhang, Xuan Billy; Lin, Kevin Qinghong; Kong, Lingdong; Zhang, Jize; Tu, Teng; Ma, Weijian; Huang, Ziqi; Yang, Senqiao; Huang, Wei; Jin, Yeying; Rao, Zhefan; Ye, Jinhui; Lin, Xinyu; Zhang, Xichen; Hu, Qisheng; Yang, Shuai; Shen, Leyang; Chow, Wei; Dong, Yifei; Wu, Fengyi; Long, Quanyu; Xia, Bin; Yu, Shaozuo; Zhu, Mingkang; Zhang, Wenhu; Huang, Jiehui; Gui, Haokun; Che, Haoxuan; Chen, Long; Chen, Qifeng; Zhang, Wenxuan; Wang, Wenya; Qi, Xiaojuan; Deng, Yang; Li, Yanwei; Shou, Mike Zheng; Cheng, Zhi-Qi; Ng, See-Kiong; Liu, Ziwei; Torr, Philip; Jia, Jiaya
Abstract
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
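The three capability levels can be made concrete with a toy sketch. The Python below is only an illustration under strong assumptions, not the authors' method: the state and action are scalars, the "domain law" is a simple clamp, and every name (`Predictor`, `Simulator`, `Evolver`, `gain`, `tolerance`) is hypothetical. It shows an L1 one-step operator, an L2 rollout that composes it while enforcing a law, and an L3 wrapper that revises the model when a prediction disagrees with new evidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Predictor:
    """L1 (toy): a one-step local transition operator s' = s + gain * a."""
    gain: float  # hypothetical model parameter

    def step(self, state: float, action: float) -> float:
        return state + self.gain * action


@dataclass
class Simulator:
    """L2 (toy): composes one-step predictions into an action-conditioned
    rollout, projecting every state back onto the set allowed by a law."""
    predictor: Predictor
    law: Callable[[float], float]  # assumed: maps any state to a lawful one

    def rollout(self, state: float, actions: Sequence[float]) -> List[float]:
        trajectory = [state]
        for action in actions:
            state = self.law(self.predictor.step(state, action))
            trajectory.append(state)
        return trajectory


@dataclass
class Evolver:
    """L3 (toy): revises the underlying model when a prediction fails
    against an observed transition."""
    simulator: Simulator
    tolerance: float = 0.1  # hypothetical threshold for "prediction failed"

    def observe(self, state: float, action: float, observed_next: float) -> bool:
        predicted = self.simulator.predictor.step(state, action)
        error = observed_next - predicted
        if abs(error) <= self.tolerance or action == 0:
            return False  # evidence agrees with the model; keep it
        # Refit the single parameter so this transition is explained exactly.
        self.simulator.predictor.gain += error / action
        return True


def clamp_0_10(state: float) -> float:
    """Toy domain law: the state must stay inside [0, 10]."""
    return min(max(state, 0.0), 10.0)


simulator = Simulator(Predictor(gain=1.0), law=clamp_0_10)
trajectory = simulator.rollout(0.0, [4.0, 4.0, 4.0])  # [0.0, 4.0, 8.0, 10.0]

evolver = Evolver(simulator)
revised = evolver.observe(state=0.0, action=2.0, observed_next=1.0)  # True
```

In this sketch the third rollout step would reach 12.0 without the law and is clamped to 10.0, which is the sense in which the L2 rollout "respects" a constraint. The L3 update is deliberately trivial (a single-parameter refit); a real system at that level would revise far richer structure.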