V Team; Hong, Wenyi; Gu, Xiaotao; Pan, Ziyang; Yang, Zhen; Wang, Yuting; Wang, Yue; Yue, Yuanchang; Wang, Yu; Wang, Yanling; Wang, Yan; Liu, Xijun; Yu, Wenmeng; Wang, Weihan; Li, Wei; Duan, Shuaiqi; Yang, Sheng; Lv, Ruiliang; Liu, Mingdao; Pan, Lihang; Ning, Ke; Ji, Junhui; Wang, Jinjiang; Chen, Jing; Xu, Jiazheng; Zhu, Jiale; Cheng, Jiale; Qi, Ji; Gan, Guobing; Wang, Guo; Yao, Cong; Dou, Zijun; Zhou, Zihao; Wang, Zihan; Ge, Zhiqi; Li, Zhijie; Hou, Zhenyu; Xue, Zhao; Wang, Zehui; He, Zehai; Liu, Yusen; Cen, Yukuo; Li, Yuchen; Wang, Yuan; Lu, Yijian; Wang, Yanzi; Xue, Yadong; Zhang, Xinyu; Liu, Xinyu; Li, Wenkai; Tong, Tianyu; Zhang, Tianshu; Yan, Shengdong; Zheng, Qinkai; Xu, Mingde; Bao, Licheng; Xu, Jiaxing; Fan, Jiaxin; Qian, Jiawen; Chen, Jiali; Lin, Jiahui; Zheng, Haozhi; Wang, Haoran; Li, Haochen; Yang, Fan; Zhang, Dan; Zhao, Chuangxin; Wu, Chengcheng; Shi, Boyan; Jia, Bowei; Wang, Baoxu; Zhang, Peng; Liu, Debing; Xu, Bin; Li, Juanzi; Huang, Minlie; Dong, Yuxiao; Tang, Jie
Abstract
We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.