Peng, Xueqing, Xie, Zhuohan, Cao, Yupeng, Li, Haohang, Qian, Lingfei, Wang, Yan, Zhang, Vincent Jim, He, Huan, Ai, Xuguang, Ma, Linhai, Xiang, Ruoyu, He, Yueru, Han, Yi, Wang, Shuyao, Guo, Yuqing, Jiang, Mingyang, Zhao, Yilun, Dong, Youzhong, Wang, Xiaoyu, Chen, Yankai, Yuan, Ye, Zhang, Qiyuan, Lyu, Fuyuan, Wu, Haolun, Yang, Yonghan, Zhao, Zichen, Dai, Yuyang, Zhang, Fan, Elbadry, Rania, Gull, Ayesha, Safder, Muhammad Usman, Chen, Nuo, Zhu, Fengbin, Cai, Tianshi, Wang, Zimu, Giannouris, Polydoros, Jiang, Yuechen, Liu, Zhiwei, Kabir, Mohsinul, Wang, Yuyan, Zheng, Yixiang, Yu, Yangyang, Liu, Weijin, Cao, Wenbo, Xu, Anke, Lu, Peng, Huang, Jerry, Mo, Fengran, Lin, Mingquan, Tiwari, Prayag, Zhao, Yijia, Basulto, Victor Gutierrez, Liu, Xiao-Yang, Smith, Kaleb E, Pei, Jiahuan, Cohan, Arman, Huang, Jimin, Tang, Yuehua, Lopez-Lira, Alejandro, Chen, Xi, Liu, Xue, Tsujii, Junichi, Nie, Jian-Yun, Ananiadou, Sophia
Abstract
As AI agents improve, the central question is no longer whether they can solve isolated, well-defined financial tasks, but whether they can reliably carry out professional financial work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence, spanning four representative workflows: Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. We find that frontier agents perform relatively well on Trading and Market Insights but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents: turning financial reasoning into dependable execution of high-stakes financial workflows.