Mubashara Akhtar, Anka Reuel, Prajna Soni, Sanchit Ahuja, Pawan Sasanka Ammanamanchi, Ruchit Rawal, Vilém Zouhar, Srishti Yadav, Chenxi Whitehouse, Dayeon Ki, Jennifer Mickel, Leshem Choshen, Marek Šuppa, Jan Batzner, Jenny Chim, Jeba Sania, Yanan Long, Hossein A. Rahmani, Christina Knight, Yiyang Nan, Jyoutir Raj, Yu Fan, Shubham Singh, Subramanyam Sahoo, Eliya Habba, Usman Gohar, Siddhesh Pawar, Robert Scholz, Arjun Subramonian, Jingwei Ni, Mykel Kochenderfer, Sanmi Koyejo, Mrinmaya Sachan, Stella Biderman, Zeerak Talat, Avijit Ghosh, Irene Solaiman
Abstract
Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, which diminishes their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how these properties contribute to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, keeping test data private rather than public shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.