AI solutions were graded by the official judges, using the same criteria as were applied to human solutions.
This passage shows that at the 2025 IMO, AI solutions were held to the same grading criteria as human solutions, an important shift in how AI is evaluated. It is a data point on how an existing expert evaluation system can be leveraged to build a more rigorous benchmark.
AI Village gives multiple AI agents their own computer environments and a shared group chat, then tasks them with open-ended real-world goals like fundraising, organizing events, making games, and gaining subscribers.
This case shows open-world evaluation in practice. At roughly $50,000 per year, it requires a substantial resource commitment. Compared with traditional benchmarks, this style of evaluation comes much closer to real application scenarios, but for the same reason it is more expensive and harder to run at scale.
The volume of open-world evaluations has increased dramatically in recent months.
The article gives no specific growth percentage, but the "increased dramatically" phrasing suggests open-world evaluation is becoming a new trend in the field. The growth likely reflects both a deepening recognition of traditional benchmarks' limitations and AI capabilities reaching a stage that demands more complex evaluation methods.
We plan to release new evaluations every 1–2 months.
This cadence suggests the CRUX project intends to run evaluations on a regular cycle: a release every one to two months is frequent enough to capture rapid changes in AI capability, yet not so frequent that evaluation quality suffers. It is also much faster than the update cycle of traditional AI benchmarks, reflecting how quickly current AI systems iterate.
_Self-reported score with custom Anthropic scaffold._ SWEPro was evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with maximum thinking effort, due to frequent timeouts during our evaluation trials.
Footnote 2 reveals an important data point: Opus 4.6's score of 53.4 is Anthropic's self-reported number, used because the authors hit frequent timeouts during their own evaluation runs and could not verify it independently. This introduces a data-reliability caveat into the performance comparison: the Opus result rests on vendor-reported figures and may carry bias.
The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and one for non-reasoning models.
This model-selection result (best on all three metrics) indicates that splitting models into reasoning and non-reasoning classes yields the strongest predictive model. That is notable statistical evidence that reasoning capability may be a key driver of accelerating AI progress. However, the article does not spell out how a "reasoning model" is defined, which could affect the reliability of the result.
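The model comparison described above can be sketched numerically: fit one pooled linear trend and a pair of per-class trends, then compare them with an information criterion that penalizes the extra parameters. Everything below, including the data values, the `fit_linear` helper, and the use of BIC, is an illustrative assumption rather than the article's actual procedure.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares line fit; returns (slope, intercept, sum of squared residuals)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return slope, intercept, float(np.sum(resid ** 2))

# Hypothetical data: release date (years since a reference date), a capability
# score, and a flag marking reasoning models.
dates = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
scores = np.array([10.0, 12.0, 14.0, 30.0, 16.0, 42.0, 20.0, 55.0])
is_reasoning = np.array([False, False, False, True, False, True, False, True])

# Single pooled trend vs. a pair of independent trends.
_, _, sse_pooled = fit_linear(dates, scores)
_, _, sse_r = fit_linear(dates[is_reasoning], scores[is_reasoning])
_, _, sse_nr = fit_linear(dates[~is_reasoning], scores[~is_reasoning])
sse_split = sse_r + sse_nr

n = len(dates)
# BIC for Gaussian residuals: n*log(SSE/n) + k*log(n), k = number of fitted params.
bic_pooled = n * np.log(sse_pooled / n) + 2 * np.log(n)
bic_split = n * np.log(sse_split / n) + 4 * np.log(n)
print(bic_split < bic_pooled)  # True when the two-trend model is preferred
```

The two-trend model always fits at least as well in raw error; the BIC penalty checks that the improvement justifies doubling the parameter count.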
We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.
The study uses four distinct AI capability metrics, which strengthens the reliability of its results. Each measures capability along a different dimension: overall capability (ECI), task time horizon (METR), mathematical ability (Combined Math), and performance on unusual tasks (WeirdML). This multi-metric approach reduces the risk of bias from relying on any single measure.
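One simple way to combine heterogeneous metrics like these into a single score is to z-score each metric across models and average. The model names, score values, and the averaging scheme below are hypothetical illustrations, not how ECI or the other indices are actually constructed.

```python
import statistics

# Hypothetical per-model scores on four capability metrics (illustrative
# values only, not real ECI / METR / math / WeirdML numbers).
metrics = {
    "eci":        {"model_a": 120.0, "model_b": 135.0, "model_c": 150.0},
    "metr_hours": {"model_a": 0.5,   "model_b": 1.5,   "model_c": 4.0},
    "math_index": {"model_a": 40.0,  "model_b": 55.0,  "model_c": 70.0},
    "weirdml_v2": {"model_a": 30.0,  "model_b": 45.0,  "model_c": 50.0},
}

def zscores(values):
    """Standardize a {model: score} mapping to zero mean, unit variance."""
    mu = statistics.mean(values.values())
    sd = statistics.pstdev(values.values())
    return {k: (v - mu) / sd for k, v in values.items()}

# Average z-score across metrics gives a crude combined index per model.
models = ["model_a", "model_b", "model_c"]
per_metric_z = {m: zscores(vals) for m, vals in metrics.items()}
combined = {mod: statistics.mean(per_metric_z[m][mod] for m in metrics)
            for mod in models}
```

Z-scoring puts metrics with different units (index points, hours) on a common scale before averaging, which is why a multi-metric composite can be more robust than any single raw score.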
benchmarks sourced from publicly available material carry contamination risk, where training-data exposure can silently inflate scores.
Public datasets are widely treated as the gold standard for AI evaluation, assumed to provide an objective and fair testing environment. The authors warn, however, that benchmarks built from publicly available material carry contamination risk: exposure in training data can silently inflate scores. This challenges standard practice in the field and suggests the need for stricter data-isolation measures, or a shift toward private datasets for evaluation.
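A common, if crude, way to estimate this contamination risk is n-gram overlap between a benchmark item and candidate training text. The sketch below is a minimal illustration of that idea; the `contamination_score` name, the 8-gram window, and whitespace tokenization are all assumptions.

```python
def ngrams(text, n=8):
    """Set of word n-grams from whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_score(benchmark_item, training_doc, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)
```

A score near 1.0 suggests the item appears nearly verbatim in the training material; real contamination audits use larger windows, normalization, and fuzzy matching, but the principle is the same.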
SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories
The conventional view treats AI research datasets as static, one-off collections; the authors instead propose a "living dataset" whose contents are continually refreshed to reflect real usage. This challenges the reliance on static benchmarks in traditional AI evaluation and argues for dynamic, ongoing data collection.
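A continual-collection pipeline of this kind needs, at minimum, deduplication so that repeated crawls do not re-add the same session. The sketch below shows that core idea; the class name, hashing scheme, and in-memory storage are hypothetical, not SWE-chat's actual pipeline.

```python
import hashlib

class LivingDataset:
    """Sketch of continual collection: sessions are ingested as they are
    discovered, and a content hash deduplicates repeat fetches."""

    def __init__(self):
        self.sessions = []
        self._seen = set()

    def ingest(self, session_text):
        digest = hashlib.sha256(session_text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False  # already collected on a previous crawl
        self._seen.add(digest)
        self.sessions.append(session_text)
        return True

ds = LivingDataset()
ds.ingest("session 1: fix flaky test")
ds.ingest("session 1: fix flaky test")  # duplicate, skipped
ds.ingest("session 2: refactor parser")
print(len(ds.sessions))  # 2
```

Hashing full session text catches only exact repeats; a production pipeline would also track source URLs and near-duplicates, but content hashing is the usual first line of defense.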
Strategic, cost-efficient evidence-building relies on strong data governance that facilitates the access, protection, and use of program and other administrative data to enable and support secondary uses, including for
Since it was founded by longtime charity executive Pierre Barnoti as the international offshoot of a Montreal animal welfare charity, SPCAI has spent little more than 20 percent of its total revenue on actual programs and services that help animals.