Whatever is precise enough to benchmark is also precise enough to optimize for.
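Most people assume AI capabilities can be improved by continually refining evaluation standards, but the author argues that precise evaluation methods are themselves easy for systems to optimize against and game, so they cannot truly test an AI's abilities in the real world. The view is counterintuitive because it challenges a basic assumption of the AI evaluation field.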
WorldMark establishes a standardized benchmark for evaluating interactive video generation models with unified controls, identical scenarios, and comprehensive evaluation metrics across multiple model architectures.
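WorldMark's core contribution is a standardized benchmark for evaluating interactive video generation models, which makes fair comparison across different model architectures possible.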
Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.
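Most people might assume that all AI capability metrics accelerate in step, but the authors found that the WeirdML V2 metric shows no sign of acceleration: a simple global linear trend still fits best. This finding suggests that acceleration in AI capability is not universal but specific to certain task domains.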
Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.
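This metric, one of the four (25%), showed no acceleration trend and provides an important contrast case. The authors speculate that this may be because WeirdML V2 imposes a resource-constrained environment (models get only five attempts to submit code and cannot use external tools), which does not match the focus of current RL training. This suggests that AI progress may depend heavily on the test environment and the evaluation standard.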
WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.
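Most people might assume that all AI evaluation metrics reflect the same trend of progress, but the study found that the WeirdML V2 metric shows no acceleration, because it imposes a resource-constrained environment that recent reinforcement learning training has not targeted. This suggests that measured AI progress may be shaped by the evaluation method.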
Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.
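This metric, one of the four (25%), showed no acceleration, which indicates that acceleration in AI capability may not apply universally. The unusual WeirdML V2 environment (resource-constrained, no external tools) may explain the difference, but it also hints that acceleration may be concentrated in particular domains, especially those where correctness is easy to verify automatically.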
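One way to check whether "a single global linear trend fit the data best" is to compare a single least-squares line against the best two-segment fit and penalize the extra parameters. The sketch below is only an illustration of that idea: the BIC-based comparison, the function names, and the data are assumptions, not the authors' method or WeirdML V2 results. The demo data has a deliberate kink so that the segmented model wins; for an index with no acceleration, the single line would be expected to score better.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def best_two_segment(xs, ys, min_points=3):
    """Try every split (xs sorted) and fit two independent lines.

    Returns (split_index, total_sse) for the split with the lowest error.
    """
    best_split, best_sse = None, math.inf
    for i in range(min_points, len(xs) - min_points + 1):
        _, _, left = fit_line(xs[:i], ys[:i])
        _, _, right = fit_line(xs[i:], ys[i:])
        if left + right < best_sse:
            best_split, best_sse = i, left + right
    return best_split, best_sse


def bic(sse, n, k):
    """Gaussian-residual BIC; lower is better. k counts fitted parameters."""
    return n * math.log(max(sse, 1e-12) / n) + k * math.log(n)


# Hypothetical index values (NOT real benchmark data): slope 1, then slope 5.
xs = list(range(12))
ys = [float(x) if x < 6 else 6.0 + 5.0 * (x - 6) for x in xs]

_, _, linear_sse = fit_line(xs, ys)
split, segmented_sse = best_two_segment(xs, ys)

linear_bic = bic(linear_sse, len(xs), k=2)        # intercept + slope
segmented_bic = bic(segmented_sse, len(xs), k=5)  # two lines + breakpoint

print("linear BIC:", linear_bic)
print("two-segment BIC:", segmented_bic)
print("acceleration suggested:", segmented_bic < linear_bic)
```

A two-segment fit can never have a higher error than a single line, so raw error alone always favors it; the parameter penalty is what lets the simpler model win when the data really are linear.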
We select the median-difficulty question from the set with maximum model coverage and standardize it to 0.
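When constructing the math index, the researchers select the median-difficulty question from the set with maximum model coverage and standardize it to 0. This is a key statistical step that places benchmarks with different difficulties and scoring on a common scale, so that the performance of different models can be compared directly.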
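The anchoring step can be sketched as follows. This is a minimal illustration that assumes difficulty estimates already exist on some arbitrary scale and that "standardize it to 0" means a pure shift (no rescaling); the excerpt does not say which. The function name, the data layout, and all numbers are hypothetical.

```python
def anchor_to_median(difficulty, coverage):
    """Shift all difficulties so the anchor question sits at 0.

    difficulty: {set_name: {question_id: difficulty_estimate}}
    coverage:   {set_name: number_of_models_evaluated}
    """
    # Pick the question set that the most models were evaluated on.
    anchor_set = max(coverage, key=coverage.get)

    # Take its median-difficulty question (lower median for even counts).
    ranked = sorted(difficulty[anchor_set].items(), key=lambda kv: kv[1])
    anchor_id, anchor_value = ranked[(len(ranked) - 1) // 2]

    # Subtract the anchor's difficulty everywhere so it becomes the zero point.
    shifted = {
        set_name: {qid: d - anchor_value for qid, d in questions.items()}
        for set_name, questions in difficulty.items()
    }
    return anchor_set, anchor_id, shifted


# Hypothetical inputs, for illustration only.
coverage = {"set_a": 10, "set_b": 25}
difficulty = {
    "set_a": {"a1": 1.5},
    "set_b": {"b1": -1.0, "b2": 0.5, "b3": 2.0},
}

anchor_set, anchor_id, shifted = anchor_to_median(difficulty, coverage)
print(anchor_set, anchor_id, shifted)
```

Anchoring on the best-covered set means the zero point is defined by the question with the most evidence behind its difficulty estimate.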
A senior engineer to own and evolve the game engine and real-time play infrastructure behind the ARC-AGI series.
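Most people assume game engine development centers on graphics rendering and game performance, but the emphasis here is on 'measuring AI intelligence' and 'real-time play infrastructure'. This indicates that the ARC Prize Foundation is using a game engine as a tool for evaluating general intelligence in AI, a goal quite different from that of traditional game development.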
Tracks the evolution of LLM security capabilities across benchmarks (CyberGym, Cybench, etc.), calculates capability doubling times, detects emergence patterns, and monitors cost-efficiency trends.
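This module represents a frontier direction in AI security research: it looks not only at current capability but also tracks how capability and efficiency change over time. The 'capability doubling time' calculation deserves particular attention, since it may reveal an accelerating trend in AI security capabilities and matters for anticipating future security challenges.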
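A doubling time is commonly estimated by fitting a line to the base-2 logarithm of a capability score over time and taking the reciprocal of the slope. The sketch below assumes that approach; the excerpt does not say how the tool computes it or which quantity it treats as "capability", and the function name and data are hypothetical.

```python
import math


def doubling_time(times, values):
    """Estimate doubling time from a least-squares fit of log2(value) vs time.

    Returns the time (in the units of `times`) needed for the value to double.
    """
    logs = [math.log2(v) for v in values]  # values must be positive
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    slope = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs)) / sum(
        (t - mean_t) ** 2 for t in times
    )
    if slope <= 0:
        raise ValueError("values are not growing; doubling time is undefined")
    return 1.0 / slope


# Hypothetical scores measured at months 0, 6, 12, 18 (not real benchmark data).
months = [0, 6, 12, 18]
scores = [1.0, 2.0, 4.0, 8.0]
print(doubling_time(months, scores))
```

Bounded scores such as success rates saturate near 100%, so a log-linear fit is only meaningful on a quantity that can keep growing.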
Performance was compared against 57 historical scores from human experts in the AI-bio field.
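Using historical expert scores as the baseline, rather than a live comparison, is a clever evaluation approach. It reflects how hard AI evaluation is, and it also hints that AI may already have surpassed currently active experts in some areas without this being widely recognized.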
We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models
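The report analyzes roughly 1,500 mainline open models. Data collection at this scale offers an unprecedented macro view of the open AI ecosystem, and this kind of systematic measurement may become an important baseline for assessing the trajectory of AI development.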
Simplify benchmarks to webVoyager-only with Pi SDK runner
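The project narrows its benchmarks to WebVoyager and runs them with the Pi SDK runner, which reflects its focus on intelligent web automation. This specialization suggests the team is looking closely at how AI models perform on complex web navigation and interaction tasks, which matters for evaluating and improving AI automation systems.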
Add benchmark framework and release submission overview - Add benchmark runner with onlineMind2Web benchmark support - Add agent client abstraction for codex/claude backends - Add CLI entry point for running benchmarks (pnpm benchmark)
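Surprisingly, this project is not only an automation tool: it also includes a complete benchmark framework with support for benchmarks such as onlineMind2Web. It abstracts over different AI backends (including Codex and Claude), letting users compare how different models perform on web automation tasks, which shows how thoroughly the project considers model evaluation.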
Add GCP WebVoyager benchmark runner and worktree tooling - Create benchmarks/infra/setup.sh — an idempotent script that provisions: - GCS bucket: gs://libretto-benchmarks - Artifact Registry repo: libretto-benchmarks (Docker) - Cloud Run Job: webvoyager-bench (4 CPU, 8Gi, 2h timeout)
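Surprisingly, this project sets up complete Google Cloud Platform infrastructure to run the WebVoyager benchmark, including a storage bucket, a Docker image repository, and a Cloud Run job. It provisions fairly substantial compute (4 CPU, 8Gi memory, 2-hour timeout), which suggests the project has strict requirements for the performance and scalability of its automation tasks.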
It is not common for real software to be developed the way MirrorCode tasks are structured — against a precise, programmatically checkable specification.
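This important caveat points out the gap between the MirrorCode evaluation method and real software development. The benchmark provides valuable evidence of AI capability, but how that capability translates into performance in real development environments remains an open question, which poses a challenge for applying AI to real-world software engineering.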
the 𝜏-benchmark [104] explicitly incorporates the pass^𝑘 metric to evaluate the consistency of an agent
reliability and consistency paper comparison
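To make the consistency metric concrete: pass^k is the probability that an agent succeeds on all k independent trials of the same task, in contrast to pass@k, which asks whether at least one of k trials succeeds. The sketch below uses the combinatorial estimators as I understand them from the 𝜏-bench and Codex papers; the function names are my own and the counts are made up.

```python
from math import comb


def pass_hat_k(num_trials, num_successes, k):
    """Estimate pass^k for one task: chance that k sampled trials ALL succeed."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb() returns 0 when k > num_successes, so the estimate is then 0.
    return comb(num_successes, k) / comb(num_trials, k)


def pass_at_k(num_trials, num_successes, k):
    """Estimate pass@k for one task: chance that AT LEAST ONE of k trials succeeds."""
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    return 1.0 - comb(num_trials - num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results, k):
    """Average pass^k over tasks; results is a list of (num_trials, num_successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Hypothetical task: 8 trials, 6 successes.
for k in (1, 2, 4):
    print(k, pass_hat_k(8, 6, k), pass_at_k(8, 6, k))
```

As k grows, pass^k can only fall while pass@k can only rise, which is why pass^k is the one that exposes unreliable agents.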
It is critical to be systematic when benchmarking code.
The first step is to record how long an unmodified version of the program takes to run. This provides a baseline in performance to which all other versions of the program must be compared. If we are adding concurrency, then the unmodified version of the program will typically perform tasks sequentially, e.g. one-by-one.
The modified versions of the program must perform better than the unmodified version. If they do not, they are not improvements and should not be adopted.
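The baseline-first procedure can be sketched like this: time the sequential version, time the concurrent candidate the same way, and adopt the candidate only if it is faster. The task below is a stand-in (`time.sleep` simulating I/O-bound work), and the helper names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def task(i):
    """Stand-in for I/O-bound work such as a network call."""
    time.sleep(0.02)
    return i * i


def run_sequential(n):
    """Unmodified version: tasks run one by one."""
    return [task(i) for i in range(n)]


def run_threaded(n):
    """Modified version: tasks run concurrently in a thread pool."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(task, range(n)))


def best_of(fn, *args, repeats=3):
    """Return the fastest wall-clock time over several runs to reduce noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


baseline = best_of(run_sequential, 5)   # record the baseline first
candidate = best_of(run_threaded, 5)    # then measure the modified version
speedup = baseline / candidate

print(f"baseline {baseline:.3f}s, candidate {candidate:.3f}s, speedup {speedup:.1f}x")
print("adopt candidate:", candidate < baseline)
```

Both versions should also be checked for identical output before their timings are compared, since a faster program that produces different results is not an improvement.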
Benchmarking is the practice of comparing business processes and performance metrics to industry bests and best practices from other companies. Dimensions typically measured are quality, time and cost.
Benchmarking Python code refers to comparing the performance of one program to variations of the program.
Devising ML Metrics
A benchmark tells you how slow your code is ("it took 20 seconds to do X Y Z") and a profiler tells you why it's slow ("35% of that time was spent doing compression").
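The distinction can be seen by running the same function under a timer and under a profiler. The sketch below uses the standard-library `time.perf_counter` and `cProfile`; the pipeline itself (`build_payload`, `compress_payload`) is a made-up workload chosen to echo the compression example.

```python
import cProfile
import io
import pstats
import time
import zlib


def build_payload(n):
    """Create some bytes to compress."""
    return b"".join(str(i).encode() for i in range(n))


def compress_payload(payload):
    """Compress at the slowest, highest-ratio zlib level."""
    return zlib.compress(payload, 9)


def pipeline(n):
    return compress_payload(build_payload(n))


# Benchmark: a single number answering "how slow is it?"
start = time.perf_counter()
pipeline(100_000)
elapsed = time.perf_counter() - start
print(f"pipeline took {elapsed:.4f}s")

# Profile: a per-function breakdown answering "why is it slow?"
profiler = cProfile.Profile()
profiler.enable()
pipeline(100_000)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Profiling adds overhead of its own, so the profiler's totals should not be reported as the benchmark result.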
before(:all) do
  @fiber = Fiber.new do
    Benchmark.ips do |benchmark|
      @benchmark = benchmark
      Fiber.yield
      benchmark.compare!
    end
  end
  @fiber.resume
end
Rocca, R., & Yarkoni, T. (2020). Putting psychology to the test: Rethinking model evaluation through benchmarking and prediction. PsyArXiv. https://doi.org/10.31234/osf.io/e437b