11 Matching Annotations
  1. Apr 2026
    1. The three metrics where we find acceleration are concentrated in programming and mathematics. These are areas that labs have explicitly targeted for improvement, and they share an important property: correctness is easy to verify automatically.

      The mainstream view might hold that AI capabilities are improving evenly across domains, but the author points out that the acceleration is concentrated in programming and mathematics because correctness in those areas is easy to verify automatically. This hints that AI progress may not be universal, but concentrated in specific, quantifiable domains (see the sketch below).

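      Why "easy to verify automatically" matters is clearest in code. Below is a minimal sketch (the names check_solution and solve are invented for illustration, not from the source): grading a generated program reduces to running it against unit tests, which yields exactly the kind of cheap, automatic correctness signal the annotation describes.

      ```python
      # Hypothetical HumanEval-style checker: a candidate program is
      # "correct" iff it passes every unit test. Names are illustrative.

      def check_solution(candidate_src: str, tests: list[tuple[tuple, object]]) -> bool:
          """Return True iff the candidate passes every test case."""
          namespace: dict = {}
          try:
              exec(candidate_src, namespace)  # define the candidate function
              solve = namespace["solve"]      # convention here: entry point is `solve`
              return all(solve(*args) == expected for args, expected in tests)
          except Exception:
              return False                    # any crash or missing entry point fails

      # A model-generated snippet and its tests:
      generated = "def solve(a, b):\n    return a + b\n"
      print(check_solution(generated, [((1, 2), 3), ((-1, 1), 0)]))  # True
      ```
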
    2. While model capabilities have improved dramatically for use cases like codegen and mathematical reasoning, they still lag behind on the data side (as evidenced through SQL benchmarks like Spider 2.0 and Bird Bench).

      A surprising fact: despite dramatic progress in code generation and mathematical reasoning, models still lag on the data side. Benchmarks such as Spider 2.0 and Bird Bench show weak performance on basic data tasks like SQL queries, which challenges the assumption of across-the-board capability gains and suggests data reasoning may need special treatment (see the sketch below).

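      For context on how the cited SQL benchmarks score models, here is a simplified sketch of "execution accuracy", the style of metric Spider-like suites report: a predicted query counts as correct if it returns the same result as the gold query on the same database. The schema and queries are invented, and real scorers treat row order and duplicates more carefully.

      ```python
      # Simplified execution-accuracy check for text-to-SQL evaluation.
      import sqlite3

      def execution_match(db: sqlite3.Connection, predicted: str, gold: str) -> bool:
          try:
              pred_rows = set(db.execute(predicted).fetchall())
          except sqlite3.Error:
              return False                  # unexecutable SQL counts as wrong
          gold_rows = set(db.execute(gold).fetchall())
          return pred_rows == gold_rows     # set comparison ignores row order

      db = sqlite3.connect(":memory:")
      db.execute("CREATE TABLE users (id INTEGER, country TEXT)")
      db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "DE"), (2, "FR"), (3, "DE")])

      gold = "SELECT COUNT(*) FROM users WHERE country = 'DE'"
      pred = "SELECT COUNT(id) FROM users WHERE country = 'DE'"  # different text, same result
      print(execution_match(db, pred, gold))  # True
      ```
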
    3. Gemma 4 E4B matches or exceeds GPT-4o across multiple benchmarks including MATH, GSM8K, GPQA Diamond & HumanEval.

      Surprising: Google's Gemma 4 E4B, a free model, matches or beats GPT-4o, an industry-leading commercial model, on multiple benchmarks. This suggests that open and free AI models have reached commercial-grade quality, breaking the pattern of the field being dominated by a handful of large companies.

  2. Jan 2026
    1. For instance, datasets such as AAAR-1.0 [61], ScienceAgentBench [11], and TaskBench [83] provide structured, expert-labeled benchmarks for assessing research reasoning, scientific workflows, and multi-tool planning. Others, such as FlowBench [96], ToolBench [38], and API-Bank [47], focus on tool use and function-calling across large API repositories. These benchmarks typically include not only the gold tool sequences but also expected parameter structures, enabling fine-grained evaluation. In parallel, datasets like AssistantBench [109], AppWorld [91], and WebArena [126] simulate more open-ended and interactive agent behaviors in web and application environments. They emphasize dynamic decision-making, long-horizon planning, and user-agent interactions. Several benchmarks also support safety and robustness testing; for example, AgentHarm [5] assesses potentially harmful behaviors, while AgentDojo [17] evaluates resilience against prompt injection attacks. Leaderboards such as the Berkeley Function-Calling Leaderboard (BFCL) [100] and Holistic Agent Leaderboard [88] consolidate these evaluations by …

      benchmarking

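      To make "gold tool sequences ... expected parameter structures, enabling fine-grained evaluation" concrete, here is a minimal sketch of that kind of scoring; the data layout is invented for illustration and is not the actual ToolBench or BFCL format.

      ```python
      # Hypothetical fine-grained tool-call scorer: compare predicted tool
      # names and arguments against the benchmark's gold sequence.

      def score_tool_calls(predicted: list[dict], gold: list[dict]) -> dict:
          same_length = len(predicted) == len(gold)
          names_ok = same_length and all(
              p.get("tool") == g["tool"] for p, g in zip(predicted, gold))
          params_ok = same_length and all(
              p.get("args") == g["args"] for p, g in zip(predicted, gold))
          return {"sequence_exact": names_ok, "params_exact": params_ok}

      gold = [{"tool": "search_flights", "args": {"from": "BER", "to": "LHR"}},
              {"tool": "book_flight", "args": {"flight_id": "LH123"}}]
      pred = [{"tool": "search_flights", "args": {"from": "BER", "to": "LHR"}},
              {"tool": "book_flight", "args": {"flight_id": "LH999"}}]  # wrong parameter
      print(score_tool_calls(pred, gold))
      # {'sequence_exact': True, 'params_exact': False}
      ```
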
    2. While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it’s an open challenge for AI labs that believe they can do better — something Foody fully expects in the months to come.

      Expectation that models will get trained against the tests they currently fail.

  3. Nov 2025
    1. LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters.

      The paper concludes that most benchmarks used to establish LLM progress are mistargeted or leave out the aspects that matter.

  4. Oct 2020
  5. Feb 2020