5 Matching Annotations
  1. Last 7 days
    1. On average Claude models have an SWE-ECI 2.7 points higher than their general ECI, and a Math-ECI 1.8 points lower.

      这个数据点显示了Claude模型在软件工程和数学领域的表现差异。2.7分的软件工程优势和1.8分的数学劣势表明Claude确实在软件工程方面表现相对更好,而在数学方面相对较弱。这种差异虽然不算巨大,但方向性明显,与文章标题的论点一致。数据来自多个模型的平均值,具有一定统计意义。

  2. May 2026
    1. As of March 2026, AI systems are able to post-train models to get about half as much of the uplift as ones trained by humans. The specific eval scores are derived by a 'weighted average is taken across all post-trained LLMs... The top-scoring systems as of April get 25%-28% (Opus 4.6, and GPT 5.4), compared to a human score of 51%.'

      在模型微调任务上,AI系统已能达到人类研究员51%性能的一半,显示出AI在科研任务上的显著进步。

  3. Apr 2026
    1. Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.

      大多数人可能认为不同类型的AI模型性能提升速度大致相同,但研究发现推理模型不仅有一次性的性能飞跃,而且提升速度是非推理模型的2-3倍。这一发现颠覆了人们对不同模型类型进步速度的预期。

    2. Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.

      2-3倍的速度差异是一个非常显著的数字,表明推理模型与非推理模型之间存在明显的性能差距。这个倍数关系暗示了架构变化可能带来的性能飞跃,而非简单的线性改进。这一数据点支持了推理能力可能是AI进步关键驱动力的假设。

    1. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks

      大多数人认为当前最先进的多模态大模型已经接近或超越人类在复杂任务上的表现。然而,作者的数据表明,即使是最好的模型在复杂现实任务上的表现也远低于预期,准确率从整体56.3%骤降至23.0%。这一发现挑战了AI领域对当前技术能力的乐观评估,揭示了现实世界多模态代理任务的极端复杂性。