Hypothesis

5 Matching Annotations

May 2026
epoch.ai

https://epoch.ai/data-insights/claude-ds-eci

1
1. fxp007 19 May 2026
  
  in Public
  
  On average Claude models have an SWE-ECI 2.7 points higher than their general ECI, and a Math-ECI 1.8 points lower.
  
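  This data point shows how Claude models' performance differs between software engineering and mathematics. A 2.7-point advantage in software engineering and a 1.8-point deficit in math indicate that Claude is indeed relatively stronger at software engineering and relatively weaker at math. The gap is not huge, but its direction is clear and consistent with the thesis in the article's title. The figures are averages across multiple models, which gives them some statistical weight.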
  
  data-point statistics performance-gap

Tags

data-point

statistics

performance-gap

Annotators

fxp007

URL

epoch.ai/data-insights/claude-ds-eci
jack-clark.net

Import AI 455: Automating AI Research

1
1. fxp007 15 May 2026
  
  in Public
  
  As of March 2026, AI systems are able to post-train models to get about half as much of the uplift as ones trained by humans. The specific eval scores are derived by a 'weighted average is taken across all post-trained LLMs... The top-scoring systems as of April get 25%-28% (Opus 4.6, and GPT 5.4), compared to a human score of 51%.'
  
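  On model fine-tuning tasks, AI systems can already reach about half of the 51% score achieved by human researchers, showing significant progress by AI on research tasks.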
  
  model-training performance-gap

Tags

model-training

performance-gap

Annotators

fxp007

URL

jack-clark.net/2026/05/04/import-ai-455-automating-ai-research/
Apr 2026
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

2
1. fxp007 26 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
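  Most people might assume that different types of AI models improve at roughly the same rate, but the study finds that reasoning models not only show a one-off jump in performance but also improve 2-3 times faster than non-reasoning models. This finding overturns expectations about how quickly different model types progress.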
  
  non-consensus reasoning-models performance-gap
2. fxp007 25 Apr 2026
  
  in Public
  
  Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.
  
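  A 2-3x difference in speed is a very significant figure, indicating a clear performance gap between reasoning and non-reasoning models. This multiple suggests that an architectural change can bring a leap in performance rather than a simple linear improvement. This data point supports the hypothesis that reasoning ability may be a key driver of AI progress.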
  
  data-point reasoning-models performance-gap

Tags

data-point

performance-gap

reasoning-models

non-consensus

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
arxiv.org

https://arxiv.org/abs/2604.03016

1
1. fxp007 08 Apr 2026
  
  in Public
  
  Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks
  
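  Most people believe that today's most advanced multimodal large models already approach or exceed human performance on complex tasks. However, the authors' data show that even the best model performs far below expectations on complex real-world tasks, with accuracy dropping sharply from 56.3% overall to 23.0%. This finding challenges the AI field's optimistic assessment of current capabilities and reveals the extreme complexity of real-world multimodal agent tasks.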
  
  counterintuitive performance-gap ai-capabilities

Tags

ai-capabilities

performance-gap

counterintuitive

Annotators

fxp007

URL

arxiv.org/abs/2604.03016
