Hypothesis

2 Matching Annotations

Apr 2026
sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  _Self-reported score with custom Anthropic scaffold._ SWEPro were evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with the max thinking efforts due to frequent timeouts during our evaluation trials.
  
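  Footnote 2 reveals an important data point: the 53.4 score for Opus 4.6 is self-reported by Anthropic, because the authors hit frequent timeouts during their own evaluation and could not verify it themselves. This points to a data-reliability problem in the performance comparison: the Opus result in particular rests on vendor-reported data, which may be biased.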
  
  data-point evaluation-methodology data-reliability

Tags

data-reliability

evaluation-methodology

data-point

Annotators

fxp007

URL

sakana.ai/fugu-beta/
arxiv.org

https://arxiv.org/abs/2604.03016

1
1. fxp007 08 Apr 2026
  
  in Public
  
  To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories.
  
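  Mainstream evaluation methods usually check only whether the final answer is correct, whereas the authors propose a revolutionary approach: auditing intermediate process states and introducing an "overthinking" metric to measure efficiency. This view runs counter to conventional practice in AI evaluation and implies that pursuing correct answers alone may mask serious flaws in an AI system's efficiency and reasoning path.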
  
  non-consensus evaluation-methodology process-verification

Tags

process-verification

non-consensus

evaluation-methodology

Annotators

fxp007

URL

arxiv.org/abs/2604.03016
