2 Matching Annotations
  1. Apr 2026
    1. _Self-reported score with custom Anthropic scaffold._ SWEPro were evaluated with the mini-swe-agent scaffold. However, we use the scores reported by Anthropic for Opus with the max thinking efforts due to frequent timeouts during our evaluation trials.

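      Footnote 2 reveals an important data point: the 53.4 score for Opus 4.6 is self-reported by Anthropic, because the authors ran into frequent timeouts during their own evaluation and could not verify it themselves. This points to a data-reliability problem in the performance comparison: the Opus result in particular rests on vendor-reported numbers, which may be biased.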

    1. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories.

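      Mainstream evaluation methods usually check only whether the final answer is correct. The authors propose a fundamentally different approach: audit intermediate process states and introduce an "overthinking" metric to measure efficiency. This runs counter to conventional practice in AI evaluation and implies that focusing solely on correct answers may mask serious flaws in an AI system's efficiency and reasoning path.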
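
To make the idea in the note above concrete, here is a minimal Python sketch of process-level scoring plus an overthinking measure. The quoted passage does not give the paper's actual formulas, so everything here is an assumption for illustration: the `StepCheck` type, the `process_score` function, and the definition of `overthinking_ratio` as the relative excess of agent steps over a human reference trajectory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepCheck:
    """One audited intermediate state (hypothetical structure)."""
    name: str
    passed: bool


def process_score(checks):
    """Fraction of intermediate checks that passed (0.0 if there are none)."""
    if not checks:
        return 0.0
    return sum(1 for c in checks if c.passed) / len(checks)


def overthinking_ratio(agent_steps, human_steps):
    """Relative excess of agent steps over the human reference.

    Assumed definition, not the paper's: 0.0 means the agent used no more
    steps than the human; 0.5 means it used 50% more.
    """
    if human_steps <= 0:
        raise ValueError("human_steps must be positive")
    return max(0.0, (agent_steps - human_steps) / human_steps)


if __name__ == "__main__":
    checks = [StepCheck("located bug", True), StepCheck("wrote fix", False)]
    print(process_score(checks))
    print(overthinking_ratio(agent_steps=15, human_steps=10))
```

Under this assumed definition, an agent can reach a correct final answer and still score poorly, either by failing intermediate checks or by taking far more steps than the human trajectory, which is the gap the note says answer-only evaluation would hide.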