2 Matching Annotations
  1. Last 7 days
    1. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories.

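      Mainstream evaluation methods usually look only at whether the final answer is correct, whereas the authors propose a revolutionary approach: auditing intermediate process states and introducing an "overthinking" metric to measure efficiency. This runs counter to conventional practice in AI evaluation and suggests that pursuing correct answers alone may mask serious deficiencies in an AI system's efficiency and reasoning path.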

  2. Jun 2018