1 Matching Annotations
  1. Last 7 days
    1. on tasks that take a human expert 90 minutes to 3 hours, a GPT-5 agent (with time horizon of around 2 hours and 17 minutes) succeeds 100% of the time for around one-third of the tasks, fails 100% of the time for around one-third of the tasks, and sometimes succeeds and sometimes fails on the remaining third of tasks.

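      "One-third always succeed, one-third always fail, one-third are random": this distribution reveals the true shape of current AI capability. It is not a smooth capability curve but a bimodal "can do / cannot do" distribution with a random band in between. This means that when you assign a task to an AI, the result of a single try tells you almost nothing; you need multiple runs to determine which band the task falls into. For AI product designers, this distribution is the core constraint on reliability design.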
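
A minimal sketch of the "run it several times" point above, not anything taken from the quoted source. It assumes a hypothetical `run_once` callable that returns `True` when the agent succeeds, assumes runs are independent, and uses illustrative bucket labels. The second helper computes how often a genuinely mixed task would still look all-success or all-failure after `n_runs` attempts.

```python
import random
from typing import Callable


def classify_task(run_once: Callable[[], bool], n_runs: int = 10) -> str:
    """Bucket a task by its outcomes over n_runs attempts.

    run_once is a hypothetical callable: True means the agent succeeded.
    The labels describe what was observed, not a guarantee about the task.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    successes = sum(1 for _ in range(n_runs) if run_once())
    if successes == n_runs:
        return "always-succeeds"
    if successes == 0:
        return "always-fails"
    return "mixed"


def chance_mixed_task_looks_uniform(p_success: float, n_runs: int) -> float:
    """Probability that a task with per-run success rate p_success produces
    all successes or all failures in n_runs independent runs."""
    return p_success ** n_runs + (1.0 - p_success) ** n_runs


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the simulation is repeatable
    flaky_task = lambda: rng.random() < 0.5  # simulated 50/50 task, not a real agent
    print(classify_task(flaky_task, n_runs=10))
    print(chance_mixed_task_looks_uniform(0.5, 1))   # one try cannot reveal a mixed task
    print(chance_mixed_task_looks_uniform(0.5, 10))  # ten tries rarely hide it
```

With a single run, a mixed task always lands in one of the two uniform buckets, which is why one try is uninformative. More runs shrink that risk, but the choice of `n_runs` is a cost trade-off that this sketch leaves open.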