Hypothesis

on tasks that take a human expert 90 minutes to 3 hours, a GPT-5 agent (with time horizon of around 2 hours and 17 minutes) succeeds 100% of the time for around one-third of the tasks, fails 100% of the time for around one-third of the tasks, and sometimes succeeds and sometimes fails on the remaining third of tasks.

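"One-third always succeed, one-third always fail, one-third are random": this split shows the real shape of current AI capability. It is not a smooth capability curve but a bimodal "can do / cannot do" distribution with a random band in between. When you hand a task to an AI, the result of a single attempt therefore tells you almost nothing; you need several runs to judge which band the task falls into. For AI product designers, this distribution is the core constraint on reliability design.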
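To make the "one try is not enough" point concrete, here is a minimal sketch of how repeated runs could be bucketed, and of how often a task in the random band would pass for a deterministic one. The function names, the bucket labels, and the 50% per-run success rate used in the usage note are illustrative assumptions, not figures from the quoted study.

```python
from typing import Sequence


def classify_task(outcomes: Sequence[bool]) -> str:
    """Bucket a task by its observed pass/fail outcomes over repeated runs.

    The labels are hypothetical. A finite sample can only show that a task
    has been consistent so far, not that it always succeeds or always fails.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    successes = sum(outcomes)
    if successes == len(outcomes):
        return "consistent-success"
    if successes == 0:
        return "consistent-failure"
    return "mixed"


def prob_mixed_looks_consistent(p_success: float, n_runs: int) -> float:
    """Chance that a random-band task yields n identical outcomes in a row.

    Assumes independent runs with a fixed per-run success rate p_success.
    """
    return p_success ** n_runs + (1.0 - p_success) ** n_runs


def runs_needed(p_success: float, max_miss: float) -> int:
    """Smallest number of runs that keeps the misclassification chance at or below max_miss."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0.0 < max_miss <= 1.0:
        raise ValueError("max_miss must be in (0, 1]")
    n = 1
    while prob_mixed_looks_consistent(p_success, n) > max_miss:
        n += 1
    return n
```

Under the assumed 50% per-run rate, `prob_mixed_looks_consistent(0.5, 1)` is 1.0: a single run can never reveal that a task is in the random band. `runs_needed(0.5, 0.1)` returns 5, since five identical outcomes occur by chance only 6.25% of the time. A task whose true rate is closer to 0 or 1 would need more runs to detect.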

Tags: bimodal-distribution, reliability, GPT-5, task-prediction
