Reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) for the same problem when it is embedded in surrounding context than when the problem is presented in isolation.
A striking finding: for the very same problem, merely padding the prompt with irrelevant context shortens the reasoning model's chain of thought by up to 50%, even though the problem statement itself is unchanged. This means that what we thought we were evaluating, the model's "problem-solving ability", is actually its "problem-solving ability under a particular context wrapper". Every reasoning benchmark scored on isolated problems may therefore substantially overestimate the model's actual reasoning depth in real agent scenarios.
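One way to quantify this effect is to run the same problem twice, once in isolation and once embedded in distractor context, and compare the lengths of the resulting reasoning traces. The sketch below is a hypothetical harness, not the original study's method: `trace_length` uses a crude whitespace-split word count as a stand-in for real token counting, and the example traces are illustrative placeholders for actual model outputs.

```python
def trace_length(trace: str) -> int:
    # Crude proxy for token count: whitespace-separated word count.
    # A real harness would use the model's tokenizer or API usage stats.
    return len(trace.split())

def shrinkage(isolated_trace: str, in_context_trace: str) -> float:
    # Fraction by which the reasoning trace shrinks when the same
    # problem is embedded in irrelevant context (0.5 == 50% shorter).
    iso = trace_length(isolated_trace)
    ctx = trace_length(in_context_trace)
    return (iso - ctx) / iso

# Illustrative traces (placeholders for real model outputs).
isolated = "step " * 40     # 40 reasoning steps on the isolated problem
in_context = "step " * 20   # 20 steps with distractor context added
print(f"trace shrinkage: {shrinkage(isolated, in_context):.0%}")
# → trace shrinkage: 50%
```

A benchmark that only reports the isolated-prompt number would miss the second measurement entirely, which is exactly the gap the finding above points at.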