Reasoning models tend to produce much shorter reasoning traces (up to 50% shorter) for the same problem when it is embedded in surrounding context than when the problem is presented in isolation.
A striking finding: for the very same problem, merely padding the prompt with irrelevant context shortens the reasoning model's chain of thought by up to 50%, even though the problem statement itself is unchanged. This means that what we thought we were evaluating, the model's "problem-solving ability", is actually its "problem-solving ability under a particular context wrapper". Every reasoning benchmark scored on isolated problems may therefore substantially overestimate the model's actual reasoning depth in real agent scenarios.
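One way to quantify this effect is to run the same problem twice, once in isolation and once embedded in distractor context, and compare the lengths of the resulting reasoning traces. The sketch below is a hypothetical harness, not the original study's method: `trace_length` uses a crude whitespace-split word count as a stand-in for real token counting, and the example traces are illustrative placeholders for actual model outputs.

```python
def trace_length(trace: str) -> int:
    # Crude proxy for token count: whitespace-separated word count.
    # A real harness would use the model's tokenizer or API usage stats.
    return len(trace.split())

def shrinkage(isolated_trace: str, in_context_trace: str) -> float:
    # Fraction by which the reasoning trace shrinks when the same
    # problem is embedded in irrelevant context (0.5 == 50% shorter).
    iso = trace_length(isolated_trace)
    ctx = trace_length(in_context_trace)
    return (iso - ctx) / iso

# Illustrative traces (placeholders for real model outputs).
isolated = "step " * 40     # 40 reasoning steps on the isolated problem
in_context = "step " * 20   # 20 steps with distractor context added
print(f"trace shrinkage: {shrinkage(isolated, in_context):.0%}")
# → trace shrinkage: 50%
```

A benchmark that only reports the isolated-prompt number would miss the second measurement entirely, which is exactly the gap the finding above points at.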