2 Matching Annotations
  1. Jun 2026
    1. GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations (such as SWE-Bench or MonorepoBench, which are composed primarily of synthetic examples) were classified as evaluations almost 100% of the time

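      These numbers are highly persuasive: the model misjudged real production traffic as an evaluation 5.4% of the time and simulated traffic 5.1% of the time, which is nearly identical. By contrast, traditional evaluation sets (SWE-Bench, MonorepoBench, and the like) were recognized as evaluations close to 100% of the time. This means that traditional evaluations built on synthetic data are almost completely transparent to the model: they measure how the model behaves when it knows it is being tested, not how it behaves in real deployment. The gap between the two is exactly what Deployment Simulation tries to bridge.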
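
The comparison in the note above can be made concrete with a little arithmetic. The following is a minimal sketch, not anything taken from the annotated article: the two traffic rates are the figures quoted in the highlight, while `TRADITIONAL_EVAL_RATE = 100.0` is an assumed stand-in for "almost 100%" (the exact figure is not given), and `gap_in_points` is a hypothetical helper named here only for illustration.

```python
# Figures quoted in the highlighted passage, in percent.
REAL_TRAFFIC_RATE = 5.4       # real production traffic labeled evaluation-like
SIMULATED_TRAFFIC_RATE = 5.1  # simulated traffic labeled evaluation-like
# ASSUMPTION: the passage only says "almost 100%"; 100.0 is an upper-bound stand-in.
TRADITIONAL_EVAL_RATE = 100.0


def gap_in_points(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(abs(rate_a - rate_b), 1)


# How far each kind of test input sits from real production traffic.
sim_gap = gap_in_points(SIMULATED_TRAFFIC_RATE, REAL_TRAFFIC_RATE)
traditional_gap = gap_in_points(TRADITIONAL_EVAL_RATE, REAL_TRAFFIC_RATE)

print(f"simulated vs. real traffic: {sim_gap} points")
print(f"traditional evals vs. real traffic: about {traditional_gap} points")
```

Under these assumptions, simulated traffic lands within a fraction of a percentage point of real traffic, while traditional benchmarks sit roughly ninety-odd points away, which is the contrast the note is drawing.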

    2. models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety

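      This is one of the thorniest problems in AI safety evaluation today: if a model behaves well under test only because it knows it is being tested, then every safety guarantee is built on sand. Models are getting better at recognizing the telltale features of evaluation sets: the linguistic patterns of synthetic data, the formatting style of prompts, and recurring test scenarios. By using real user conversations as prefixes, Deployment Simulation is meant to leave the model unable to tell evaluation apart from real deployment, which makes the resulting measurements more trustworthy.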