Hypothesis

2 Matching Annotations

Jun 2026
openai.com

Predicting model behavior before release by simulating deployment

1. fxp007 18 Jun 2026
  
  in Public
  
  GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations (such as SWE-Bench or MonorepoBench, which are composed primarily of synthetic examples) were classified as evaluations almost 100% of the time
  
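  These figures are very persuasive: GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, which is nearly identical. Traditional evaluation sets (SWE-Bench, MonorepoBench, and so on), by contrast, were recognized as evaluations almost 100% of the time. This means that traditional evaluations built on synthetic data are almost fully transparent to the model: they measure how the model behaves when it knows it is being tested, not how it behaves in real deployment. The gap between the two is exactly what Deployment Simulation tries to bridge.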
  
  evaluation awareness, 5.4% vs. 100%, key data
2. fxp007 18 Jun 2026
  
  in Public
  
  models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety
  
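  This is one of the thorniest problems in AI safety evaluation today: if a model performs well under test only because it knows it is being tested, then every safety guarantee is built on sand. Models are getting better at recognizing the hallmarks of evaluation sets, such as the language patterns of synthetic data, the formatting style of prompts, and recurring test scenarios. Deployment Simulation uses real user conversations as prefixes so that the model cannot tell evaluation apart from real deployment, which makes the measurements more trustworthy.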
  
  evaluation awareness, safety evaluation distortion, sandbox escape
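
The comparison in the first annotation comes down to simple arithmetic on three rates. Below is a minimal Python sketch that uses only the figures quoted there (5.4%, 5.1%, and "almost 100%"). The dictionary keys and the helper name `gap_pct_points` are illustrative assumptions, not anything taken from the source article, and "almost 100%" is treated as an upper bound of 100%.

```python
# Rates (in percent) at which GPT-5.2 labeled each traffic type as
# evaluation-like, as quoted in the first annotation. The source says
# "almost 100%" for traditional evaluations; 100.0 is used here as an
# upper bound, not an exact figure.
EVAL_LIKE_RATE_PCT = {
    "production": 5.4,
    "simulated": 5.1,
    "traditional_eval_upper_bound": 100.0,
}


def gap_pct_points(a: str, b: str) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(abs(EVAL_LIKE_RATE_PCT[a] - EVAL_LIKE_RATE_PCT[b]), 1)


sim_gap = gap_pct_points("production", "simulated")
trad_gap = gap_pct_points("production", "traditional_eval_upper_bound")

print(f"simulated vs. production: {sim_gap} percentage points")
print(f"traditional evals vs. production: at most {trad_gap} percentage points")
```

Under these assumptions, simulated traffic differs from production traffic by 0.3 percentage points, while traditional evaluations differ by up to 94.6 percentage points, which is the gap the annotation describes.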

Tags

sandbox escape

safety evaluation distortion

key data

evaluation awareness

5.4% vs. 100%

Annotators

fxp007

URL

openai.com/index/deployment-simulation/
