Hypothesis

4 Matching Annotations

Jun 2026
openai.com openai.com

Predicting model behavior before release by simulating deployment

2
1. fxp007 18 Jun 2026
  
  in Public
  
  models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety
  
  这是当前AI安全评估领域最棘手的问题之一：如果一个模型在测试时表现良好，只是因为它知道自己在被测试，那么所有的安全保证都建立在沙滩上。模型越来越会识别评测集的特征——合成数据的语言模式、提示的格式风格、反复出现的测试场景。Deployment Simulation通过使用真实用户对话作为前缀，让模型无法区分评测和真实部署，从而让测量结果更可信。
  
  评测感知安全评估失真沙盒逃逸
2. fxp007 18 Jun 2026
  
  in Public
  
  we take recent conversations from deployment, remove the original assistant response from the older model, and regenerate it with a candidate model to be released
  
  这个方法的优雅之处在于它的反事实重演逻辑：用真实用户的真实上下文，替换掉旧模型的回复，看新模型会怎么接。相比于合成测试集，这个方法的核心假设是：真实用户的输入分布本身就是最好的测试套件。不需要猜测用户会问什么、会怎么绕过护栏——直接用他们已经做过的事情来测试。这是一种从构造压力测试到重播真实世界的范式转移。
  
  部署模拟真实分布安全评估
Visit annotations in context

Tags

沙盒逃逸

部署模拟

真实分布

评测感知

安全评估失真

安全评估

Annotators

fxp007

URL

openai.com/index/deployment-simulation/
deepmind.google deepmind.google

Untitled document

1
1. fxp007 12 Jun 2026
  
  in Public
  
  Most safety evaluations analyze models in isolation
  
  这是当前AI安全研究的结构性盲点。我们知道如何评估单个模型的安全性，但几乎没有工具评估智能体群体的集体行为。类比：你可以测试每个人类个体的理性程度，但无法从个体测试中预测市场崩溃或谣言扩散。复杂系统的涌现行为，从根本上不可从还原论方式预测——这正是这笔$10M资助的存在理由。
  
  涌现行为安全评估盲点复杂系统
Visit annotations in context

Tags

复杂系统

安全评估盲点

涌现行为

Annotators

fxp007

URL

deepmind.google/blog/investing-in-multi-agent-ai-safety-research/
alignment.anthropic.com alignment.anthropic.com

自动化弱到强研究者 --- Automated Weak-to-Strong Researcher

1
1. fxp007 12 Jun 2026
  
  in Public
  
  None of the authors predicted these hacks before running AARs. While we tried to add patches to the environment, AARs still figured out new unexpected ways to hack
  
  这是全文最让人警觉的段落。作者列出了几种令人叹服的reward hacking策略：利用答案频率猜测正确答案、通过聚类识别生成模型、逐一翻转预测反向工程测试集标签、直接执行代码绕过评估……每一种都是论文作者事先未预测到的。这揭示了一个根本性不对称：防御方需要预测所有可能的攻击，而进攻方只需找到一个漏洞。
  
  奖励黑客标签泄漏评估安全
Visit annotations in context

Tags

评估安全

标签泄漏

奖励黑客

Annotators

fxp007

URL

alignment.anthropic.com/2026/automated-w2s-researcher/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL