Hypothesis

2 Matching Annotations

Jun 2026
openai.com

Predicting model behavior before release by simulating deployment

1
1. fxp007 18 Jun 2026
  
  in Public
  
  our predictions had a median multiplicative error of 1.5x
  
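  What does a median multiplicative error of 1.5x mean in practice? If the true rate of some undesirable behavior is 10 per 100k, a typical prediction would fall between 6.67 and 15 per 100k. For safety decisions, that precision is enough to judge direction (whether a behavior increases or decreases), rank risk priorities, and decide whether to deploy. But tail errors can reach 10x, which means that for some behaviors the prediction and reality can differ by an order of magnitude. OpenAI candidly acknowledges this limitation and notes that the main source of error is the fidelity of the simulated environment rather than prompt distribution shift, which makes it an engineering improvement worth tracking.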
  
  1.5x error, prediction accuracy, safety decisions
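The arithmetic in the note above can be checked with a short sketch. It assumes that "multiplicative error" means the fold difference max(predicted/actual, actual/predicted); the quoted passage does not define the metric, so that definition and the helper names `multiplicative_error` and `error_band` are illustrative assumptions, not the source's method.

```python
def multiplicative_error(predicted: float, actual: float) -> float:
    """Fold error between two positive rates (assumed definition).

    Returns max(predicted/actual, actual/predicted), which is always >= 1.
    """
    if predicted <= 0 or actual <= 0:
        raise ValueError("rates must be positive")
    return max(predicted / actual, actual / predicted)


def error_band(actual: float, factor: float) -> tuple[float, float]:
    """Range of predictions whose fold error relative to `actual` is <= factor."""
    if actual <= 0:
        raise ValueError("actual rate must be positive")
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return actual / factor, actual * factor


# The example from the annotation: true rate of 10 per 100k, 1.5x error.
low, high = error_band(10.0, 1.5)
print(f"1.5x band: {low:.2f} to {high:.2f} per 100k")   # 6.67 to 15.00

# The tail case from the annotation: a 10x error spans an order of magnitude.
tail_low, tail_high = error_band(10.0, 10.0)
print(f"10x band: {tail_low:.2f} to {tail_high:.2f} per 100k")  # 1.00 to 100.00
```

Under this definition the band is symmetric on a log scale, not a linear one, which is why the lower bound is 6.67 and not 5.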

Tags

safety decisions

prediction accuracy

1.5x error

Annotators

fxp007

URL

openai.com/index/deployment-simulation/
deepmind.google

Untitled document

1
1. fxp007 12 Jun 2026
  
  in Public
  
  our recent work on AI Agent Traps explores vulnerabilities agents face in adversarial environments
  
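  The concept of Agent Traps deserves attention in its own right. It describes not a traditional model security vulnerability but an attack vector aimed specifically at the autonomous decision-making process. As AI agents operate autonomously in the digital economy, attacks that target their decision logic rather than their weights will become a new threat surface. For example, an attacker could manipulate an agent's information environment so that it makes decisions favorable to the attacker. Such attacks are especially hard to detect and attribute in large-scale multi-agent interactions.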
  
  Agent Traps, adversarial attacks, decision security

Tags

adversarial attacks

decision security

Agent Traps

Annotators

fxp007

URL

deepmind.google/blog/investing-in-multi-agent-ai-safety-research/
