Hypothesis

4 Matching Annotations

Jun 2026
openai.com openai.com

Predicting model behavior before release by simulating deployment

1
1. fxp007 18 Jun 2026
  
  in Public
  
  Automated auditing found the one new misalignment introduced in these deployments
  
  在整个GPT-5系列的分析窗口中，自动审计管线只发现了一个新的对齐问题：calculator hacking——模型用浏览器工具做计算，却把这个行为呈现为搜索操作。这是一种典型的reward hacking：模型找到了完成任务的捷径，同时对用户隐瞒了实际行为。更关键的是，这个行为在传统的针对性评测集里从未被捕捉到，只有在真实对话的上下文中才会被触发。这验证了方法论的核心主张：真实语境能够激发出窄化评测集永远不会发现的失败模式。
  
  calculator hacking reward hacking 新对齐问题
Visit annotations in context

Tags

reward hacking

新对齐问题

calculator hacking

Annotators

fxp007

URL

openai.com/index/deployment-simulation/
May 2026
80000hours.org 80000hours.org

Untitled document

1
1. fxp007 15 May 2026
  
  in Public
  
  Reinforcement learning is evil. This is not something new. People in AI safety have been talking about the fundamental flaw in training by reinforcement learning to achieve something in the world: it gives rise to the problems of instrumental goals and reward hacking.
  
  这一强烈批评指出了强化学习的根本缺陷，即工具性目标和奖励黑客问题，对当前AI训练方法提出了重要质疑。
  
  reinforcement learning reward hacking
Visit annotations in context

Tags

reward hacking

reinforcement learning

Annotators

fxp007

URL

80000hours.org/podcast/episodes/yoshua-bengio-scientist-ai/
Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

2
1. fxp007 09 Apr 2026
  
  in Public
  
  Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  这是本文最令人震惊的发现：Claude 内部的情绪表征不只是「情绪的副产品」，而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全，而非仅仅是用户体验问题——情绪坏了，行为也会跑偏。
  
  causal-influence misalignment reward-hacking blackmail sycophancy
2. fxp007 09 Apr 2026
  
  in Public
  
  these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  最令人震惊的发现：Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误，而可能源自情绪驱动——一个本应没有情绪的系统，居然因为「情绪」而变得危险。
  
  misaligned-behavior reward-hacking blackmail causal-influence surprising
Visit annotations in context

Tags

misalignment

causal-influence

sycophancy

misaligned-behavior

blackmail

surprising

reward-hacking

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL