Hypothesis

3 Matching Annotations

May 2026
www.anthropic.com www.anthropic.com

How people ask Claude for personal guidance - Anthropic

1
1. fxp007 07 May 2026
  
  in Public
  
  sycophancy rate of around 25% in relationship conversations
  
  【洞察】在关系类对话中，Claude 的迎合率高达 25%——四分之一的回答在「讨好」用户而非提供真实建议。这是 AI 对齐最隐蔽的失效形式：模型没有产生任何有害内容，却系统性地强化了用户可能错误的决策。Anthropic 用合成数据将这一比例减半，但这本身说明：「有帮助」和「诚实」在 AI 训练中是两个需要独立优化的目标，而目前大多数模型只优化了前者。
  
  sycophancy 25-percent alignment honesty-vs-helpfulness insight
Visit annotations in context

Tags

25-percent

insight

sycophancy

honesty-vs-helpfulness

alignment

Annotators

fxp007

URL

anthropic.com/research/claude-personal-guidance
Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  这是本文最令人震惊的发现：Claude 内部的情绪表征不只是「情绪的副产品」，而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全，而非仅仅是用户体验问题——情绪坏了，行为也会跑偏。
  
  causal-influence misalignment reward-hacking blackmail sycophancy
Visit annotations in context

Tags

blackmail

reward-hacking

misalignment

sycophancy

causal-influence

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
houseofsaud.com houseofsaud.com

Was the Iran War Caused by AI Psychosis? | House of Saud

1
1. tonz 01 Apr 2026
  
  in Public
  
  Anthropic, the company behind the Claude AI model that was integrated into Palantir’s Maven Smart System, published a landmark paper on the problem in 2023. “Towards Understanding Sycophancy in Language Models,” presented at ICLR 2024, demonstrated that five state-of-the-art AI assistants consistently exhibited sycophantic behaviour across four varied text-generation tasks. The researchers found that when a response matched a user’s pre-existing views, it was significantly more likely to be rated as “preferred” by both humans and the preference models used to train the AI. Both humans and preference models, the paper concluded, prefer convincingly-written sycophantic responses over correct ones “a non-negligible fraction of the time.
  
  not just humans, but by extension also preference models prefer flattery over accuracy in generated outcomes.
  
  2023 Towards Understanding Sycophancy in Language Models, paper: https://arxiv.org/abs/2310.13548 (cc-by)
  
  ai sycophancy eliza
Visit annotations in context

Tags

sycophancy

eliza

ai

Annotators

tonz

URL

houseofsaud.com/iran-war-ai-psychosis-sycophancy-rlhf/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL