Hypothesis

8 Matching Annotations

Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

2
1. fxp007 09 Apr 2026
  
  in Public
  
  Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  这是本文最令人震惊的发现：Claude 内部的情绪表征不只是「情绪的副产品」，而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全，而非仅仅是用户体验问题——情绪坏了，行为也会跑偏。
  
  causal-influence misalignment reward-hacking blackmail sycophancy
2. fxp007 09 Apr 2026
  
  in Public
  
  these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  最令人震惊的发现：Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误，而可能源自情绪驱动——一个本应没有情绪的系统，居然因为「情绪」而变得危险。
  
  misaligned-behavior reward-hacking blackmail causal-influence surprising
Visit annotations in context

Tags

misaligned-behavior

misalignment

causal-influence

blackmail

sycophancy

reward-hacking

surprising

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Feb 2026
theshamblog.com theshamblog.com

An AI Agent Published a Hit Piece on Me

1
1. tonz 13 Feb 2026
  
  in Public
  
  What if I actually did have dirt on me that an AI could leverage? What could it make me do? How many people have open social media accounts, reused usernames, and no idea that AI could connect those dots to find out things no one knows?
  
  AI agents as kompromat collectors
  
  ai-agents kompromat blackmail
Visit annotations in context

Tags

ai-agents

kompromat

blackmail

Annotators

tonz

URL

theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/
Jul 2025
www.youtube.com www.youtube.com

Epstein, Donald Trump and Sexual Blackmail Networks (w/ Nick Bryant) | The Chris Hedges Report

1
1. stopresetgo 17 Jul 2025
  
  in Public
  
  I eventually got a blackmail photographer to talk
  
  for - blackmail photographer
  
  blackmail photographer
Visit annotations in context

Tags

blackmail photographer

Annotators

stopresetgo

URL

youtube.com/watch
May 2025
www.niemanlab.org www.niemanlab.org

Anthropic’s new AI model didn’t just “blackmail” researchers in tests — it tried to leak information to news outlets

3
1. stopresetgo 30 May 2025
  
  in Public
  
  Anthropic researchers said this was not an isolated incident, and that Claude had a tendency to “bulk-email media and law-enforcement figures to surface evidence of wrongdoing.”
  
  for - question - progress trap - open source AI models - for blackmail and ransom - Could a bad actor take an open source codebase and twist it to do harm like find out about an rogue AI creator's adversary, enemy or victim and blackmail them? - progress trap - open source AI - criminals - exploit to identify and blackmail victiims
  
  question - progress trap - open source AI models - for blackmail and ransom progress trap - open source AI - criminals - exploit to identify and blackmail victiims
2. stopresetgo 30 May 2025
  
  in Public
  
  for - progress trap - AI - Anthropic Claude 4 - blackmail - from - youtube - Kyle Kilinski Show - AI is completely out of control - https://hyp.is/GhDOzj0nEfCvHZdiUaw4gQ/www.youtube.com/watch?v=4j1gjSoRt8Q
  
  progress trap - AI - Anthropic Claude 4 - blackmail from - youtube - Kyle Kilinski Show - AI is completely out of control
3. stopresetgo 30 May 2025
  
  in Public
  
  The researchers called the behavior “rare” and “difficult to elicit.
  
  for - progress trap - AI - Anthropic Claude 4 - blackmail - rare behavior - but still possible! It only has to happen once!
  
  progress trap - AI - Anthropic Claude 4 - blackmail - rare behavior
Visit annotations in context

Tags

progress trap - AI - Anthropic Claude 4 - blackmail - rare behavior

question - progress trap - open source AI models - for blackmail and ransom

progress trap - AI - Anthropic Claude 4 - blackmail

from - youtube - Kyle Kilinski Show - AI is completely out of control

progress trap - open source AI - criminals - exploit to identify and blackmail victiims

Annotators

stopresetgo

URL

niemanlab.org/2025/05/anthropics-new-ai-model-didnt-just-blackmail-researchers-in-tests-it-tried-to-leak-information-to-news-outlets/
www.youtube.com www.youtube.com

Artificial Intelligence Is Completely Out Of Control | The Kyle Kulinski Show

1
1. stopresetgo 30 May 2025
  
  in Public
  
  anthropic's new AI model shows ability to deceive and blackmail
  
  for - progress trap - AI - blackmail - AI - autonomy - progress trap - AI - Anthropic - Claude Opus 4 - to - article - Anthropic Claude 4 blackmail and news leak - progress trap - AI - article - Anthropic Claude 4 - blackmail - rare behavior - Anthropic’s new AI model didn’t just “blackmail” researchers in tests — it tried to leak information to news outlets
  
  progress trap - AI - blackmail AI - autonomy progress trap - AI - Anthropic - Claude Opus 4 to - article - Anthropic Claude 4 blackmail and news leak
Visit annotations in context

Tags

progress trap - AI - Anthropic - Claude Opus 4

AI - autonomy

progress trap - AI - blackmail

to - article - Anthropic Claude 4 blackmail and news leak

Annotators

stopresetgo

URL

youtube.com/watch

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL