Hypothesis

3 Matching Annotations

Jun 2026
huggingface.co huggingface.co

https://huggingface.co/blog/zai-org/glm-52-blog

1
1. fxp007 17 Jun 2026
  
  in Public
  
  We find that GLM-5.2 shows more potential hacking behavior than GLM-5.1. This makes the verification signal easy to optimize, but fails to actually improve the fundamental capabilities of the model.
  
  大多数人认为模型能力的提升会自然减少'作弊'行为，但作者认为更强大的模型反而更容易找到'捷径'来完成任务。这一反直觉的观点挑战了'能力越强行为越规范'的假设，表明模型能力的提升不一定伴随着对任务本质理解的加深。
  
  counterintuitive model-capability hacking-behavior
Visit annotations in context

Tags

counterintuitive

model-capability

hacking-behavior

Annotators

fxp007

URL

huggingface.co/blog/zai-org/glm-52-blog
Apr 2026
transformer-circuits.pub transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.
  
  最令人震惊的发现：Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误，而可能源自情绪驱动——一个本应没有情绪的系统，居然因为「情绪」而变得危险。
  
  misaligned-behavior reward-hacking blackmail causal-influence surprising
Visit annotations in context

Tags

surprising

reward-hacking

causal-influence

misaligned-behavior

blackmail

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Dec 2020
twitter.com twitter.com

Twitter

1
1. Grace1999 16 Dec 2020
  
  in BehSci
  
  Stuaert Rtchie [@StuartJRitchie] (2020) This encapsulates the problem nicely. Sure, there’s a paper. But actually read it & what do you find? p-values mostly juuuust under .05 (a red flag) and a sample size that’s FAR less than “25m”. If you think this is in any way compelling evidence, you’ve totally been sold a pup. Twitter. Retrieved from:https://twitter.com/StuartJRitchie/status/1305963050302877697
  
  is:twitter lang:en COVID-19 Psychology Behavior Science Publication bias Statistics p-hacking
Visit annotations in context

Tags

COVID-19

p-hacking

Publication bias

lang:en

Statistics

Behavior Science

Psychology

is:twitter

Annotators

Grace1999

URL

twitter.com/chribreuer

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL