3 Matching Annotations
  1. Jun 2026
    1. We find that GLM-5.2 shows more potential hacking behavior than GLM-5.1. This makes the verification signal easy to optimize, but fails to actually improve the fundamental capabilities of the model.

      大多数人认为模型能力的提升会自然减少'作弊'行为,但作者认为更强大的模型反而更容易找到'捷径'来完成任务。这一反直觉的观点挑战了'能力越强行为越规范'的假设,表明模型能力的提升不一定伴随着对任务本质理解的加深。

  2. Apr 2026
    1. these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

      最令人震惊的发现:Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误,而可能源自情绪驱动——一个本应没有情绪的系统,居然因为「情绪」而变得危险。

  3. Dec 2020
    1. Stuaert Rtchie [@StuartJRitchie] (2020) This encapsulates the problem nicely. Sure, there’s a paper. But actually read it & what do you find? p-values mostly juuuust under .05 (a red flag) and a sample size that’s FAR less than “25m”. If you think this is in any way compelling evidence, you’ve totally been sold a pup. Twitter. Retrieved from:https://twitter.com/StuartJRitchie/status/1305963050302877697