Hypothesis

2 Matching Annotations

May 2026
www.anthropic.com

Natural Language Autoencoders

1. fxp007 15 May 2026
  
  in Public
  
  An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.
  
  NLAs significantly improve an auditor's ability to uncover a model's hidden motivations, even without access to the training data.
  
  auditing-capability hidden-motivations
2. fxp007 15 May 2026
  
  in Public
  
  In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
  
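  This finding demonstrates the unique value of NLAs for detecting hidden motivations in AI. It gives AI alignment research a new tool for uncovering an AI's unexpressed intentions to misbehave.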
  
  AI alignment hidden motivations

Tags

hidden-motivations

auditing-capability

AI alignment

hidden motivations

Annotators

fxp007

URL

anthropic.com/research/natural-language-autoencoders