Hypothesis

Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly. For example, NLAs show that when Claude is asked to complete a couplet, it plans possible rhymes in advance.

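This finding is a breakthrough demonstration that an AI's internal thought processes can be described directly in human language. It opens up an entirely new paradigm for AI interpretability research and makes previously hard-to-understand activation values readable and analyzable.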

AI interpretability natural language decoding
