In more detail, suppose we have a language model whose activations we want to understand. NLAs work as follows: we make three copies of this language model. The target model is a frozen copy of the original language model, and it is the copy we extract activations from.
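NLAs make a model's activations understandable by creating three copies of the model (the target model, the activation verbalizer, and the activation reconstructor).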
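
The three-copy setup can be illustrated with a minimal sketch. It is not the actual NLA implementation: `ToyModel`, `make_nla_copies`, and the `trainable` flag are hypothetical names invented for illustration, and the "model" is just a list of numbers. The text only states that the target model is frozen; treating the other two copies as trainable is an assumption.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model: just a list of weights."""
    weights: list
    trainable: bool = True

    def update(self, delta: float) -> None:
        # Pretend training step; a frozen model refuses to change.
        if not self.trainable:
            raise RuntimeError("model is frozen")
        self.weights = [w + delta for w in self.weights]


def make_nla_copies(original: ToyModel) -> dict:
    """Return three independent copies of `original`, with the target frozen."""
    target = copy.deepcopy(original)
    target.trainable = False  # the text says the target model is a frozen copy

    # Assumption: the verbalizer and reconstructor copies remain trainable.
    verbalizer = copy.deepcopy(original)
    reconstructor = copy.deepcopy(original)

    return {
        "target": target,
        "verbalizer": verbalizer,
        "reconstructor": reconstructor,
    }


base = ToyModel(weights=[0.1, 0.2, 0.3])
copies = make_nla_copies(base)

# Changing one copy leaves the frozen target and the original untouched.
copies["verbalizer"].update(0.5)
```

Deep copies are used so that the three models share no state: updating one copy cannot silently change the target whose activations are being read.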