An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.
NLA使审计者能够在没有访问训练数据的情况下,成功发现模型隐藏动机的能力显著提高。
An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.
NLA使审计者能够在没有访问训练数据的情况下,成功发现模型隐藏动机的能力显著提高。
In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
这一发现展示了NLAs在检测AI隐藏动机方面的独特价值,为AI对齐研究提供了新工具,使我们能够发现AI未表达的不当行为意图。