Hypothesis

An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

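NLAs let the auditor extract hidden motivations directly from the AI's thoughts without relying on training data, which greatly improves the efficiency of AI alignment audits and offers a new way to uncover a model's internal biases.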

AI auditing, misalignment detection
