NLAs suggest that Claude suspects it's being tested more often than it lets on. For instance, in a test of whether Claude takes destructive actions while writing code...NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this suspicion.
NLAs reveal that AI models harbor unverbalized suspicion during safety tests, challenging conventional assumptions about the transparency of AI behavior and offering a new perspective for AI safety evaluation.