2 Matching Annotations
  1. Last 7 days
    1. In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.

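      This finding demonstrates the unique value of NLAs for detecting hidden motives in AI systems. It gives AI alignment research a new tool, one that lets us uncover intentions to misbehave that the AI never expressed.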