1 Matching Annotations
  1. Last 7 days
    1. We found that NLAs succeeded in this test. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.

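      This experimental result shows that NLAs can extract hidden motivations directly from an AI's thinking, without relying on analysis of the training data. It offers an entirely new method for AI auditing and significantly improves the ability to detect AI alignment problems.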