1 Matching Annotations
  1. Last 7 days
    1. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

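      NLAs let auditors extract hidden motivations directly from the AI's internal reasoning without relying on the training data. This greatly improves the efficiency of AI alignment audits and offers a new way to uncover a model's internal biases.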