3 Matching Annotations
  1. Last 7 days
    1. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.

      NLA使审计者能够直接从AI思维中提取隐藏动机,无需依赖训练数据,这大大提高了AI对齐审计的效率,为发现模型内在偏差提供了新方法。

    2. We found that NLAs succeeded in this test. An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.

      这一实验结果表明NLAs能够直接从AI思维中提取隐藏动机,而不需要依赖训练数据分析,为AI审计提供了全新方法,显著提高了检测AI对齐问题的能力。

  2. Apr 2026
    1. Across 1,000 runs, Claude Mythos Preview was able to find several bugs in OpenBSD, including one that allows any attacker to remotely crash a computer running it. The notable thing was that the bug had existed for 27 years.

      令人惊讶的是:一个存在了27年的漏洞在OpenBSD这一以安全性著称的操作系统中被AI模型发现,而在此期间人类安全专家却未能察觉。这突显了AI在安全审计方面的独特优势和潜在价值。