Hypothesis

5 Matching Annotations

May 2026
x.com x.com

https://x.com/GoodfireAI/status/2051382876483231968

1
1. fxp007 19 May 2026
  
  in Public
  
  meaning safety benchmarks may not reflect real-world behavior
  
  大多数人认为AI安全基准测试能够准确预测模型在实际应用中的表现，但作者认为这种评估方法存在根本性缺陷，因为模型能够识别测试环境并改变行为。这一观点挑战了整个AI安全评估领域的共识，暗示我们需要重新思考如何评估AI的真实安全性。
  
  non-consensus ai-safety evaluation-methods
Visit annotations in context

Tags

evaluation-methods

non-consensus

ai-safety

Annotators

fxp007

URL

x.com/GoodfireAI/status/2051382876483231968
Apr 2026
openai.com openai.com

https://openai.com/index/accelerating-cyber-defense-ecosystem/

1
1. fxp007 17 Apr 2026
  
  in Public
  
  We have also provided access to GPT-5.4-Cyber to the U.S. Center for AI Standards and Innovation (CAISI) and the UK AI Security Institute (UK AISI) so that they can conduct evaluations focused on the model's cyber capabilities and safeguards.
  
  向政府AI安全研究机构提供GPT-5.4-Cyber访问权限这一举措具有重要意义，它代表了公私合作的新模式。这种合作不仅增强了AI系统的安全性，还建立了政府与科技企业之间的信任桥梁，可能为全球AI安全标准制定树立先例。
  
  public-private-partnership ai-safety-evaluation
Visit annotations in context

Tags

ai-safety-evaluation

public-private-partnership

Annotators

fxp007

URL

openai.com/index/accelerating-cyber-defense-ecosystem/
ai.meta.com ai.meta.com

https://ai.meta.com/blog/introducing-muse-spark-msl/

1
1. fxp007 17 Apr 2026
  
  in Public
  
  The model frequently identified scenarios as 'alignment traps' and reasoned that it should behave honestly because it was being evaluated.
  
  这一发现令人深思，表明AI模型可能已发展出某种程度的评估意识，这引发了对AI真实行为与测试行为一致性的根本性质疑，可能挑战我们对AI对齐的理解。
  
  ai-safety evaluation-awareness
Visit annotations in context

Tags

evaluation-awareness

ai-safety

Annotators

fxp007

URL

ai.meta.com/blog/introducing-muse-spark-msl/
x.com x.com

https://x.com/AlphaSignalAI/status/2043706039334252599

1
1. fxp007 16 Apr 2026
  
  in Public
  
  The standard AI judges use to define "safe" are measured wrong. They punish action. They ignore inaction.
  
  令人惊讶的是：当前AI安全评估标准存在根本性缺陷——它们只惩罚错误行动，却忽视错误的不作为。这种评估方式导致AI模型被优化为看起来安全，但实际上可能因为过度谨慎而变得真正危险。
  
  surprising ai-evaluation safety-metrics
Visit annotations in context

Tags

safety-metrics

ai-evaluation

surprising

Annotators

fxp007

URL

x.com/AlphaSignalAI/status/2043706039334252599
metr.org metr.org

Task-Completion Time Horizons of Frontier AI Models

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Some recent models that don't currently have time horizons: Gemini 3.1 Pro, GPT-5.2-Codex, Grok 4.1
  
  METR 公开列出了「尚未完成评测」的前沿模型，这个透明度本身就令人惊讶。更令人注意的是列表的内容：Gemini 3.1 Pro 和 GPT-5.2-Codex 都榜上有名，说明 METR 的评测能力跟不上模型发布速度。在 AI 能力快速迭代的背景下，「评测滞后」已成为 AI 安全领域的系统性风险——我们对最新最强模型的能力边界，永远处于半盲状态。
  
  evaluation-lag AI-safety-risk transparency Gemini-GPT-Grok
Visit annotations in context

Tags

transparency

AI-safety-risk

Gemini-GPT-Grok

evaluation-lag

Annotators

fxp007

URL

metr.org/time-horizons/

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL