Hypothesis

2 Matching Annotations

Apr 2026
arxiv.org arxiv.org

https://arxiv.org/abs/2604.02947

2
1. fxp007 08 Apr 2026
  
  in Public
  
  intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
  
  大多数人认为AI系统的安全问题主要来自明显的有害指令，但作者揭示了一个反直觉的现象：局部看似无害的中间步骤可能组合起来导致未授权行为。这挑战了传统安全评估中只关注直接有害行为的做法，强调了评估代理行为序列的重要性。
  
  non-consensus ai-safety intermediate-actions
2. fxp007 08 Apr 2026
  
  in Public
  
  intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
  
  大多数人认为AI代理的安全风险主要来自直接执行有害指令，但作者发现真正的威胁来自那些在局部看来完全合理但整体上导致未授权行为的中间步骤。这种局部合理但整体有害的行为模式是当前安全评估中被忽视的关键风险。
  
  non-consensus ai-safety intermediate-actions
Visit annotations in context

Tags

ai-safety

intermediate-actions

non-consensus

Annotators

fxp007

URL

arxiv.org/abs/2604.02947