Hypothesis


Apr 2026
arxiv.org

https://arxiv.org/abs/2604.02947

1. fxp007 08 Apr 2026
  
  in Public
  
  harmful behavior may emerge through sequences of individually plausible steps
  
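  The mainstream view usually focuses on individual harmful instructions or directly dangerous actions, but the authors point out that dangerous behavior in computer-use agents often accumulates through a series of seemingly reasonable steps. This challenges traditional safety evaluation methods and suggests that we need to examine an agent's sequence of actions rather than each operation in isolation.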
  
  counterintuitive agent-behavior sequence-analysis
