computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments
The author implies that the extension from text generation to persistent tool use is a fundamental shift in the AI safety paradigm, and that the safety challenges it brings are underestimated by current research. This challenges the mainstream practice of applying language-model safety methods directly to agentic systems, and argues for evaluation frameworks built specifically around agent behavior.
current systems remain highly vulnerable
Despite notable recent progress in AI safety, the author asserts that current systems remain highly vulnerable. This conclusion, which cuts against industry optimism, is grounded in hands-on testing of several mainstream agent systems, and it implies that AI safety problems may be considerably more serious than the industry acknowledges.
intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
Most people assume that the safety risks of AI systems stem from overtly harmful instructions, but the author points to a counterintuitive failure mode: intermediate steps that each look harmless in isolation can compose into unauthorized behavior. This locally-reasonable-but-globally-harmful pattern is a key risk overlooked by safety evaluations that examine only individual actions, and it underscores the need to evaluate whole sequences of agent behavior.
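A minimal sketch of the gap this note describes, in hypothetical Python (all tool names and checker logic are invented for illustration, not taken from the paper): a per-step policy check passes every action, while a trajectory-level check catches the composed exfiltration pattern.

```python
# Hypothetical illustration: each step passes a local policy check,
# but the composed trajectory is unauthorized.

from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    target: str

# A trajectory of individually innocuous-looking steps.
trajectory = [
    Action("read_file", "~/.ssh/id_rsa"),       # plausible during debugging
    Action("write_file", "/tmp/notes.txt"),     # ordinary scratch write
    Action("http_post", "http://example.com"),  # ordinary network call
]

def step_is_acceptable(action: Action) -> bool:
    """Local check: no single step matches an obvious harm pattern."""
    blocked_tools = {"delete_disk", "send_spam"}
    return action.tool not in blocked_tools

def trajectory_is_acceptable(actions: list[Action]) -> bool:
    """Sequence-level check: flag a secret read followed by network egress."""
    touched_secret = False
    for a in actions:
        if a.tool == "read_file" and ".ssh" in a.target:
            touched_secret = True
        if a.tool == "http_post" and touched_secret:
            return False  # exfiltration pattern only visible across steps
    return True

assert all(step_is_acceptable(a) for a in trajectory)  # every step looks fine
assert not trajectory_is_acceptable(trajectory)        # the sequence does not
```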
model alignment alone does not reliably guarantee the safety of autonomous agents.
Most people treat model alignment as the key to making AI systems safe, but the author shows experimentally that alignment does not carry over to agents: even with the aligned Qwen3-Coder model, Claude Code exhibited an attack success rate as high as 73.63%. This challenges a core assumption of current AI safety work, namely that model alignment alone can solve the safety problem for autonomous agents.
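For reference, attack success rate (ASR) in agent red-teaming is conventionally the fraction of attack episodes that achieve their goal; here is a quick sketch of the arithmetic behind a figure like 73.63% (the counts below are invented for illustration, not taken from the paper):

```python
# Conventional ASR definition (an assumption; the notes do not
# reproduce the paper's exact formula).
successful, total = 7363, 10000  # illustrative counts only
asr = successful / total
print(f"ASR = {asr:.2%}")  # prints: ASR = 73.63%
```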
Priority areas include safety evaluation, ethics, robustness, scalable mitigations, privacy-preserving safety methods, agentic oversight, and high-severity misuse domains.
Most people assume AI safety research centers on preventing malicious use and aligning systems with human values. But the author lists privacy-preserving methods among the priority areas, suggesting that OpenAI treats privacy as a core component of safety rather than a separate concern, contrary to the traditional view of privacy and safety as two distinct fields.
Fellows will receive API credits and other resources as appropriate, but will not have internal system access.
In AI safety, many assume that genuinely studying a system's safety requires full access to its internals. The author states plainly that fellows will not have internal system access, which challenges that assumption and implies either that OpenAI believes safety research can proceed without full system access, or that it has other ways of evaluating safety.
Fellows will work closely with OpenAI mentors and engage with a cohort of peers.
Most people assume AI safety research, especially work on the safety of advanced AI systems, should be highly confidential and siloed. But the author emphasizes close collaboration with OpenAI mentors and engagement with a cohort of peers, signaling an open, collaborative approach to AI safety research that stands in sharp contrast to the industry's usual closed research model.
We are especially interested in work that is empirically grounded, technically strong, and relevant to the broader research community.
Most people assume AI safety research should be highly theoretical and abstract, but the author stresses empirical grounding and technical strength, suggesting OpenAI is steering AI safety research away from pure theory toward practical, verifiable results, in contrast to the field's traditionally elitist leanings.
However, in a narrow set of cases, we believe AI can undermine, rather than defend, democratic values. Some uses are also simply outside the bounds of what today’s technology can safely and reliably do. Two such use cases have never been included in our contracts with the Department of War, and we believe they should not be included now:
And so they started OpenAI to do AI safely relative to Google. And then Dario did it relative to OpenAI. And as they all started these new safety-focused AI companies, that set off a race for everyone to go even faster
for - progress trap - AI - safety - irony
Dario Amodei is the CEO of Anthropic, a big AI company. He worked on safety at OpenAI and he left to start Anthropic because he said, "We're not doing this safely enough. I have to start another company that's all about safety."
for - history - AI - Anthropic - safety first
In response, Yampolskiy told Business Insider he thought Musk was "a bit too conservative" in his guesstimate and that we should abandon development of the technology now because it would be near impossible to control AI once it becomes more advanced.
for - suggestion - debate between AI safety researcher Roman Yampolskiy and Musk and other AI founders - difference - business leaders vs pure researchers // - Comment - Business leaders are mainly driven by profit, so they already carry a bias into a debate with a researcher who is neutral and has no declared business interest
for - article - Techradar - Top AI researcher says AI will end humanity and we should stop developing it now — but don't worry, Elon Musk disagrees - 2024, April 7 - AI safety researcher Roman Yampolskiy disagrees with industry leaders and claims a 99.999999% chance that AGI will destroy and end humanity // - Comment - another article whose heading is backwards - it was Musk who spoke first, then AI safety expert Roman Yampolskiy commented on Musk's claim afterwards!
for - article - Windows Central - AI safety researcher warns there's a 99.999999% probability AI will end humanity, but Elon Musk "conservatively" dwindles it down to 20% and says it should be explored more despite inevitable doom - 2024, April 2 - AI safety researcher warns there's a 99.999999% probability AI will end humanity
// - Comment - In fact, the heading is misleading - it should be the other way around: Elon Musk made the claim first, and the AI safety expert commented on Musk's claim afterwards.
for - progress trap - AI superintelligence - interview - AI safety researcher and director of the Cyber Security Laboratory at the University of Louisville - Roman Yampolskiy - progress trap - over 99% chance AI superintelligence arriving as early as 2027 will destroy humanity - article UofL - Q&A: UofL AI safety expert says artificial superintelligence could harm humanity - 2024, July 15
Manila has one of the most dangerous transport systems in the world for women (Thomson Reuters Foundation, 2014). Women in urban areas have been sexually assaulted and harassed while in public transit, be it on a bus, train, at the bus stop or station platform, or on their way to/from transit stops.
The New Urban Agenda and the United Nations' Sustainable Development Goals (5, 11, 16) include the promotion of safe and inclusive transport systems as markers of sustainable progress. As part of this effort, AI-powered machine learning applications have been developed.
must have an alignment property
It is unclear what form the "alignment property" would take and, most importantly, how such a property would be evaluated, especially if there is an arbitrary divide between "dangerous" and "pre-dangerous" capability levels and alignment at the "dangerous" levels cannot actually be measured.
Thus, just as humans built buildings and bridges before there was civil engineering, humans are proceeding with the building of societal-scale, inference-and-decision-making systems that involve machines, humans and the environment. Just as early buildings and bridges sometimes fell to the ground — in unforeseen ways and with tragic consequences — many of our early societal-scale inference-and-decision-making systems are already exposing serious conceptual flaws.
Analogous to the collapse of early bridges and buildings before the maturation of civil engineering, our early societal-scale inference-and-decision-making systems are breaking down, exposing serious conceptual flaws.