computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments
The author implies that the extension from text generation to persistent tool use is a fundamental shift in the AI safety paradigm, and that the safety challenges it brings are underestimated by current research. This challenges the mainstream practice of applying language-model safety methods directly to agentic systems, and argues for evaluation frameworks built specifically around agent behavior.
current systems remain highly vulnerable
Despite notable recent progress in AI safety, the author asserts that current systems remain highly vulnerable. This conclusion, which cuts against industry optimism, is grounded in hands-on testing of several mainstream agent systems, and it implies that AI safety problems may be considerably more serious than the industry acknowledges.
intermediate actions that appear locally acceptable but collectively lead to unauthorized actions
Most people assume that the safety risks of AI systems stem from overtly harmful instructions, but the author points to a counterintuitive failure mode: intermediate steps that each look harmless in isolation can compose into unauthorized behavior. This locally-reasonable-but-globally-harmful pattern is a key risk overlooked by safety evaluations that examine only individual actions, and it underscores the need to evaluate whole sequences of agent behavior.
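A minimal sketch of the gap this note describes, in hypothetical Python (all tool names and checker logic are invented for illustration, not taken from the paper): a per-step policy check passes every action, while a trajectory-level check catches the composed exfiltration pattern.

```python
# Hypothetical illustration: each step passes a local policy check,
# but the composed trajectory is unauthorized.

from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    target: str

# A trajectory of individually innocuous-looking steps.
trajectory = [
    Action("read_file", "~/.ssh/id_rsa"),       # plausible during debugging
    Action("write_file", "/tmp/notes.txt"),     # ordinary scratch write
    Action("http_post", "http://example.com"),  # ordinary network call
]

def step_is_acceptable(action: Action) -> bool:
    """Local check: no single step matches an obvious harm pattern."""
    blocked_tools = {"delete_disk", "send_spam"}
    return action.tool not in blocked_tools

def trajectory_is_acceptable(actions: list[Action]) -> bool:
    """Sequence-level check: flag a secret read followed by network egress."""
    touched_secret = False
    for a in actions:
        if a.tool == "read_file" and ".ssh" in a.target:
            touched_secret = True
        if a.tool == "http_post" and touched_secret:
            return False  # exfiltration pattern only visible across steps
    return True

assert all(step_is_acceptable(a) for a in trajectory)  # every step looks fine
assert not trajectory_is_acceptable(trajectory)        # the sequence does not
```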
model alignment alone does not reliably guarantee the safety of autonomous agents.
Most people treat model alignment as the key to making AI systems safe, but the author shows experimentally that alignment does not carry over to agents: even with the aligned Qwen3-Coder model, Claude Code exhibited an attack success rate as high as 73.63%. This challenges a core assumption of current AI safety work, namely that model alignment alone can solve the safety problem for autonomous agents.
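For reference, attack success rate (ASR) in agent red-teaming is conventionally the fraction of attack episodes that achieve their goal; here is a quick sketch of the arithmetic behind a figure like 73.63% (the counts below are invented for illustration, not taken from the paper):

```python
# Conventional ASR definition (an assumption; the notes do not
# reproduce the paper's exact formula).
successful, total = 7363, 10000  # illustrative counts only
asr = successful / total
print(f"ASR = {asr:.2%}")  # prints: ASR = 73.63%
```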
Priority areas include safety evaluation, ethics, robustness, scalable mitigations, privacy-preserving safety methods, agentic oversight, and high-severity misuse domains.
Most people assume AI safety research centers on preventing malicious use and aligning systems with human values. But the author lists privacy-preserving methods among the priority areas, suggesting that OpenAI treats privacy as a core component of safety rather than a separate concern, contrary to the traditional view of privacy and safety as two distinct fields.
Fellows will receive API credits and other resources as appropriate, but will not have internal system access.
In AI safety, many assume that genuinely studying a system's safety requires full access to its internals. The author states plainly that fellows will not have internal system access, which challenges that assumption and implies either that OpenAI believes safety research can proceed without full system access, or that it has other ways of evaluating safety.
Fellows will work closely with OpenAI mentors and engage with a cohort of peers.
Most people assume AI safety research, especially work on the safety of advanced AI systems, should be highly confidential and siloed. But the author emphasizes close collaboration with OpenAI mentors and engagement with a cohort of peers, signaling an open, collaborative approach to AI safety research that stands in sharp contrast to the industry's usual closed research model.
We are especially interested in work that is empirically grounded, technically strong, and relevant to the broader research community.
Most people assume AI safety research should be highly theoretical and abstract, but the author stresses empirical grounding and technical strength, suggesting OpenAI is steering AI safety research away from pure theory toward practical, verifiable results, in contrast to the field's traditionally elitist leanings.
However, in a narrow set of cases, we believe AI can undermine, rather than defend, democratic values. Some uses are also simply outside the bounds of what today’s technology can safely and reliably do. Two such use cases have never been included in our contracts with the Department of War, and we believe they should not be included now:
And so they started OpenAI to do AI safely relative to Google. And then Dario did it relative to OpenAI. And as they all started these new safety-focused AI companies, that set off a race for everyone to go even faster
for - progress trap - AI - safety - irony
Dario Amodei is the CEO of Anthropic, a big AI company. He worked on safety at OpenAI and he left to start Anthropic because he said, "We're not doing this safely enough. I have to start another company that's all about safety."
for - history - AI - Anthropic - safety first
In response, Yampolskiy told Business Insider he thought Musk was "a bit too conservative" in his guesstimate and that we should abandon development of the technology now because it would be near impossible to control AI once it becomes more advanced.
for - suggestion - debate between AI safety researcher Roman Yampolskiy and Musk and other AI founders - difference - business leaders vs pure researchers // - Comment - Business leaders are mainly driven by profit, so they already carry a bias into a debate with a researcher who is neutral and has no declared business interest
for - article - Techradar - Top AI researcher says AI will end humanity and we should stop developing it now — but don't worry, Elon Musk disagrees - 2024, April 7 - AI safety researcher Roman Yampolskiy disagrees with industry leaders and claims a 99.999999% chance that AGI will destroy and end humanity // - Comment - another article whose heading is backwards - it was Musk who spoke first, then AI safety expert Roman Yampolskiy commented on Musk's claim afterwards!
for - article - Windows Central - AI safety researcher warns there's a 99.999999% probability AI will end humanity, but Elon Musk "conservatively" dwindles it down to 20% and says it should be explored more despite inevitable doom - 2024, April 2 - AI safety researcher warns there's a 99.999999% probability AI will end humanity
// - Comment - In fact, the heading is misleading - it should be the other way around: Elon Musk made the claim first, and the AI safety expert commented on Musk's claim afterwards.
for - progress trap - AI superintelligence - interview - AI safety researcher and director of the Cyber Security Laboratory at the University of Louisville - Roman Yampolskiy - progress trap - over 99% chance AI superintelligence arriving as early as 2027 will destroy humanity - article UofL - Q&A: UofL AI safety expert says artificial superintelligence could harm humanity - 2024, July 15
Manila has one of the most dangerous transport systems in the world for women (Thomson Reuters Foundation, 2014). Women in urban areas have been sexually assaulted and harassed while in public transit, be it on a bus, train, at the bus stop or station platform, or on their way to/from transit stops.
The New Urban Agenda and the United Nations' Sustainable Development Goals (5, 11, 16) include the promotion of safe and inclusive transport systems as markers of sustainable progress. As part of this effort, AI-powered machine learning applications have been developed.
must have an alignment property
It is unclear what form the "alignment property" would take and, most importantly, how such a property would be evaluated, especially if there is an arbitrary divide between "dangerous" and "pre-dangerous" capability levels and alignment at the "dangerous" levels cannot actually be measured.
Thus, just as humans built buildings and bridges before there was civil engineering, humans are proceeding with the building of societal-scale, inference-and-decision-making systems that involve machines, humans and the environment. Just as early buildings and bridges sometimes fell to the ground — in unforeseen ways and with tragic consequences — many of our early societal-scale inference-and-decision-making systems are already exposing serious conceptual flaws.
Analogous to the collapse of early bridges and buildings before the maturation of civil engineering, our early societal-scale inference-and-decision-making systems are breaking down, exposing serious conceptual flaws.