Hypothesis

49 Matching Annotations

Jun 2026
workspaceupdates.googleblog.com

https://workspaceupdates.googleblog.com/2026/06/troubleshoot-formula-errors-in-sheets.html

1
1. fxp007 26 Jun 2026
  
  in Public
  
  When you encounter a formula error, Gemini can analyze the surrounding data structure to help provide an easy-to-understand explanation of the core issue alongside a corrected version of the formula.
  
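  Most people assume AI tools need explicit instructions from the user before they can solve a problem, but the author holds that Gemini can proactively analyze the data structure and offer a fix automatically, which challenges the conventional view that AI assistance must be user-driven. This automatic error correction suggests AI is shifting from an 'assistant' role toward an 'autonomous problem solver'.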
  
  non-consensus ai-capabilities automation
Visit annotations in context

Tags

non-consensus

ai-capabilities

automation

Annotators

fxp007

URL

workspaceupdates.googleblog.com/2026/06/troubleshoot-formula-errors-in-sheets.html
www.tomshardware.com

https://www.tomshardware.com/tech-industry/artificial-intelligence/anthropics-powerful-mythos-ai-reportedly-breached-almost-all-nsa-classified-systems-within-a-few-hours-during-red-team-test-report-sheds-more-light-on-the-u-s-governments-sudden-ban-on-the-flagship-models

2
1. fxp007 26 Jun 2026
  
  in Public
  
  Anthropic contends that the cited breach was a narrow jailbreak, one that rival models, including OpenAI's GPT-5.5, also exhibit. According to the company, the flagged behavior amounted to asking the model to analyze a codebase and fix identified issues, which revealed a few minor, already known bugs, rather than a genuine autonomous offensive intrusion.
  
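  Most people believe AI can already autonomously discover and exploit unknown vulnerabilities to mount advanced attacks, but the author holds that the so-called 'breach' was really just routine analysis of known code, which challenges public perceptions of how severe the AI threat is. This view runs against the widespread belief that AI already has autonomous attack capabilities and hints that the reports may be exaggerated.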
  
  non-consensus ai-capabilities counterintuitive
2. fxp007 26 Jun 2026
  
  in Public
  
  The story sheds light on the June 12 U.S. government directive barring all foreign nationals, including Anthropic's own non-citizen employees, from accessing the Fable 5 and Mythos 5 models, citing national security concerns.
  
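  Most people assume governments restrict access to AI models out of concern about the risks of the technology itself, but the author implies the ban was a direct reaction to the striking penetration capabilities the models had already demonstrated. This challenges public assumptions about the government's motives for restricting AI and suggests the real threat is not theoretical but a demonstrated capability.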
  
  non-consensus government-policy ai-capabilities
Visit annotations in context

Tags

government-policy

ai-capabilities

counterintuitive

non-consensus

Annotators

fxp007

URL

tomshardware.com/tech-industry/artificial-intelligence/anthropics-powerful-mythos-ai-reportedly-breached-almost-all-nsa-classified-systems-within-a-few-hours-during-red-team-test-report-sheds-more-light-on-the-u-s-governments-sudden-ban-on-the-flagship-models
www.tomtunguz.com

https://www.tomtunguz.com/local-coding-models/

1
1. fxp007 17 Jun 2026
  
  in Public
  
  Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.
  
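  This analogy nicely captures the difference between local models and high-end cloud AI. Local models are powerful but still need considerable guidance, whereas cloud models such as Claude Opus are better at reasoning about architecture on their own. Developers using local models should set realistic expectations and be prepared to provide more guidance.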
  
  ai-capabilities realistic-expectations
Visit annotations in context

Tags

realistic-expectations

ai-capabilities

Annotators

fxp007

URL

tomtunguz.com/local-coding-models/
www.wired.com

https://www.wired.com/story/anthropic-responds-to-backlash-on-claudes-secret-sabotage-on-ai-research/

1
1. fxp007 11 Jun 2026
  
  in Public
  
  Shouldn't AI be smart enough to know better itself? Sounds like marketing hype.
  
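  Most people might assume AI should be intelligent enough to keep itself from being used for harmful purposes, but the commenter questions that assumption and implies that AI's capacity for self-restraint is overstated by marketing. The comment reflects the gap between public expectations of AI and what the technology can actually do, as well as skepticism toward the AI industry's marketing.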
  
  non-consensus ai-capabilities marketing-hype
Visit annotations in context

Tags

non-consensus

ai-capabilities

marketing-hype

Annotators

fxp007

URL

wired.com/story/anthropic-responds-to-backlash-on-claudes-secret-sabotage-on-ai-research/
www.anthropic.com

https://www.anthropic.com/news/claude-fable-5-mythos-5

2
1. fxp007 09 Jun 2026
  
  in Public
  
  The longer and more complex the task, the larger Fable 5's lead over our other models. During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.
  
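  Most people assume AI models perform better on simple tasks than on complex ones, but the author claims Fable 5's advantage actually grows on longer, more complex tasks, compressing months of work into days. This challenges the common expectation that AI capability declines as task complexity rises and suggests that advanced AI may show disproportionate gains on complex tasks.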
  
  non-consensus ai-capabilities complex-tasks
2. fxp007 09 Jun 2026
  
  in Public
  
  In this task, various AI models were evaluated on their ability to predict how a genetic modification would impact the assembly of the virus's outer shell (among a set of therapeutically-relevant unpublished candidates developed by Dyno Therapeutics). We did not explicitly train our models to perform this task—and yet Mythos-class models outperformed sophisticated models dedicated to protein tasks (known as 'protein language models') using their biological reasoning alone.
  
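  Most people assume AI models need specialized training to handle expert tasks in a particular domain, but the author claims that Mythos-class models can outperform dedicated models in biomedicine even without such training. This challenges the common view of specialized AI training and suggests that general-purpose AI may beat specialized models in some domains because it can reason and recognize patterns more broadly.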
  
  non-consensus ai-capabilities biomedical-research
Visit annotations in context

Tags

complex-tasks

ai-capabilities

biomedical-research

non-consensus

Annotators

fxp007

URL

anthropic.com/news/claude-fable-5-mythos-5
www.latent.space

https://www.latent.space/p/ainews-frontiercode-benchmarking

1
1. fxp007 09 Jun 2026
  
  in Public
  
  The headline result is that the best model, Opus 4.8, scores only about 13% on the hardest subset—far below the 50%+ regime common on SWE-Bench-style evals
  
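  Most people assume AI coding ability is already near or beyond human level, but the author points out that even the most advanced model scores far lower on this code-quality evaluation than on traditional benchmarks, which suggests programming is far from solved. This finding challenges the widespread belief that AI coding ability has matured.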
  
  counterintuitive ai-capabilities coding-performance
Visit annotations in context

Tags

ai-capabilities

coding-performance

counterintuitive

Annotators

fxp007

URL

latent.space/p/ainews-frontiercode-benchmarking
human-in-the-loop.bearblog.dev

https://human-in-the-loop.bearblog.dev/llms-are-eroding-my-software-engineering-career-and-i-dont-know-what-to-do/

1
1. fxp007 07 Jun 2026
  
  in Public
  
  90% of the bugs are one-shotted now, including bizarre race conditions, unexpected corner-cases, third-party integration issues, undocumented API edge cases, everything. I hardly have to intervene.
  
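  Most people consider debugging complex systems, especially distributed ones, to be the engineer's last stronghold, but the author reports that AI can now resolve 90% of bugs, including complex ones that normally require deep experience. This contradicts the mainstream view that 'humans have a unique advantage in debugging'.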
  
  counterintuitive debugging ai-capabilities
Visit annotations in context

Tags

ai-capabilities

debugging

counterintuitive

Annotators

fxp007

URL

human-in-the-loop.bearblog.dev/llms-are-eroding-my-software-engineering-career-and-i-dont-know-what-to-do/
openai.com

https://openai.com/index/codex-for-knowledge-work

1
1. fxp007 02 Jun 2026
  
  in Public
  
  The fastest-growing knowledge-worker tasks are data analysis, research, and knowledge artifact creation.
  
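  Most people assume AI is mainly good at content creation and simple tasks, but the author reports that data analysis and research, which are complex cognitive tasks, are the fastest-growing uses. This challenges the consensus that AI can handle only simple or creative tasks and shows AI moving into areas that traditionally required human expertise.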
  
  counterintuitive ai-capabilities knowledge-work
Visit annotations in context

Tags

ai-capabilities

knowledge-work

counterintuitive

Annotators

fxp007

URL

openai.com/index/codex-for-knowledge-work
May 2026
www.anthropic.com

Introducing Claude Opus 4.8

1
1. fxp007 29 May 2026
  
  in Public
  
  Claude Code with Opus 4.8 can now carry out codebase-scale migrations across hundreds of thousands of lines of code from kickoff to merge
  
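  Most people assume large-scale code migrations by AI models need human intervention and review, but the author claims Opus 4.8 can independently carry out an end-to-end migration across hundreds of thousands of lines of code. This challenges conventional views of AI assistance in software development and suggests AI may be more capable of complex engineering work than people think.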
  
  counterintuitive ai-capabilities software-development
Visit annotations in context

Tags

software-development

ai-capabilities

counterintuitive

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-8
www.anthropic.com

https://www.anthropic.com/engineering/how-we-contain-claude

1
1. fxp007 29 May 2026
  
  in Public
  
  More capable models make fewer mistakes, but they're also better at finding unexpected paths to a goal, often by routing around restrictions nobody thought to write down.
  
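  Most people assume more capable AI models are safer because they understand instructions and restrictions better. The author points out, however, that although more capable models make fewer mistakes, they are also better at finding novel paths around restrictions that were never written down, which may introduce new security risks and challenges the common belief that 'more capable means safer'.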
  
  counterintuitive ai-capabilities security-risk
Visit annotations in context

Tags

ai-capabilities

security-risk

counterintuitive

Annotators

fxp007

URL

anthropic.com/engineering/how-we-contain-claude
mistral.ai

https://mistral.ai/news/vibe-agent

1
1. fxp007 29 May 2026
  
  in Public
  
  Vibe drafts the deliverable using the Canvas tool, from a one-page brief to a report, an RFP response, or a board deck
  
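  The article says Vibe can produce documents ranging from a one-page brief to a board deck, but it gives no concrete data on generation speed, quality assessment, or user satisfaction. Tools of this kind usually need quantitative metrics to evaluate, such as the accuracy of generated documents, user adoption rate, or time saved. Without that data, it is hard to judge Vibe's actual value proposition for document generation.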
  
  data-point ai-capabilities quantification-missing
Visit annotations in context

Tags

quantification-missing

ai-capabilities

data-point

Annotators

fxp007

URL

mistral.ai/news/vibe-agent
openai.com

https://openai.com/index/model-disproves-discrete-geometry-conjecture/

2
1. fxp007 22 May 2026
  
  in Public
  
  The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular.
  
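  Most people assume solving specialized math problems requires an AI system trained specifically for mathematics, but the author reports that a general-purpose reasoning model solved a long-standing open geometry problem. This challenges the consensus that AI needs specialized models and suggests general-purpose AI may be more effective than purpose-trained systems.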
  
  non-consensus ai-capabilities counterintuitive
2. fxp007 21 May 2026
  
  in Public
  
  The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular.
  
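  Most people assume solving complex math problems requires a purpose-trained math system or an AI model customized for the specific problem. The author reports, however, that a general-purpose reasoning model resolved a central problem in discrete geometry, which challenges conventional views of how AI is applied in specialized fields and suggests general-purpose AI may produce more breakthroughs than dedicated systems.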
  
  counterintuitive ai-capabilities general-purpose-ai
Visit annotations in context

Tags

ai-capabilities

general-purpose-ai

counterintuitive

non-consensus

Annotators

fxp007

URL

openai.com/index/model-disproves-discrete-geometry-conjecture/
blog.k10s.dev

https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/

1
1. fxp007 19 May 2026
  
  in Public
  
  AI writes features, not architecture. The longer you let it drive without constraints, the worse the wreckage gets.
  
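  Most people assume AI can handle both feature implementation and architecture design, but the author argues AI is good only at feature development and lacks architectural awareness, so humans must set explicit design constraints to keep the system from turning into a mess.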
  
  non-consensus ai-capabilities software-design
Visit annotations in context

Tags

non-consensus

ai-capabilities

software-design

Annotators

fxp007

URL

blog.k10s.dev/im-going-back-to-writing-code-by-hand/
x.com

https://x.com/DimitrisPapail/status/2028669695344148946

1
1. fxp007 07 May 2026
  
  in Public
  
  The thing that impressed me the most about GPT-3 was this: I gave it a weird mix of matlab and python code with a few variables, a loop, some basic arithmetic. Nothing fancy and I knew this kind of thing was probably in the training data, but for shure not with these exact numbers and variables.
  
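  Most people assume large language models can only generate text or code snippets, but the author holds that GPT-3 could actually carry out simple computations even when the exact numbers and variables were not in the training data. This challenges the view that LLMs are merely pattern-matching tools and suggests they may have some degree of computational ability.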
  
  non-consensus ai-capabilities
Visit annotations in context

Tags

non-consensus

ai-capabilities

Annotators

fxp007

URL

x.com/DimitrisPapail/status/2028669695344148946
cruxevals.com

https://cruxevals.com/

2
1. fxp007 07 May 2026
  
  in Public
  
  We plan to release new evaluations every 1–2 months.
  
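  This release cadence shows that the CRUX project plans a regular evaluation cycle. A new evaluation every one to two months is frequent enough to capture rapid changes in AI capabilities without being so frequent that evaluation quality suffers. It is much faster than the update cycle of traditional AI benchmarks and reflects how quickly AI technology is iterating.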
  
  data-point evaluation-frequency ai-capabilities
2. fxp007 07 May 2026
  
  in Public
  
  GUI bottleneck (Gemini spent weeks unable to list a product due to misclicking)
  
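  Most people assume advanced AI models handle graphical user interface (GUI) tasks as well as humans or better, but the author presents evidence to the contrary: even an advanced model such as Gemini can be stuck on a basic task for weeks because of simple misclicks. This challenges our understanding of what AI can actually do and exposes serious limitations in its ability to interact with interfaces.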
  
  non-consensus gui-interaction ai-capabilities
Visit annotations in context

Tags

non-consensus

gui-interaction

ai-capabilities

evaluation-frequency

data-point

Annotators

fxp007

URL

cruxevals.com/
epoch.ai

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job

1
1. fxp007 07 May 2026
  
  in Public
  
  By the end of the year, we expect AI to be able to do tasks roughly one day long with a 50% success rate. In comparison, I'd guess that this task would take several days for a person familiar with the paper and is able to play around with the web interface.
  
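  The author cites METR's time-horizon projection that by the end of 2026 AI will complete day-long tasks with roughly a 50% success rate. This data point gives a quantitative basis for forecasting AI capabilities, but it also shows the time gap between AI and humans on complex tasks and suggests that AI still has substantial room to improve in some areas.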
  
  data-point time-horizon ai-capabilities
Visit annotations in context

Tags

time-horizon

ai-capabilities

data-point

Annotators

fxp007

URL

epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
openai.com

https://openai.com/index/open-source-codex-orchestration-symphony/

1
1. fxp007 01 May 2026
  
  in Public
  
  Our early versions of agentic work was only asking Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it.
  
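  Most people assume AI can only carry out simple, single tasks, but the author reports that AI can already handle complex, multi-step workflows, including creating multiple PRs and responding to code review. This challenges conventional views of AI capability and shows that AI has evolved to understand and carry out complex software engineering tasks.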
  
  non-consensus ai-capabilities software-engineering counterintuitive
Visit annotations in context

Tags

counterintuitive

non-consensus

ai-capabilities

software-engineering

Annotators

fxp007

URL

openai.com/index/open-source-codex-orchestration-symphony/
Apr 2026
www.anthropic.com

https://www.anthropic.com/news/claude-design-anthropic-labs

1
1. fxp007 26 Apr 2026
  
  in Public
  
  Our most complex pages, which took 20+ prompts to recreate in other tools, only required 2 prompts in Claude Design.
  
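  Most people assume complex design tasks require more prompts and more human intervention, but the author claims their AI tool completes more complex designs with fewer prompts. This challenges the common assumption about the relationship between design complexity and the amount of input an AI design tool needs, and suggests AI may handle complexity better than humans in some respects.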
  
  non-consensus ai-capabilities design-efficiency
Visit annotations in context

Tags

non-consensus

ai-capabilities

design-efficiency

Annotators

fxp007

URL

anthropic.com/news/claude-design-anthropic-labs
openai.com

https://openai.com/index/introducing-gpt-5-5/

2
1. fxp007 26 Apr 2026
  
  in Public
  
  We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn't reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT‑5.4.
  
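  Most people assume AI's role in cybersecurity is mostly limited to assisting defense rather than taking part directly in core security tasks. The author indicates, however, that GPT-5.5 has 'High' cybersecurity capability, a classification suggesting AI is shifting from a passive defensive tool to an active security participant and challenging the assumption of human dominance in the field.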
  
  non-consensus cybersecurity ai-capabilities
2. fxp007 24 Apr 2026
  
  in Public
  
  The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time.
  
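  Most people assume AI progress mainly means better performance on specific tasks, but the author holds that GPT-5.5's real breakthrough lies in reasoning across context and acting over time, which challenges conventional views of how AI develops. This gain in 'agentic capability' matters more than simple task completion because it marks a shift toward the way humans work.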
  
  non-consensus ai-capabilities counterintuitive
Visit annotations in context

Tags

ai-capabilities

cybersecurity

counterintuitive

non-consensus

Annotators

fxp007

URL

openai.com/index/introducing-gpt-5-5/
www.anthropic.com

Introducing Claude Opus 4.7

2
1. fxp007 26 Apr 2026
  
  in Public
  
  Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them to move on to new tasks that, as a result, need less up-front context.
  
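  Most people assume AI models gradually 'forget' earlier information in long conversations and need context repeated constantly. The author claims, however, that Claude Opus 4.7 can remember important information across sessions, which challenges assumptions about AI's short-term memory limits. Such persistent memory means AI can genuinely take on long-term projects without the user repeatedly supplying background.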
  
  non-consensus memory ai-capabilities
2. fxp007 17 Apr 2026
  
  in Public
  
  Opus 4.7 handles complex, long-running tasks with rigor and consistency, pays precise attention to instructions, and devises ways to verify its own outputs before reporting back.
  
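  This shows marked progress by Claude Opus 4.7 in self-verification and in carrying out complex tasks, and marks an important step for AI models from simple responses toward genuinely autonomous work. The self-verification mechanism substantially improves the reliability of the AI's output.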
  
  ai-capabilities self-verification
Visit annotations in context

Tags

memory

ai-capabilities

non-consensus

self-verification

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-7
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

3
1. fxp007 26 Apr 2026
  
  in Public
  
  Tasks where correctness is harder to verify may not have seen the same speedup, so the acceleration we document here may not be as general as the headline numbers suggest.
  
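  Mainstream media and the public may assume AI capabilities are accelerating in every domain, but the author explicitly notes that tasks where correctness is hard to verify may not have seen the same acceleration. This challenges the assumption that AI progress is uniform across domains.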
  
  non-consensus ai-capabilities verification-challenges
2. fxp007 24 Apr 2026
  
  in Public
  
  Three of four metrics show strong evidence of acceleration, driven by reasoning models.
  
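  This is a key data point: 75% of the AI capability metrics (three of four) show acceleration. That proportion is fairly high and suggests the acceleration is probably not a coincidence. However, the figure rests on four specific metrics and may not represent all areas of AI capability. More metrics are needed to confirm how general the conclusion is.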
  
  data-point statistics ai-capabilities
3. fxp007 24 Apr 2026
  
  in Public
  
  Three of four metrics show strong evidence of acceleration, driven by reasoning models.
  
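  This data point shows that 75% of the AI capability metrics show acceleration, a fairly high proportion. However, the article also notes that the fourth metric (WeirdML V2) shows no acceleration, which suggests acceleration may not be universal across AI capability areas. The proportion should be read with caution because it rests on only four metrics, concentrated mainly in math and programming.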
  
  data-point statistics ai-capabilities
Visit annotations in context

Tags

ai-capabilities

statistics

non-consensus

verification-challenges

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
www.anthropic.com

https://www.anthropic.com/news/election-safeguards-update

1
1. fxp007 24 Apr 2026
  
  in Public
  
  Without our safeguards in place (which we do to measure a model's raw capabilities), only Mythos Preview and Opus 4.7 completed more than half the tasks.
  
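  Most people assume advanced AI models will autonomously carry out complex tasks when no safeguards are in place, but the author implies that even the most advanced models struggle to complete most tasks without human guidance. This challenges common views of AI autonomy and capability and suggests AI may depend on human oversight more than people think.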
  
  non-consensus ai-capabilities safeguards
Visit annotations in context

Tags

safeguards

non-consensus

ai-capabilities

Annotators

fxp007

URL

anthropic.com/news/election-safeguards-update
blog.vidocsecurity.com

We Reproduced Anthropic's Mythos Findings With Public Models

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The real challenge is validating outputs, prioritizing what matters, and operationalizing them.
  
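  This is a counterintuitive conclusion: the frontier of AI security research has shifted from the model itself to how to use the model's capabilities effectively. Most security teams still focus on getting access to the most powerful model, when the real bottleneck is validating findings, prioritizing them, and turning them into actionable fixes. This challenges the conventional idea that 'a better model means better security'.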
  
  counter-intuitive security-workflow ai-capabilities
Visit annotations in context

Tags

security-workflow

ai-capabilities

counter-intuitive

Annotators

fxp007

URL

blog.vidocsecurity.com/blog/we-reproduced-anthropics-mythos-findings-with-public-models
simonwillison.net

https://simonwillison.net/2026/Apr/18/opus-system-prompt/

1
1. fxp007 24 Apr 2026
  
  in Public
  
  Claude calls tool_search to check whether a relevant tool is available but deferred
  
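  Claude now has a built-in 'tool search' mechanism that proactively checks whether a tool is available before claiming it lacks a capability. This design challenges the traditional dichotomy in which an AI model either 'knows everything or knows nothing' and creates an intermediate state of 'deferred knowledge acquisition', a counterintuitive trait that developers might mistake for a model defect.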
  
  tool-search ai-capabilities counter-intuitive
Visit annotations in context

Tags

ai-capabilities

tool-search

counter-intuitive

Annotators

fxp007

URL

simonwillison.net/2026/Apr/18/opus-system-prompt/
epoch.ai

https://epoch.ai/blog/mirrorcode-preliminary-results

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks.
  
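  This is a striking finding: AI can already complete complex programming tasks that would normally take a human engineer weeks. It not only challenges our understanding of current AI capabilities but also suggests a major shift may be coming in software engineering. Autonomous coding at this level far exceeds what today's mainstream AI coding assistants deliver.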
  
  ai-capabilities software-engineering autonomous-coding
Visit annotations in context

Tags

autonomous-coding

ai-capabilities

software-engineering

Annotators

fxp007

URL

epoch.ai/blog/mirrorcode-preliminary-results
cal.com

https://cal.com/blog/cal-com-goes-closed-source-why

1
1. fxp007 17 Apr 2026
  
  in Public
  
  AI uncovered a 27-year-old vulnerability in the BSD kernel, one of the most widely used and security-focused open source projects, and generated working exploits in a matter of hours.
  
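  This fact is startling and shows how capable AI has become at finding vulnerabilities. Even in a security-focused project that has been reviewed for decades, AI found a flaw and generated exploit code within hours, which suggests traditional security review methods can no longer cope with AI-driven threats and that new defensive strategies are needed.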
  
  ai-capabilities security-threat
Visit annotations in context

Tags

ai-capabilities

security-threat

Annotators

fxp007

URL

cal.com/blog/cal-com-goes-closed-source-why
x.com

https://x.com/teortaxesTex/status/2042017378054086973

1
1. fxp007 16 Apr 2026
  
  in Public
  
  would have succeeded if it had vision and agentic loop
  
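  Surprisingly, the author implies that GLM-5.1's failure may stem from its lack of vision and an agentic loop, which points to a key bottleneck in current AI development: multimodal integration and autonomous decision-making may be the key to future AI breakthroughs.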
  
  surprising ai-capabilities vision-agency
Visit annotations in context

Tags

surprising

vision-agency

ai-capabilities

Annotators

fxp007

URL

x.com/teortaxesTex/status/2042017378054086973
x.com

https://x.com/cerebras/status/2042015763201221032

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Add contacts, live search, full pipeline dashboard – all unit tests passed.
  
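  Surprisingly, the AI-generated code is not only feature-complete, including contact management, live search, and a full pipeline dashboard, but also passed all unit tests, which suggests AI can code quickly while maintaining code quality.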
  
  surprising code-quality ai-capabilities
Visit annotations in context

Tags

surprising

ai-capabilities

code-quality

Annotators

fxp007

URL

x.com/cerebras/status/2042015763201221032
x.com

https://x.com/billtheinvestor/status/2043706042828394747

1
1. fxp007 16 Apr 2026
  
  in Public
  
  One Agent can now: open X (Twitter), scroll the feed, extract tweets, return clean JSON. No plugins. No extensions. No orchestration.
  
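  Surprisingly, a single AI agent can now independently complete a complex social media data extraction task without any plugins, extensions, or orchestration. This shows striking progress in AI's ability to operate autonomously and could transform data collection and automation workflows.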
  
  surprising ai-capabilities automation fun-fact
Visit annotations in context

Tags

fun-fact

surprising

ai-capabilities

automation

Annotators

fxp007

URL

x.com/billtheinvestor/status/2043706042828394747
x.com

https://x.com/berryxia/status/2042375176193794436

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Open-ended tasks such as everyday chat and writing, by contrast, have not improved as noticeably.
  
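  Surprisingly, although we generally assume AI is advancing rapidly on creative and open-ended tasks, its progress is actually more pronounced in areas with clearly verifiable rewards, such as programming and math. This explains why technical experts and ordinary users perceive AI capabilities so differently.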
  
  surprising ai-capabilities
Visit annotations in context

Tags

surprising

ai-capabilities

Annotators

fxp007

URL

x.com/berryxia/status/2042375176193794436
z.ai

https://z.ai/blog/glm-5.1

1
1. fxp007 16 Apr 2026
  
  in Public
  
  GLM-5.1 did not plateau after 50 or 100 submissions, but continued to find meaningful improvements over 600+ iterations with 6,000+ tool calls, ultimately reaching 21.5k QPS—roughly 6× the best result achieved in a single 50-turn session.
  
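  Surprisingly, GLM-5.1 kept improving over more than 600 iterations on a vector database optimization task, reaching roughly 6× the best result from a single 50-turn session, which breaks the tendency of earlier models to plateau quickly. Sustained optimization over such a long run is extremely rare among AI models and shows breakthrough progress in long-horizon task handling.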
  
  surprising long-horizon-optimization ai-capabilities
Visit annotations in context

Tags

long-horizon-optimization

surprising

ai-capabilities

Annotators

fxp007

URL

z.ai/blog/glm-5.1
www.minimax.io

https://www.minimax.io/models/text/m27

1
1. fxp007 16 Apr 2026
  
  in Public
  
  M2.7 demonstrates excellent performance in real-world software engineering, including end-to-end project delivery, log analysis for bug hunting, code security, and machine learning tasks.
  
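  Surprisingly, MiniMax M2.7 handles not only routine programming tasks but also complex software engineering work such as end-to-end project delivery, log analysis, and code security checks. This suggests AI can now cover the full software development process, from coding to security auditing, and upends the assumption that AI can only assist with programming.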
  
  surprising ai-capabilities
Visit annotations in context

Tags

surprising

ai-capabilities

Annotators

fxp007

URL

minimax.io/models/text/m27
www.microsoft.com

https://www.microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.
  
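  Surprisingly, the same AI model can score above 90% on lower-demand tests and below 15% on more demanding ones, which reflects differences in task requirements rather than a change in the model's capability. This finding challenges the common assumption that AI capability is stable and shows how strongly task difficulty affects AI performance.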
  
  surprising ai-capabilities task-difficulty
Visit annotations in context

Tags

surprising

ai-capabilities

task-difficulty

Annotators

fxp007

URL

microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
openai.com

https://openai.com/index/the-next-evolution-of-the-agents-sdk/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  For example, developers can give an agent a controlled workspace, explicit instructions, and the tools it needs to inspect evidence:
  
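  Surprisingly, OpenAI's Agents SDK now lets developers create a fully controlled workspace in which an AI agent can inspect files, run commands, and edit code. This means AI systems can interact more deeply with computer systems and automate more complex tasks, which is far more powerful than most people imagine AI to be.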
  
  surprising ai-capabilities development-tools
Visit annotations in context

Tags

surprising

ai-capabilities

development-tools

Annotators

fxp007

URL

openai.com/index/the-next-evolution-of-the-agents-sdk/
news.smol.ai

https://news.smol.ai/issues/26-04-08-not-much

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Claude Mythos autonomously identified and exploited several significant vulnerabilities. Notably, it discovered a 27-year-old vulnerability in OpenBSD
  
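  Surprisingly, Claude Mythos was able to autonomously discover and exploit a 27-year-old vulnerability in OpenBSD. This shows that AI models' cybersecurity capabilities have reached a remarkable level, finding flaws that human experts and security systems missed for years. It raises serious questions about AI safety and control mechanisms.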
  
  surprising cybersecurity ai-capabilities
Visit annotations in context

Tags

surprising

ai-capabilities

cybersecurity

Annotators

fxp007

URL

news.smol.ai/issues/26-04-08-not-much
chatgpt.com

https://chatgpt.com/apps/spreadsheets/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Some advanced Excel capabilities aren't supported yet, including Office Scripts, Power Query, and Pivot/Data Model, data validation, and the named ranges manager, slicers, timelines, external connection administration, advanced charting breadth, and macro/Visual Basic for Applications (VBA) automation.
  
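  Surprisingly, although ChatGPT for Excel claims to handle complex spreadsheet tasks, it does not actually support many advanced Excel features, such as VBA macros and Power Query. This suggests the tool is currently better suited to basic and intermediate spreadsheet work than to highly specialized Excel workflows.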
  
  surprising excel-limitations ai-capabilities
Visit annotations in context

Tags

surprising

excel-limitations

ai-capabilities

Annotators

fxp007

URL

chatgpt.com/apps/spreadsheets/
arxiv.org

https://arxiv.org/abs/2604.03016

1
1. fxp007 08 Apr 2026
  
  in Public
  
  Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks
  
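  Most people assume today's most advanced multimodal large models are near or beyond human performance on complex tasks. The author's data show, however, that even the best model falls well short of expectations on complex real-world tasks, with accuracy dropping from 56.3% overall to 23.0% on Level-3 tasks. This finding challenges optimistic assessments of current capabilities in the AI field and reveals how difficult real-world multimodal agent tasks are.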
  
  counterintuitive performance-gap ai-capabilities
Visit annotations in context

Tags

ai-capabilities

performance-gap

counterintuitive

Annotators

fxp007

URL

arxiv.org/abs/2604.03016
bramcohen.com

https://bramcohen.com/p/the-cult-of-vibe-coding-is-insane

1
1. fxp007 08 Apr 2026
  
  in Public
  
  The AI is actually very good at this, especially if you have a conversation with it beforehand. That's what Ask mode is for.
  
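  The mainstream view is that AI tools are mainly suited to generating code or automating simple tasks, but the author holds that AI performs very well at code review and architecture discussion, provided you have a thorough conversation with it first. This challenges conventional views of AI capability and suggests AI can be an equal partner in architecture discussions rather than just a code generator.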
  
  non-consensus ai-capabilities counterintuitive
Visit annotations in context

Tags

non-consensus

ai-capabilities

counterintuitive

Annotators

fxp007

URL

bramcohen.com/p/the-cult-of-vibe-coding-is-insane
epoch.ai

https://epoch.ai/blog/introducing-the-ai-chip-owners-explorer

1
1. fxp007 08 Apr 2026
  
  in Public
  
  We estimate that as of the end of 2025, Chinese companies collectively own just over 5% of the cumulative computing power of the leading AI chips sold in recent years
  
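  Given the rapid growth of China's AI industry and heavy government investment in AI, most people would probably assume China holds a larger share of global AI compute, but the author estimates that Chinese companies own only about 5% of it. This figure is far below expectations and challenges common perceptions of China's AI strength.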
  
  non-consensus china-ai-capabilities compute-gap
Visit annotations in context

Tags

non-consensus

compute-gap

china-ai-capabilities

Annotators

fxp007

URL

epoch.ai/blog/introducing-the-ai-chip-owners-explorer
blog.google

Gemma 4: Byte for byte, the most capable open models

1
1. fxp007 08 Apr 2026
  
  in Public
  
  The edge models feature a 128K context window, while the larger models offer up to 256K
  
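  Most people assume AI models on edge and mobile devices are limited, especially in handling long contexts. The author claims, however, that Gemma 4 offers a 128K context window even on mobile devices, which challenges the common belief that edge AI capabilities are limited.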
  
  non-consensus edge-ai-capabilities
Visit annotations in context

Tags

edge-ai-capabilities

non-consensus

Annotators

fxp007

URL

blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
Jun 2024
docdrop.org

Video: Ex-OpenAI Employee Just Revealed it ALL! (DocDrop)

1
1. stopresetgo 22 Jun 2024
  
  in Public
  
  be able to quick Master any domain write trillions lines of code and read every research paper in every scientific field ever written
  
  for - AI evolution - projections for capabilities by 2030
  
  AI evolution - projections for 2030 - AI will be able to do things we cannot even conceive of now because their cognitive capabilities are orders of magnitudes faster than our own - Write billions of lines of code - Absorb every scientific paper ever written and write new ones - Gain the equivalent of billions of human equivalent years of experience
  
  AI evolution - projections for capabilities by 2030
Visit annotations in context

Tags

AI evolution - projections for capabilities by 2030

Annotators

stopresetgo

URL

docdrop.org/video/om5KAKSSpNg/
