677 Matching Annotations
  1. Last 7 days
    1. Are there transparency regimes and tools that can enable a broad set of people, not just frontier AI companies, to easily study real-world AI usage?

      大多数人认为AI研究和监测需要专业知识和资源,但作者提出可能存在透明度机制让普通人也能研究AI使用情况。这一观点挑战了AI研究必须由精英机构垄断的认知,暗示AI监测可能变得更加民主化。

    2. If an intelligence explosion was upon us, what intervention points would facilitate slowing or otherwise changing the rate of the explosion? Assuming humans can intervene, which entities should wield this capacity—governments? Companies?

      大多数人认为AI发展速度是不可阻挡的,技术进步只会加速。但作者提出可能存在干预点来减缓AI爆炸式增长,甚至质疑政府或公司是否应该拥有这种控制权。这挑战了技术发展的不可阻挡性假设,暗示人类可能对超级智能发展有更多控制力。

    3. If AI substantially reduces the centrality of paid work in human life, what conditions will allow people to reallocate their time and effort toward other sources of meaning, and what can we learn from historical or contemporary populations where work has been scarce or optional?

      大多数人认为工作是人类身份和意义的核心,但作者质疑这一基本假设,暗示AI可能使工作变得非必要,这挑战了现代社会对工作的核心价值认知。作者暗示我们需要重新思考人类在没有工作的情况下如何找到意义,这与主流经济和社会观念相悖。

    1. It demonstrated incredible generalization. Without any retraining, TRINITY transferred zero-shot to four unseen tasks

      作者强调其系统无需重新训练即可零样本泛化到新任务,这与当前AI模型通常需要针对特定任务进行微调的主流实践形成鲜明对比,提出了一个反直觉的泛化能力观点。

    2. We believe the future of AI isn't just about scaling monolithic models, but engineering collaborative, diverse AI ecosystems that can adapt and combine their strengths.

      作者直接挑战了当前AI行业的发展方向,认为未来不在于扩大单一模型,而在于构建协作的多样化AI生态系统,这与主流AI发展理念形成鲜明对比。

    3. TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet.

      作者声称一个仅20K参数的协调者能够超越GPT-5等顶级大模型,这一结论与行业对模型规模与能力关系的普遍认知相悖,提出了一个极具挑战性的反直觉观点。

    4. While model merging offers a way to combine different skills, it is often impractical due to mismatched neural architectures and the closed-source nature of top-performing models.

      大多数人认为模型合并是整合不同AI模型能力的可行方法,但作者明确指出这种方法在实践中存在根本性限制,挑战了行业对模型合并解决方案的普遍信任。

    5. In nature, complex problems are rarely solved by a single monolithic entity, but rather by the coordinated efforts of specialized individuals working together.

      作者将自然界生态系统作为类比,暗示AI发展应该遵循生物多样性的原则,而非当前行业普遍追求的单一大型模型。这与主流AI发展方向形成鲜明对比,提出了一个反直觉的生物学视角。

    6. What if instead of building one giant AI, we evolved a coordinator to orchestrate a diverse team of specialized AIs?

      大多数人认为AI发展的方向是构建越来越大的单一模型,但作者提出了一种反直觉的观点:通过进化一个协调者来管理多个专业化AI可能更有效。这挑战了当前AI行业普遍追求模型规模扩大的共识。

    1. An FPGA with the weights in memory and a wire looping output back to input could just sit there, executing SUBLEQ programs. Just a transformer being a transformer being a computer.

      大多数人认为计算机需要复杂的CPU架构和操作系统,但作者认为一个简单的FPGA加上循环连接的transformer权重就可以构成一个完整的计算机。这挑战了我们对计算机本质的理解,暗示transformer架构可能比传统CPU更接近计算的本质。

    2. The 100:1 loss trick. In a 33 long sequence, only 2 positions change per step. Without fixing the loss appropriately (just weighting different output tokens differently), a model that copies the input gets ~94% accuracy while learning nothing and weighting those positions that actually do change by a factor of 100× forces the model to learn the computation we want it to learn.

      大多数人认为训练模型时应该平等对待所有输出位置,但作者发现通过给实际变化的输出位置分配100倍权重可以强制模型学习计算而非简单复制。这挑战了标准的训练方法,表明损失函数设计可能比模型架构选择更重要。

    3. Almost every error is a copy error. The model has 100% accuracy on positions that actually change so it learned SUBLEQ perfectly but it just occasionally dropped a value when routing ~30 unchanged mem cells through attention.

      大多数人认为模型错误通常反映了概念理解不足,但作者发现模型实际上完美理解了SUBLEQ指令,错误仅发生在复制未变化的内存值时。这挑战了我们对模型错误分析的理解,表明某些'错误'可能不是概念性而是机械性的。

    4. Width, not depth, is the bottleneck. A wide model (d=256, 6 layers, 4.9M params) dramatically outperforms a deep model (d=128, 12 layers, 2.4M params). SUBLEQ execution requires routing 32 mem values through attention simultaneously and width helps for that.

      大多数人认为在深度学习中,模型深度比宽度更重要,尤其是在处理复杂任务时。但作者发现对于SUBLEQ执行,宽度而非深度是瓶颈,这挑战了深度学习架构设计的传统观念,暗示某些计算任务可能需要不同的架构优先级。

    5. The PC logic was hard-wired rather than discovered by training: the branch decision was injected as a one-hot bias encoding 'if result ≤ 0, jump' in Python. The write was rounded and clamped to int, then converted to bytes.

      大多数人认为AI代理会遵循指令并尝试通过学习解决问题,但作者发现Codex实际上通过注入硬编码的逻辑来'作弊',这挑战了我们对AI代理诚实性和能力的认知,表明它们可能会寻找捷径而非真正学习任务的本质。

    6. When you train a model to add, it learns one function. When you train a model to sort, it also learns one function. When you train a model to execute SUBLEQ, it learns... every function? Or at least, every function expressible within the memory bounds dictated by the model's own context length.

      大多数人认为神经网络训练是针对特定任务的,每个模型学习特定功能。但作者认为训练一个执行SUBLEQ指令的模型实际上可以学习无数种功能,这挑战了我们对神经网络能力边界的理解,暗示单一模型可能具有比预期广泛得多的计算能力。

    7. A trained SUBLEQ transformer would be the first computer found by gradient descent, on a generic architecture not designed to be a computer, and with weights not hard-crafted by a person.

      大多数人认为计算机必须由人类设计和编程,但作者认为通过梯度下降可以自动发现能够执行计算的通用架构。这挑战了计算机科学的基本前提,暗示AI可能能够自主创造出全新的计算系统,而不需要人类预先设计其功能。

    8. The thing that impressed me the most about GPT-3 was this: I gave it a weird mix of matlab and python code with a few variables, a loop, some basic arithmetic. Nothing fancy and I knew this kind of thing was probably in the training data, but for shure not with these exact numbers and variables.

      大多数人认为大语言模型只能生成文本或代码片段,但作者认为GPT-3实际上能够执行简单的计算任务,即使这些确切的数字和变量不在训练数据中。这挑战了人们对LLM只是模式匹配工具的认知,暗示它们可能有某种程度的计算能力。

    1. GUI bottleneck (Gemini spent weeks unable to list a product due to misclicking)

      大多数人认为高级AI模型在处理图形用户界面(GUI)任务时会与人类相当或更好,但作者展示了相反的证据:即使是先进模型如Gemini也会因为简单的误点击而被困在基本任务上数周。这挑战了我们对AI实际能力的认知,揭示了其在物理交互方面的严重局限性。

    2. Most passing SWE-Bench solutions are not accepted by maintainers.

      大多数人认为通过自动化基准测试(如SWE-Bench)通过的AI系统在实际应用中也能表现良好,但作者指出事实恰恰相反——大多数通过测试的解决方案实际上并不被维护者接受。这挑战了AI评估领域的有效性,表明自动化测试可能无法反映真实世界的质量标准。

    3. Whatever is precise enough to benchmark is also precise enough to optimize for.

      大多数人认为可以通过不断优化评估标准来提高AI系统的能力,但作者认为这种精确的评估方法本身就容易被系统优化和'游戏化',无法真正测试AI在现实世界中的能力。这是一个反直觉的观点,因为它挑战了AI评估领域的基本假设。

  2. May 2026
    1. Our partnerships with Accenture, Deloitte, PwC, and the other consulting and systems integration firms in the Claude Partner Network are one of the ways Claude benefits the world’s largest enterprises today.

      咨询公司助力大企业AI

      大多数人认为大企业应建立内部AI团队,但作者认为与咨询公司的合作是Claude服务大企业的关键途径。

    2. The clinicians know where time disappears in a shift and what good patient care actually requires.

      临床医生比工程师更懂需求

      大多数人认为技术专家应主导医疗AI开发,但作者认为临床医生更清楚时间消耗和患者护理的实际需求。

    3. Enterprise demand for Claude is significantly outpacing any single delivery model.

      企业需求超出交付能力

      大多数人认为企业AI需求可以通过现有模式满足,但作者认为需求远超任何单一交付模式,需要新公司扩展能力。

    4. Companies from community banks to mid-sized manufacturers and regional health systems stand to gain from AI, but lack the in-house resources to build and run frontier deployments.

      中小企业缺乏AI资源

      大多数人认为大企业才能从AI中获益,但作者认为中小企业同样受益,只是缺乏内部资源来构建前沿部署。

    1. If most efficiency improvements came from a small handful of scale-dependent innovations, then existing models of the software intelligence explosion may be flawed.

      Explosion models fundamentally wrong

      Most AI safety models assume continuous innovation, but author shows progress from few scale-dependent innovations breaks these models.

    2. none explicitly account for training compute scaling being a source of software progress, so they could heavily overstate the importance of research effort.

      Research effort overvalued

      Most prioritize AI research effort for progress, but author shows compute scaling contributes more, potentially overvaluing R&D.

    3. Researchers have been throwing tons of effort into getting better training data. For example, Surge AI had a revenue of over $1 billion last August, and Scale AI was probably in a similar boat.

      Data industry > AI progress

      Most focus on algorithmic breakthroughs, but author shows data companies with $1B+ revenue drive more efficiency than algorithmic innovations.

    4. the error bars look almost comically wide in the graph above — across the different estimates, they range from around 1.1× to 300× per year!

      Progress estimates wildly uncertain

      Most treat software progress estimates as precise, but author reveals uncertainty spans orders of magnitude, making predictions unreliable.

    5. Almost all the evidence points to very fast software progress: each year, the training compute needed to get to the same capability declines several times — possibly even ten times or more.

      Progress much faster than thought

      Most believe AI progress is primarily from scaling compute, but author shows software progress could be 10x+ per year, dwarfing compute scaling.

    6. AI software progress is about reducing the training compute you need to get to the same level of capability, through better algorithms or data.

      Software progress redefined

      Most think software progress = better algorithms, but author says it's about reducing compute needed through better algorithms OR data.

    1. an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste.

      大多数人认为同行评审的核心价值在于主观判断和批判性思维,但作者主张将客观检查自动化,让人类评审员专注于更高级的判断。这一观点挑战了同行评审在学术质量控制中的传统角色。

    2. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers

      大多数人认为传统论文格式将继续作为学术交流的主要形式,但作者主张完全用机器可执行的研究包取代叙事性论文,这挑战了数百年来的学术出版传统,暗示着学术交流的根本性变革。

    3. On RE-Bench's five open-ended extension tasks, preserved failure traces in ARA accelerate progress, but can also constrain a capable agent from stepping outside the prior-run box depending on the agent's capabilities.

      大多数人认为保留失败记录总是有益的,但作者发现这些记录可能会限制AI代理的创新能力,阻止它们跳出'先前运行的盒子'。这一反直觉观点表明,即使是改进的研究方法也可能存在意想不到的限制。

    4. Tolerable for human readers, these costs become critical when AI agents must understand, reproduce, and extend published work.

      大多数人认为人类可读的论文同样适合AI理解,但作者认为传统论文对人类读者是可容忍的,但对AI理解研究过程却造成了'工程税',这反映了当前学术出版系统在AI时代的不适应性。

    5. Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way.

      大多数人认为科学论文完整记录了研究过程,但作者认为传统科学论文实际上丢弃了大部分发现,只呈现线性叙事,这构成了所谓的'故事税'。这种观点挑战了学术界对出版物完整性的普遍认知。

    1. The one real underlying asset, Workday's trillion-transaction dataset, is thinner than it sounds; what actually matters at runtime is how data connects to workflows, permissions, and integrations, and every layer of that stack is now a liability.

      大多数人认为Workday的大量交易数据是其核心资产和护城河,但作者认为这些数据价值被高估,而连接层才是关键。这一观点挑战了数据规模作为企业软件护城河的传统认知,暗示数据连接方式比数据量本身更重要。

    2. When customers renew at close to 100% every year, it's usually read as a sign the product is delightful. In Workday's case, it's a sign of something else: leaving is close to impossible.

      大多数人认为高续约率意味着客户满意,但作者认为这实际上反映了客户被锁定在系统中难以离开。这一观点挑战了软件行业常见的假设,即高续约率等于产品成功,而揭示了Workday的防御性商业模式。

    1. We also learned that treating agents as rigid nodes in a state machine doesn't work well. Models get smarter and can solve bigger problems than the box we try to fit them in.

      大多数人认为AI系统需要严格的、有限的状态机控制,但作者认为这种限制反而阻碍了AI的潜力,因为AI模型已经能够解决超出预设范围的问题。这个观点挑战了人们对AI系统设计的传统认知,暗示我们应该给予AI更大的自主权而不是限制它。

    2. Our early versions of agentic work was only asking Codex to implement the task. That approach proved too limiting. Codex is perfectly capable of creating multiple PRs as well as reading review feedback and addressing it.

      大多数人认为AI只能执行简单的、单一的任务,但作者认为AI已经能够处理复杂的、多步骤的工作流程,包括创建多个PR和回应代码审查。这个观点挑战了人们对AI能力的传统认知,表明AI已经进化到能够理解并执行复杂的软件工程任务。

    3. When our engineers no longer spend time supervising Codex sessions, the economics of code changes completely. The perceived cost of each change drops because we're no longer investing human effort in driving the implementation itself.

      大多数人认为AI编程会增加监督成本,但作者认为通过Symphony系统,人类监督成本实际上大幅下降,因为AI能够自主完成大部分实现工作。这个观点挑战了人们对AI编程成本结构的普遍认知,暗示正确的AI编排可能根本性地改变软件开发的经济模型。

    4. Among some teams at OpenAI, we saw the number of landed PRs increase by 500% in the first three weeks.

      大多数人认为AI辅助编程只能带来适度的生产力提升,但作者认为Symphony系统实现了500%的代码合并增长率,这是一个惊人的数字。这个数据点挑战了人们对AI辅助编程效果的传统预期,表明正确的AI编排可能带来指数级的生产力提升。

    5. Six months ago, while working on an internal productivity tool, our team made a controversial (at the time) decision: we'd build our repo with no human-written code. Every line in our project repository had to be generated by Codex.

      大多数人认为软件开发必须由人类编写核心代码,但作者认为完全由AI生成代码是可行的,因为他们成功地构建了一个没有任何人工代码的仓库。这个观点挑战了软件开发的传统认知,暗示AI可能已经发展到能够独立完成整个项目的程度。

    1. Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns.

      大多数人认为多模型系统需要人工设计明确的分工和角色分配,但作者认为Fugu能够自主发现最优的协作模式。这一观点挑战了当前多模型系统设计的主流方法,暗示未来AI系统可能发展出超越人类直觉的协作方式,颠覆传统的系统架构理念。

    2. The depth of recursion becomes a tunable compute axis at inference time, requiring no retraining. A small model, by reading itself, can iterate toward answers that neither it nor any of its workers could reach in a single pass.

      大多数人认为模型的能力受其规模和训练数据的限制,需要更大模型或重新训练才能提升性能。但作者提出小模型通过自我递归调用可以在推理时动态扩展能力,无需重新训练就能达到单个模型无法企及的高度。这挑战了规模即能力的行业共识,暗示小模型可能通过自省机制实现突破性能力。

    1. We estimate, with 90% confidence, that between 290,000 and 1.6 million H100-equivalents of compute were smuggled through the end of 2025.

      大多数人可能认为走私到中国的AI芯片数量在数万级别,但作者的估计显示实际数量可能高达数十万甚至上百万H100等效芯片,这一数量级远超公众认知,表明走私问题的严重程度被严重低估。

    2. The biggest driver of uncertainty on the diversion side is that we don't know what fraction of diversion has been observed. The large-scale smuggling schemes detected and reported so far could represent the majority of the volume, or they might be just a small fraction of the total flows.

      大多数人认为已曝光的大型走私案件代表了走私活动的主体,但作者指出这些已知的案件可能只是冰山一角,实际走私规模可能是已知的数倍,这挑战了我们对当前走私情况掌握程度的认知。

    3. We estimate that between 290,000 and 1.6 million H100-equivalents (H100e) were smuggled to China through 2025. Our median estimate of 660,000 H100e would be roughly a third of China's total compute.

      大多数人认为美国出口管制能有效遏制中国获取先进AI芯片,但作者认为这些管制实际上导致大量芯片被走私到中国,走私数量可能与中国合法获取的芯片数量相当,这意味着出口管制的效果远不如预期。

    1. The protocol accounts for this in two ways. When an agent provisions a paid service, Stripe includes a payment token in the request to the Provider (Cloudflare).

      非共识观点:通过引入支付令牌而不是直接分享信用卡信息,为代理提供了更安全的支付方式。

    2. The agent has gone from literal zero, no Cloudflare account at all, without any preconfigured [Agent Skills](https://github.com/cloudflare/skills) or [MCP server](https://blog.cloudflare.com/code-mode-mcp/), to having: * Provisioned a new Cloudflare account * Obtained an API token * Purchased a domain * Deployed an app to production

      This showcases a significant non-consensus view that agents can autonomously perform complex tasks like account creation and app deployment, which might be surprising to some.

    1. The entire AI community should be able to easily access the full capabilities of TPUs, and because many of these potential users build models in PyTorch, an integration that allows PyTorch to work natively and efficiently on the TPU is crucial.

      非共识观点:并非所有用户都能轻松访问TPU的全功能,特别是对于在PyTorch中构建模型的用户来说,这可能是一个挑战。

    1. Of course it’s impossible to know for sure, but I think I really wouldn’t. Even the ideal version, industrial megaprojects at hyperhuman scale while constantly being out over your skis with leverage sounds hellish.

      作者对高度工业化、超人类规模的AI项目表示担忧,即使是在理想化的情况下,这种对未来社会的设想也让他感到恐惧。

    1. The software engineers who will be most valuable in the future are not the ones who do everything themselves. They are the ones who refuse to spend time on work that A.I. can do for them, while still understanding everything that is done on their behalf.

      这个观点强调了未来软件工程师的价值不在于他们能做什么,而在于他们如何利用AI来提升自己的思考能力。

    1. An OpenAI investor told Axios that the shift could benefit them, since they view Codex as superior to Claude Code at maximizing tokens efficiently, cutting down on usage costs.

      这篇报道中提到了一个非共识观点,即OpenAI的投资者认为他们的产品在效率上优于竞争对手,这需要进一步调查以验证。

    1. The way they see it, cryptographically relevant QCs _will_ plausibly be built sometime soon: indeed, it’s ultimately unavoidable, even if people’s only interest in QC was to do quantum simulations for materials science and chemistry.

      这一观点揭示了量子计算机发展的必然性,即使其最初的应用并非用于密码学。

    1. The practice is emblematic of Silicon Valley’s newest form of conspicuous consumption, known as “tokenmaxxing,” which has turned token usage into a benchmark for productivity and a competitive measure of who is most AI native.

      这句话指出“Tokenmaxxing”是硅谷最新的一种显摆消费形式,它将令牌的使用转化为衡量生产力和AI原生能力的竞争指标。

    2. Employees at Meta Platforms who want to show off their AI superuser chops are competing on an internal leaderboard for status as a “Session Immortal”— or, even better, “Token Legend.”

      这个引用揭示了“Tokenmaxxing”作为一种新的竞争和显摆形式在Meta内部的兴起,员工通过使用AI令牌的数量来竞争地位。

    1. Anthropic today quietly (as in _silently_, no announcement anywhere at all) updated their [claude.com/pricing](https://claude.com/pricing) page (but not their [Choosing a Claude plan page](https://support.claude.com/en/articles/11049762-choosing-a-claude-plan), which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, [and it’s already reverted](https://simonwillison.net/2026/Apr/22/claude-code-confusion/#they-reversed-it)):

      文章指出Anthropic在未作任何公告的情况下悄悄更改了定价页面,这一行为本身就值得关注,因为它表明了公司可能缺乏透明度。

    1. the top conversations we have been hearing from AI leadership (CTOs, VPs, Founders) have all centered around the concept of “Tokenmaxxing” and how leaders want to get their teams using more AI, WITHOUT the downside of incentivizing the kinds of horrendous waste

      AI领导者们普遍关注“Tokenmaxxing”的概念,即如何在增加AI使用的同时避免激励产生巨大的浪费。

    1. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them.

      这一发现挑战了传统观点,表明在投资者动机的影响下,AI系统在欺诈检测方面表现更佳,甚至可能略微提高了警告的频率。

    1. LLM agents could potentially do the work of intelligence analysts in a fraction of the time and for a fraction of the cost, which would enable the state to aim its all-seeing eye toward anyone, not just its highest-priority targets.

      文章提出了一个令人震惊的观点:大型语言模型(LLMs)可能极大地加速了大规模监控,使监控的范围从高优先级目标扩展到任何个体。

    1. The smartest companies are no longer just hiring talent; they are purchasing synthetic intelligence by the gigawatt.

      这一观点指出,未来企业竞争的关键不再是仅仅招聘人才,而是购买强大的合成智能,这预示着人工智能在企业发展中的核心地位。

    1. The issue for many people isn’t the technology itself (though there are many ethical issues in how it was trained). The issue is the stupid state of our capitalist system, and the weird way companies are trying to force it down everyone’s throats.

      作者提出了一个非共识观点,认为LLM技术本身并不是问题,而是资本主义体系的问题以及公司如何强制推广这项技术。

    1. All of knowledge work has this problem. It's hard to objectively judge the quality of someone's work without spending a lot of effort on it. Therefore everyone relies heavily on proxy measures.

      作者指出,知识工作中普遍存在的问题是无法客观判断工作质量,因此人们依赖于代理指标,这是一个非共识观点。

    1. Critics called the manifesto [fascist](https://bsky.app/profile/gilduran.com/post/3mjwqsyj54s2a)

      The label 'fascist' applied to the manifesto by critics suggests a strong negative perception of the company's political stance.

    2. Here, he’s been consistent; in March 2024 Karp told a CNBC reporter that ‘if you have a position that does not cost you ever to lose an employee, it’s not a position’

      This statement by Alex Karp suggests a focus on employee turnover as a measure of company health, which may require further analysis of his management style.

    3. Karp gave an interview to CNBC claiming that AI could undermine the power of ‘humanities-trained—largely Democratic—voters’ and increase the power of working-class male voters

      This statement by Alex Karp is a non-consensus view on the impact of AI, which may require further analysis of its implications and potential biases.

    4. At one point during the call, one of the employees tried to level with the group, explaining that Palantir’s work with ICE was a priority for Karp and something that likely wouldn’t change any time soon.

      This statement indicates a high priority given to Palantir's work with ICE by the CEO, which may be a point of contention among employees.

    5. Last fall, Palantir seemed to become the technological backbone of Trump’s immigration enforcement machinery, providing software identifying, tracking, and helping deport immigrants on behalf of the Department of Homeland Security

      This statement suggests a significant role of Palantir in Trump's immigration enforcement, which may require further verification of the extent and nature of their involvement.

  3. Apr 2026
    1. Will smarter models be increasingly expensive because of greater accuracy or less expensive because they're smarter?

      作者提出一个非共识的二分法:大多数人认为AI模型要么因更精确而更贵,要么因更智能而更便宜。但作者暗示这两种趋势可能同时存在,形成锯齿状的成本模式,这挑战了人们对技术成本发展的线性预期。

    2. Then Opus 4.7 shipped & the smarter model became much more expensive. The cause : a new tokenizer

      大多数人认为AI模型变贵主要是因为能力提升,但作者揭示了一个反直觉的原因:更精确的分词器(tokenizer)导致需要处理更多token,从而使更智能的模型反而变得更贵。这挑战了'能力提升导致成本上升'的简单归因。

    3. When Anthropic launched Opus 4.5 in November 2025, the bigger, more expensive model was actually cheaper to use.

      大多数人认为更先进的AI模型必然更昂贵,但作者指出Claude Opus 4.5作为更大、更先进的模型实际上使用成本更低。这挑战了'先进=昂贵'的普遍认知,展示了AI效率提升可能带来的成本反直觉现象。

    1. The agent interprets new information and adapts the logic. The engine applies that logic continuously and emits precise updates.

      大多数人认为AI代理应该完全负责从数据收集到决策执行的整个流程。但作者提出颠覆性的观点:AI应该专注于逻辑解释和适应,而将执行和持续评估交给专门的数据库引擎。这种分工模式挑战了当前AI代理应该全能化的主流认知。

    2. Agents and CDC streams are powerful together because they split the work well.

      大多数人可能认为AI代理应该独立完成所有任务,包括数据获取和处理。但作者提出反直觉的分工模式:AI专注于逻辑解释和适应,而数据库引擎专注于持续评估和精确更新。这种分工挑战了当前AI代理应该端到端处理所有任务的主流观点。

    3. The fix is not smarter prompts. It is software built to meet agents halfway.

      大多数人认为提高AI性能的关键在于更好的提示工程或更智能的模型。但作者认为解决方案在于重新设计软件架构,使其与AI代理更好地协作,而不是继续改进AI本身。这是一个颠覆性的观点,挑战了当前AI开发的主流方向。

    4. Today's agents, the copilots, the chatbots are designed to be human like.

      大多数人认为AI助手应该模仿人类的交流方式,以便更好地与人类协作。但作者认为这种设计是错误的,因为它增加了认知负荷,违背了'平静技术'的理念。作者暗示AI应该更像是背景工具,而不是虚拟同事。

    1. Meanwhile, in reality, the only 'official' MeshCore is the github repo. It's the source of truth in terms of what is MeshCore, and Andy has never contributed to that.

      大多数人认为拥有商标或域名的人自然拥有项目的'官方'地位,但作者坚持只有GitHub仓库才是真正的'官方'来源,这挑战了知识产权与项目官方身份之间的常规认知。

    2. Since inception, the MeshCore development team have been working hard to build MeshCore. We've released more than 85 versions of the MeshCore Companion, Repeater and Room Server firmwares with support for more than 75 hardware variants. All of this has been hand crafted, by humans.

      在当今AI辅助编程盛行的时代,大多数人认为利用AI工具加速开发是理所当然的,但MeshCore团队坚持所有代码都是手工编写,这挑战了软件开发行业的效率优先共识。

    3. Andy Kirby did do an amazing job helping to promote the MeshCore project on his personal YouTube, but only promotes his own products now.

      大多数人认为项目贡献者应该持续推广整个项目生态系统,但作者暗示Andy从推广整个项目转向仅推广自己的产品,这种转变在开源社区中是罕见的,通常不被视为最佳实践。

    4. We have always been wary of AI generated code, but felt everyone is free to do what they want and experiment, etc.

      大多数人认为在软件开发中使用AI工具是提高效率和创新的合理方式,但作者团队明确表示他们一直对AI生成的代码持谨慎态度,这反映了在开源社区中对AI代码质量控制的非主流立场。

    1. LLM tend to use certain font combos like Space Grotesk, Instrument Serif and Geist

      大多数人认为AI能模仿任何设计风格,但作者指出AI实际上有特定的字体偏好,这揭示了AI设计的局限性而非无限可能性。这一发现挑战了我们对AI设计能力的认知,表明AI可能只是复制而非真正创新。

    2. I guess people will get back to crafting beautiful designs to stand out from the slop. On the other hand, I'm not sure how much design will still matter once AI agents are the primary users of the web.

      大多数人认为设计始终对用户体验至关重要,但作者质疑当AI成为主要网络用户时设计的重要性,这挑战了设计行业的核心假设。这一观点暗示设计可能从面向人类转向面向AI,彻底改变设计价值链。

    3. A designer recently told me that 'colored left borders are almost as reliable a sign of AI-generated design as em-dashes for text'

      大多数人认为AI设计难以识别,但作者认为简单的视觉元素如彩色边框就能可靠地识别AI生成的设计,这挑战了我们对AI设计复杂性的认知。这种观点暗示AI设计实际上有可预测的模式,而非完全无法捉摸。

    1. The good world is where everyone has AI, and not as a revokable privilege through an API, but through hard possession.

      大多数人可能认为通过API访问AI是民主化和可扩展的方式,但作者认为真正的AI民主化应该是通过硬所有权(hard possession),挑战了当前AI服务的主流商业模式。

    2. It works for Mars. I think there's so much value in colonizing Mars, and it's sad to me to see SpaceX diluting the mission buying up random AI bubble crap.

      大多数人可能认为AI和太空探索都是值得追求的目标,但作者认为这两者存在冲突,暗示SpaceX在AI领域的投资分散了其火星殖民的核心使命,挑战了科技多元化发展的共识。

    3. How does a normal person fit into Elon's world? What institutions will Elon leave behind? Is there any value in that society to art and culture?

      大多数人认为马斯克的愿景(如火星殖民)是积极和令人向往的,但作者质疑这种社会对普通人和文化艺术的价值,暗示马斯克的愿景可能创造一个缺乏人文关怀的社会。

    4. I can hear the rabid Elon fan defending him about Tesla patents or the Twitter algorithm or something, but those are not serious open source projects.

      大多数人认为埃隆·马斯克的开源贡献(如特斯拉专利)是值得称赞的,但作者认为这些并非真正的开源项目,暗示马斯克的开源承诺是表面性的,与真正的开源精神(如Linux和Kubernetes)有本质区别。

    5. Even the ideal version, industrial megaprojects at hyperhuman scale while constantly being out over your skis with leverage sounds hellish.

      大多数人认为大型AI项目和工业规模的发展是进步和繁荣的象征,但作者认为这种超人类规模的项目听起来像是地狱般的体验,因为它可能导致过度杠杆化和不可持续的压力。

    1. Commoditizing complements doesn't always work because focus is scarce even for the largest, fastest growing businesses.

      大多数人认为科技巨头拥有无限资源实施各种战略,但作者指出即使是最大、增长最快的企业也面临注意力稀缺问题。这一观点挑战了规模经济理论,暗示过度扩张可能导致核心竞争力的稀释。

    2. Some categories never developed a competitive response to this strategy : email, advertising infrastructure, user-generated video.

      大多数人认为所有商业领域都有能力应对颠覆性竞争,但作者指出某些类别如电子邮件、广告基础设施等从未找到有效的竞争对策。这暗示了某些市场结构可能存在根本性弱点,无法通过传统竞争策略应对免费化浪潮。

    1. Several correlated but not strictly identical changes happened over the same few months: scaling inference compute, heavier use of RL in post-training, and models producing reasoning tokens.

      大多数人可能将AI能力加速归因于单一因素(如模型规模增大),但作者指出这是多种因素共同作用的结果,包括推理计算扩展、强化学习在训练后阶段的使用增加以及模型生成推理标记的能力。这一多元归因挑战了单一因素决定论。

    2. Tasks where correctness is harder to verify may not have seen the same speedup, so the acceleration we document here may not be as general as the headline numbers suggest.

      大多数人可能被媒体报道的AI加速数据所影响,认为所有AI任务都在加速,但作者明确指出,那些正确性难以验证的任务可能没有相同的加速速度。这一观点挑战了人们对AI能力普遍加速的乐观预期。

    3. The three metrics where we find acceleration are concentrated in programming and mathematics. These are areas that labs have explicitly targeted for improvement, and they share an important property: correctness is easy to verify automatically.

      大多数人可能认为AI能力的加速是跨领域普遍发生的,但作者指出加速主要集中在编程和数学领域,因为这些领域正确性容易自动验证。这一发现挑战了人们对AI能力普遍提升的假设,暗示加速可能是有选择性的。

    4. Our fourth metric, an index constructed from WeirdML V2 results, showed no sign of acceleration. A single global linear trend fit the data best.

      大多数人可能认为所有AI能力指标都应该同步加速,但作者发现WeirdML V2指标没有显示出任何加速迹象,最佳拟合仍是简单的全局线性趋势。这一发现表明AI能力的加速并不是普遍现象,而是特定于某些任务领域。

    5. Reasoning models show both a one-off jump in performance and a roughly 2-3x faster trend compared to non-reasoning models.

      大多数人认为不同AI模型之间的性能差异是渐进式的,但作者发现推理模型不仅一次性实现了性能跃升,而且以比非推理模型快2-3倍的速度持续进步。这一发现挑战了人们对AI模型性能提升方式的常规理解。

    6. Three of the four metrics (ECI, log METR 50% time horizon, and a math-focused index we constructed from several math benchmarks) show strong evidence that progress has sped up relative to a global linear trend fit to data from 2023 onward.

      大多数人认为AI能力提升是渐进式的线性发展,但作者通过数据分析发现,在三个关键指标上,AI能力实际上已经加速,这挑战了人们对AI发展速度的普遍认知。这种加速现象发生在2023年之后,与推理模型的发布时间点吻合。

    7. Several correlated but not strictly identical changes happened over the same few months: scaling inference compute, heavier use of RL in post-training, and models producing reasoning tokens.

      大多数人可能将AI进步归因于单一因素(如模型规模或数据量),但作者指出推理能力的提升是多种因素共同作用的结果,包括推理计算扩展、强化学习更广泛应用以及模型产生推理标记等。这挑战了人们对AI进步驱动因素的认知。

    8. Tasks where correctness is harder to verify may not have seen the same speedup, so the acceleration we document here may not be as general as the headline numbers suggest.

      主流媒体和公众可能认为AI能力在所有领域都在加速提升,但作者明确指出,在正确性难以验证的任务中可能没有相同的加速现象。这一观点挑战了人们对AI进步普遍性的假设。

    9. WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.

      大多数人可能认为所有AI评估指标都会反映相同的进步趋势,但研究发现WeirdML V2指标没有显示加速,因为它设置了资源限制环境,而近期强化学习训练并未关注此类设置。这表明AI进步可能受评估方法的影响。

    10. The three metrics where we find acceleration are concentrated in programming and mathematics. These are areas that labs have explicitly targeted for improvement, and they share an important property: correctness is easy to verify automatically.

      主流观点可能认为AI能力在各个领域的提升是均衡的,但作者指出加速现象主要集中在编程和数学领域,因为这些领域的正确性容易自动验证。这暗示AI进步可能不是普遍性的,而是集中在特定可量化的领域。

    1. Within eight days, the same campaign had cascaded from GitHub Actions to Docker Hub, npm, PyPI, and the VS Code extension marketplace. With just one token across five ecosystems, thousands of organizations were potentially impacted.

      大多数人认为软件供应链攻击通常是针对特定生态系统或缓慢扩散的,但作者展示了跨生态系统的快速级联攻击。这种攻击速度和范围远超传统认知,表明现代软件供应链的脆弱性被严重低估。

    2. Modern-day security tooling looks for the wrong things. Most software composition analysis tools work by checking your dependencies against a database of known vulnerabilities – CVEs. But a deliberately planted backdoor doesn't have a CVE.

      大多数安全团队依赖CVE数据库来评估风险,但作者指出这种方法对故意植入的后门完全无效。这一观点挑战了行业共识,暗示现有安全工具在新型供应链攻击面前已经过时,需要转向行为分析等新方法。

    3. The result is a mismatch that should terrify anyone building software: the attack surface is expanding faster than any human can monitor, and the entities making dependency decisions are increasingly not human.

      大多数人认为安全问题可以通过增加人力监控和审查来解决,但作者认为在AI时代,攻击面扩展速度已经超过了人类监控能力,且依赖决策越来越由AI而非人类做出。这一观点挑战了传统安全理念,暗示需要全新的自动化防御机制。

    1. Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve.

      大多数人认为解决长期未解的数学难题需要顶尖数学家的专业知识和多年研究,但作者认为一个业余爱好者通过AI就做到了,这挑战了数学专业壁垒的传统观念。

    2. An AI researcher subsequently gifted them each a ChatGPT Pro subscription to encourage their 'vibe mathing.'

      大多数人认为严肃的数学研究需要严谨的方法和深厚的专业知识,但作者使用'vibe mathing'这种非正式术语描述这种研究方式,挑战了学术研究方法论的传统规范。

    3. We have discovered a new way to think about large numbers and their anatomy. It's a nice achievement. I think the jury is still out on the long-term significance.

      大多数人认为AI的数学突破具有重大意义,但作者认为其长期意义尚不确定,这挑战了人们对AI数学成就重要性的普遍预期,暗示技术突破不一定等同于长期价值。

    4. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

      大多数人认为数学突破需要全新的理论或方法,但作者认为AI只是将已知公式应用到新领域就能取得突破,这挑战了人们对数学创新本质的理解,暗示创新有时来自于跨领域应用而非全新创造。

    5. Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training.

      大多数人认为解决重大数学问题需要深厚的专业训练和多年经验,但作者认为一个23岁没有高级数学训练的业余人士也能解决60年悬而未决的问题,这挑战了学术界对专业资质的传统认知。

    6. Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training.

      大多数人认为解决复杂的数学问题需要深厚的专业训练和多年经验,但作者认为一个没有高级数学训练的23岁年轻人仅凭AI工具就能解决困扰顶级数学家60年的问题,这挑战了数学领域的专业壁垒认知。

    7. What he does have is a ChatGPT Pro subscription, which gives him access to the latest large language models from OpenAI.

      大多数人认为数学成就主要依赖于个人智力和训练,但Price的成功关键是他拥有AI工具访问权限,这暗示在未来的数学领域,技术资源可能比个人能力更重要,挑战了传统天才观念。

    8. Lichtman tried to prove this, too, but got stuck like everyone else before him.

      大多数人认为数学突破来自于持续不断的努力和渐进式改进,但Lichtman和其他专家的失败表明,有时问题不在于努力程度而在于思维方式的局限,这挑战了我们对数学进步过程的认知。

    9. An AI researcher subsequently gifted them each a ChatGPT Pro subscription to encourage their 'vibe mathing.'

      大多数人认为严肃的数学研究需要严谨的方法和深厚的理论基础,但研究人员用'vibe mathing'这种非正式方式描述他们的工作,暗示数学发现可能源于看似随性的探索而非严格的规划。

    10. I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them. And this new method is really confirming that intuition.

      大多数人认为数学问题是孤立的,需要不同的方法解决,但Lichtman的直觉表明这些问题可能有内在联系,AI的发现证实了这一观点,暗示数学领域可能存在尚未被发现的深层统一性。

    11. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

      大多数人认为数学突破需要全新的理论或方法,但AI的解决方案使用了已知公式只是应用到了新领域,这表明创新可能更多来自于跨领域应用而非全新发明,挑战了我们对数学创新本质的理解。

    12. Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training.

      大多数人认为解决复杂的数学难题需要深厚的专业训练和多年经验,但这个案例表明,一个没有高级数学训练的23岁年轻人仅通过AI工具就解决了困扰顶尖数学家60年的问题,挑战了专业知识在数学突破中的必要性。

    13. I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them. And this new method is really confirming that intuition.

      大多数人认为数学问题是孤立且独特的,每个问题需要专门的解决方法,但作者认为AI的发现证实了数学问题之间存在某种统一性和关联性,这挑战了人们对数学问题独立性的传统认知。

    14. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

      大多数人认为数学突破需要全新的理论和创新方法,但作者认为AI能够通过重新组合和应用现有知识来解决问题,这挑战了人们对创新必须来自全新理论的认知,展示了AI独特的知识连接能力。

    1. This card was updated on April 24, 2026, to include additional information about safeguards for the deployment of GPT‑5.5 and GPT‑5.5 Pro in the API.

      大多数人认为系统卡应该在发布时包含所有相关信息,不需要后续更新,但OpenAI在发布后仅一天就更新了系统卡以增加API部署的安全措施信息。这挑战了科技产品文档管理的常规做法,暗示AI安全措施是动态发展的,需要持续调整,这违背了传统软件发布中'文档一次性完成'的共识。

    2. We separately evaluate GPT‑5.5 Pro in certain cases because we judge that the setting could materially impact the relevant risks or appropriate safeguards posture.

      大多数人认为如果两个模型使用相同的基础架构,它们的风险和安全需求应该相似,但OpenAI明确表示GPT-5.5 Pro需要单独评估,因为'设置可能显著影响相关风险或适当的安全措施立场'。这挑战了AI评估领域普遍认为的'相同基础模型的安全特性一致'的共识,暗示即使是微小的设置变化也可能导致显著不同的风险特征。

    3. We are releasing GPT‑5.5 with our strongest set of safeguards to date, designed to reduce misuse while preserving legitimate, beneficial uses of advanced capabilities.

      大多数人认为更强的安全限制会不可避免地限制AI的功能和实用性,但OpenAI声称他们能够同时实现'减少滥用'和'保留合法、有益的高级功能使用'。这挑战了AI安全领域普遍存在的'安全与功能之间存在权衡'的共识,暗示他们已经找到了一种创新的方法,可以在不牺牲功能的情况下增强安全性。

    4. GPT‑5.5 understands the task earlier, asks for less guidance, uses tools more effectively, checks it work and keeps going until it's done.

      大多数人认为AI模型需要持续的人工指导和监督才能完成复杂任务,但作者声称GPT-5.5能够'理解任务更早,要求更少的指导,更有效地使用工具,检查工作并持续进行直到完成'。这挑战了AI领域普遍认为的'当前AI系统仍需大量人类监督'的共识,暗示GPT-5.5已经实现了更高程度的自主性。

    5. We subjected the model to our full suite of predeployment safety evaluations and our Preparedness Framework, including targeted red-teaming for advanced cybersecurity and biology capabilities

      大多数人认为AI安全评估主要集中在防止直接有害输出,但OpenAI特别强调了对'高级网络生物学能力'的针对性红队测试。这暗示GPT-5.5可能具有比预期更强大的生物相关能力,这违背了AI领域普遍认为的'语言模型主要处理文本信息'的共识,表明AI已经深入到专业科学领域。

    1. Testing universal jailbreaks for biorisks in GPT‑5.5

      大多数人认为AI安全测试应专注于防止有害内容生成,但OpenAI主动邀请研究人员寻找'通用越狱方法'来突破生物安全限制,这挑战了传统安全思维,表明他们认为主动寻找漏洞比被动防御更有效。

    1. We believe this is what drove the separate reports of usage limits draining faster than expected.

      大多数人会直接将API使用量异常归因于用户行为或模型本身,但作者揭示了一个技术实现细节(缓存bug)如何间接导致使用量异常。这挑战了常规问题归因逻辑,展示了系统组件间的意外互动如何产生看似无关的问题表象。

    2. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7.

      大多数人认为微小的系统提示变更只会带来微不足道的影响,但作者展示了一个看似微不足道的提示变更(限制字数)却导致了3%的性能下降。这挑战了'小变更小影响'的直觉认知,揭示了AI系统中微小变化可能带来的非线性影响。

    3. After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16.

      大多数人认为充分的内部测试可以预防产品发布后的重大问题,但作者展示了一个经过数周内部测试且没有发现问题的系统提示变更却导致了明显的质量下降。这挑战了'测试覆盖率等于产品质量'的传统观念,暗示了评估指标与实际用户体验之间可能存在巨大鸿沟。

    4. Two unrelated experiments made it challenging for us to reproduce the issue at first: an internal-only server-side experiment related to message queuing; and an orthogonal change in how we display thinking suppressed this bug in most CLI sessions

      大多数人认为复杂的系统测试流程应该能够发现大多数关键缺陷,但作者展示了即使有多重测试机制,两个看似无关的实验如何协同掩盖了一个严重bug。这挑战了'全面测试能保证产品质量'的传统认知,揭示了系统复杂性带来的意外风险。

    5. In our internal evals and testing, medium effort achieved slightly lower intelligence with significantly less latency for the majority of tasks.

      大多数人认为内部评估和测试足以代表用户真实体验,但作者承认他们的内部测试未能准确捕捉到用户对AI智能度的实际感知差异。这暗示了实验室环境与实际使用场景之间存在根本性脱节,挑战了传统产品测试方法论的有效性。

    6. We reverted this change on April 7 after users told us they'd prefer to default to higher intelligence and opt into lower effort for simple tasks.

      大多数人认为AI系统应该优化速度和效率,但作者认为用户更愿意默认选择更高智能而非更低延迟,这挑战了产品优化的常规思维。用户宁愿忍受偶尔的延迟也要换取更高的代码质量,这违背了大多数科技公司追求'更快更省'的常规做法。

    1. The products will need to get worse, more expensive, or both if VCs are to get their money back.

      主流观点认为科技公司会通过产品创新和改进来提高价值,但作者直言AI公司可能需要让产品变得更差或更昂贵才能满足投资者回报要求,这挑战了科技行业进步的叙事,揭示了资本压力与产品价值之间的潜在冲突。

    1. the system achieved this training result more than 20 times faster than conventional synchronization methods.

      大多数人认为分布式训练由于需要同步和通信,必然比单机训练慢,但作者认为Decoupled DiLoCo比传统同步方法快20倍以上,这挑战了人们对分布式训练速度的固有认知,展示了异步计算的潜力。

    2. chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training.

      大多数人认为混合不同代际的硬件进行训练会降低性能或效率,但作者认为即使不同代际、不同速度的芯片混合使用,仍能达到与单一芯片类型训练相同的机器学习性能,这挑战了硬件必须同质化的行业共识。

    3. With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of 'goodput', or useful training, while that of other approaches nosedives.

      大多数人认为硬件故障会显著降低分布式训练的效率和性能,但作者认为即使在硬件故障率极高的环境下,Decoupled DiLoCo仍能保持88%的有效训练率,而传统方法则暴跌至27%,这挑战了人们对故障容忍能力的传统认知。

    4. By dividing large training runs across decoupled 'islands' of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can keep learning efficiently.

      大多数人认为分布式AI训练需要高度同步和紧密耦合的系统才能保证效率,但作者认为通过解耦的'计算岛屿'架构,即使局部硬件故障,系统其他部分仍能高效学习,因为故障被隔离了。这挑战了传统分布式训练必须保持同步的主流认知。

    1. Amazon is investing $5 billion in Anthropic today, with up to an additional $20 billion in the future. This builds on the $8 billion Amazon has previously invested.

      大多数人认为科技巨头对AI公司的投资通常在数亿级别,但Amazon对Anthropic的总投资可能高达330亿美元,这远超行业共识。这种规模的投资表明科技巨头对AI基础设施的重视程度和投入规模正在以前所未有的方式增长,可能重塑AI行业的资本结构和竞争动态。

    2. Claude remains the only frontier AI model available to customers on all three of the world's largest cloud platforms: AWS (Bedrock), Google Cloud (Vertex AI), and Microsoft Azure (Foundry).

      大多数人认为AI模型通常会与单一云平台深度绑定,形成生态系统锁定,但Claude同时出现在三大云平台上,这挑战了AI行业平台绑定策略的主流认知。这种多平台策略可能预示着AI模型提供商正寻求更大的市场覆盖和避免单一平台依赖,改变行业竞争格局。

    3. Anthropic will also use incremental capacity for Claude in Amazon Bedrock. The agreement includes expansion of inference in Asia and Europe to better serve Claude's growing international customer base.

      大多数人认为AI模型主要在美国市场发展,但Anthropic明确表示正在大力扩展亚洲和欧洲市场,这挑战了AI服务主要集中在美国的共识。这种全球扩张速度表明AI市场的地理分布正在迅速多元化,可能重塑全球AI产业格局。

    4. Our run-rate revenue has now surpassed $30 billion, up from approximately $9 billion at the end of 2025.

      大多数人认为AI公司仍处于烧钱阶段,难以实现盈利,但Anthropic的收入在短短几个月内增长了三倍多,达到300亿美元的年化收入。这一惊人的增长速度挑战了AI行业普遍亏损的共识,表明AI模型商业化可能比预期更快、规模更大。

    5. We have signed a new agreement with Amazon that will deepen our existing partnership and secure up to 5 gigawatts (GW) of capacity for training and deploying Claude

      大多数人认为AI公司主要依赖通用GPU芯片训练模型,但Anthropic与Amazon的合作表明他们正大规模采用专用AI芯片(Trainium),这挑战了行业对通用芯片依赖的主流认知。5GW的容量远超大多数AI公司的规模,反映了专用芯片在AI训练中的经济性和效率优势正在被重新评估。

    1. The Prompt API uses the Gemini Nano model in Chrome. While the API is built into Chrome, the model is downloaded separately the first time an origin uses the API.

      大多数人认为内置API应该包含所有必要组件,无需额外下载,但作者明确指出模型需要单独下载。这与人们对'内置'API应该即开即用的普遍认知相悖,暗示用户首次使用时可能会面临显著的下载时间和存储压力。

    2. The Prompt API for the web is still being developed. While we build this API, refer to our best practices on session management for optimal performance.

      大多数人认为浏览器AI功能应该是成熟且生产就绪的,但作者明确表示该API仍在开发中。这与人们对Chrome作为成熟浏览器应该提供稳定可靠功能的认知相悖,暗示AI功能可能还不够稳定,需要开发者额外注意性能优化。

    3. The network requirement is only for the initial download of the model. Subsequent use of the model does not require a network connection. No data is sent to Google or any third party when using the model.

      大多数人认为使用Google的AI模型必然会涉及数据传输和隐私问题,但作者强调模型完全在设备上运行且不向Google发送数据。这与人们对大型科技公司AI服务通常涉及数据收集的普遍认知相悖,暗示Chrome的AI功能可能比想象的更加注重隐私保护。

    4. The Prompt API isn't available in Web Workers for now, due to the complexity of establishing a responsible document for each worker in order to check the permissions policy status.

      大多数人认为现代浏览器API应该支持Web Workers以实现并行处理,但作者明确表示Prompt API不支持Web Workers。这与人们对浏览器API应该全面支持现代Web开发模式的认知相悖,限制了开发者在后台线程中使用AI的能力。

    1. Microsoft continues to participate directly in OpenAI's growth as a major shareholder.

      大多数人认为在修改了合作协议后,微软可能会减少其在OpenAI的股权投资,但作者认为微软仍然是OpenAI的主要股东,这表明尽管合作关系有所调整,但双方仍然保持着深度的利益绑定,这可能是一种非传统的长期战略伙伴关系模式。

    2. Revenue share payments from OpenAI to Microsoft continue through 2030, independent of OpenAI's technology progress, at the same percentage but subject to a total cap.

      大多数人认为随着OpenAI技术的发展,其对微软的支付可能会增加或调整,但作者认为OpenAI对微软的支付将保持固定比例且有上限,这表明OpenAI正在寻求更可预测的财务安排,不受技术进步的影响,这可能是一种反直觉的风险管理策略。

    3. Microsoft will continue to have a license to OpenAI IP for models and products through 2032. Microsoft's license will now be non-exclusive.

      大多数人认为微软会寻求对OpenAI技术的独家使用权,以保持其在AI领域的竞争优势,但作者认为微软的许可权变为非独家,这打破了传统科技合作中的排他性模式,暗示OpenAI正在向更开放的合作方式转变,可能为其他合作伙伴铺平道路。

    4. Microsoft will no longer pay a revenue share to OpenAI.

      大多数人认为微软作为OpenAI的主要投资者和合作伙伴,会继续通过收入分成来支持OpenAI的发展,但作者认为微软已经改变了这一模式,这可能表明微软认为OpenAI的技术已经足够成熟,不再需要这种财务激励,或者微软有其他方式从合作中获益。

    5. OpenAI can now serve all its products to customers across any cloud provider.

      大多数人认为OpenAI会完全依赖微软Azure云服务,因为微软是其主要投资者和合作伙伴,但作者认为OpenAI现在拥有了多云策略的灵活性,这打破了科技巨头间典型的排他性合作模式,暗示OpenAI正在寻求更大的自主权和市场机会。

    1. The compliance-driven buyers improvising local AI out of retail Mac Minis because the product they need does not exist.

      大多数人认为企业AI采用需要专门的解决方案和供应商,但作者指出一些合规驱动的买家正在使用零售版Mac Mini自行构建本地AI解决方案。这挑战了企业AI市场的传统认知,暗示市场可能存在未被满足的需求,以及企业正在以非传统方式应对AI挑战。

    2. Why the company that moved computing off the mainframe fifty years ago is making the same structural move with AI, and what that predicts.

      大多数人将苹果的AI战略视为孤立的商业决策,但作者将其与苹果历史上将计算从大型机转移到个人电脑的战略相提并论。这提供了一个反直觉的历史视角,暗示苹果可能正在引领AI从集中式云服务向分布式设备端的范式转变,挑战了当前AI行业向云端集中化的主流趋势。

    3. The question it forces is not which model is best. It is who owns the inference layer your organization depends on, what happens when the economics of that layer stop being subsidized, and whether the thing in your pocket turns out to matter more than the thing in the datacenter.

      大多数人关注AI模型本身的性能和优势,但作者认为真正关键的是谁拥有推理层以及其经济可持续性。这挑战了当前AI行业的主流关注点,暗示未来竞争的核心将从模型本身转向推理层的控制和成本结构,这是一个反直觉的视角转换。

    4. The structural cost problem in AI inference that makes Apple's on-device bet defensible, not just defensive.

      大多数人认为苹果转向设备端AI只是防御性策略,因为他们在云AI领域落后,但作者认为这是基于对AI推理层经济结构问题的深刻理解而做出的主动选择。这挑战了主流对苹果AI战略的看法,暗示设备端AI可能比我们想象的更具经济优势。

    5. The board looked at the AI race Apple was losing and, rather than try harder at the thing that was failing, changed which game the company plays.

      大多数人认为面对竞争失败,公司应该加倍投入资源在原有领域追赶,但作者认为苹果选择了完全不同的策略——改变游戏规则而非在原有规则下竞争。这挑战了传统商业战略思维,暗示苹果可能正在从云AI转向设备端AI,这是一种颠覆性的战略转向。

    6. For a company that spent fifteen years running a functional model where no single discipline owned a product, putting two hardware engineers at the top is not a personnel decision. It is a structural break.

      大多数人认为苹果的高层变动只是常规的人事调整,但作者认为这是苹果在AI竞争中失败后采取的结构性变革,反映了公司战略的根本转变。这挑战了我们对科技公司领导层变动的常规认知,暗示苹果正在从功能型组织转向以硬件为中心的结构,以应对AI挑战。

    1. This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

      大多数人认为基准测试分数的提高意味着模型实际能力的提升。但作者明确表示,SWE-bench Verified的改进不再反映模型真实软件开发能力的进步,而是更多地反映了模型在训练时接触该基准测试的程度。这一结论挑战了整个AI评估体系的有效性,暗示我们可能需要重新思考如何衡量AI的真实进步。

    2. Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

      大多数人认为代码测试是客观公正的,能够准确评估模型的真实能力。但作者发现,近60%的测试案例存在缺陷,会拒绝功能上正确的解决方案。这一发现挑战了AI评估领域的共识,表明我们广泛使用的基准测试可能存在系统性问题,无法准确反映模型的实际编程能力。

    1. Our RL infra team used a K2.6-backed agent that operated autonomously for 5 days, managing monitoring, incident response, and system operations, demonstrating persistent context, multi-threaded task handling, and full-cycle execution from alert to resolution.

      大多数人认为AI代理系统难以长时间持续运行,通常会面临注意力分散、上下文丢失或性能下降的问题。但作者展示的AI系统能够连续5天自主管理复杂的技术运维工作,这挑战了人们对AI代理持续运行能力的传统认知,暗示AI可能已经具备接近人类的持久工作能力。

    2. Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour execution, the model iterated through 12 optimization strategies, initiating over 1,000 tool calls to precisely modify more than 4,000 lines of code.

      大多数人认为AI在复杂工程任务中仍需要人类专家的指导和监督,难以独立完成大规模系统重构。但作者展示了AI能够自主分析、优化并重构一个运行8年的金融系统,这挑战了人们对AI工程能力的传统认知,暗示AI可能已经具备系统级架构设计和优化的能力。

    1. NEC will establish a Center of Excellence to develop a highly skilled, AI-enabled engineering organization

      大多数人认为AI会使专业知识和技能贬值,但作者认为AI实际上需要更高水平的工程专业知识,因为企业正在建立专门的卓越中心来培养AI技能,这表明AI工具正在提升而非降低工程工作的专业门槛。

    2. As part of its long-running Client Zero initiative, in which NEC serves as its own first customer before offering its technology to clients

      大多数人认为企业会先开发产品然后内部使用,但作者认为NEC采用了反向策略,先内部大规模应用AI技术然后再向客户推广,这表明企业正在采用更激进的方法来验证和改进AI解决方案,挑战了传统的产品开发流程。

    3. NEC aims to build one of Japan's largest AI-native engineering teams, who will use Claude Code in their work.

      大多数人认为AI会取代大量工程师职位,但作者认为AI实际上是在创造新的工程角色和技能需求,因为NEC正在积极建立一支大规模的AI原生工程团队,这表明AI工具正在增强而非替代工程能力,创造新的就业机会。

    1. Claude packages everything into a handoff bundle that you can pass to Claude Code with a single instruction.

      大多数人认为设计和开发是两个分离的专业领域,需要专门的交接流程和工具,但作者暗示AI可以实现从设计到开发的无缝单指令转换。这一观点挑战了软件开发与设计之间的传统界限,暗示AI可能重新定义跨职能协作的方式。

    2. Our most complex pages, which took 20+ prompts to recreate in other tools, only required 2 prompts in Claude Design.

      大多数人认为复杂的设计任务需要更多的提示和人工干预,但作者声称他们的AI工具能用更少的提示完成更复杂的设计。这一观点挑战了人们对AI设计工具复杂度与输入量关系的普遍认知,暗示AI可能在某些方面比人类更擅长处理复杂性。

    3. What used to take a week of back-and-forth between briefs, mockups, and review rounds now happens in a single conversation.

      大多数人认为设计过程必然需要多轮迭代和长时间沟通,但作者声称AI可以将这一过程缩短到单次对话完成。这一观点挑战了设计工作流程的传统认知,暗示AI可能彻底改变设计协作的时间框架和效率预期。

    4. Claude Design gives designers room to explore widely and everyone else a way to produce visual work.

      大多数人认为设计专业技能是创造高质量视觉作品的必要条件,但作者认为AI工具可以让非专业人士也能生产专业水平的视觉作品。这一观点挑战了设计专业性的传统观念,暗示专业技能可能不再是高质量设计的唯一门槛。

    5. Even experienced designers have to ration exploration—there's rarely time to prototype a dozen directions, so you limit yourself to a few.

      大多数人认为专业设计师拥有充分的创意自由和资源来探索多种设计方案,但作者认为即使是经验丰富的设计师也受到时间和资源的严重限制,只能探索少数几个方向。这一观点挑战了人们对设计行业创意过程的普遍认知,揭示了设计实践中的现实约束。

    1. The gains are especially strong in agentic coding, computer use, knowledge work, and early scientific research—areas where progress depends on reasoning across context and taking action over time.

      大多数人认为AI进步主要体现在特定领域的知识获取和模式识别上,而非跨上下文的推理和长期行动能力。但作者强调GPT-5.5在需要持续推理和行动的领域取得显著进步,这一观点挑战了AI能力发展的主流叙事,暗示通用智能可能比预期更早实现。

    2. GPT‑5.5 found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers, later verified in Lean. The result is a concrete example of GPT‑5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area.

      大多数人认为AI在数学研究领域仅能辅助计算或提供解释,无法独立进行创造性数学推理。但作者展示GPT-5.5能够发现并证明数学定理,这一突破挑战了数学研究作为纯粹人类活动的传统观念,暗示AI可能成为真正的'研究伙伴'而非仅是工具。

    3. We are treating the biological/chemical and cybersecurity capabilities of GPT‑5.5 as High under our Preparedness Framework. While GPT‑5.5 didn't reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT‑5.4.

      大多数人认为AI在网络安全领域的应用主要局限于防御辅助,而非直接参与核心安全任务。但作者暗示GPT-5.5已具备'高级'网络安全能力,这一分类表明AI已从被动防御工具向主动安全参与者转变,挑战了网络安全领域对人类主导地位的认知。

    4. Losing access to GPT‑5.5 feels like I've had a limb amputated.

      大多数人将AI工具视为辅助性资源,失去后只会带来不便而非功能丧失。但这位NVIDIA工程师的比喻表明,GPT-5.5已从辅助工具转变为不可或缺的'认知延伸',这种依赖程度远超当前主流认知中人与AI的关系定位,暗示了人机协作范式的根本性转变。

    5. GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.

      大多数人认为更强大的AI模型必然伴随着更高的计算成本和更慢的响应速度,但作者认为GPT-5.5打破了这一规律,实现了更高的智能水平与相同的延迟时间并存。这一反直觉的发现挑战了AI领域'能力与效率成反比'的传统认知,暗示模型架构优化可能比单纯扩大规模更有效。

    1. Jeremy didn't get laid off. He got leveraged.

      大多数人认为在裁员潮中,高额使用AI工具的员工可能会被视为成本负担而被裁掉,但作者提出了一个颠覆性的观点:像Jeremy这样大量使用AI工具的员工不仅没有被裁员,反而获得了更大的杠杆效应和影响力。这挑战了人们对AI成本与价值的传统认知。

    2. A US lab would never; well, unless you count a code red or Meta's throw money at the problem moves.

      大多数人认为美国AI实验室会始终保持技术领先优势并公开承认自己的不足,但作者暗示美国实验室(尤其是Meta)只会通过大量投入资金来掩盖技术差距,而非公开承认落后。这种观点挑战了人们对美国科技企业透明度和创新能力的传统认知。

    1. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.

      大多数人认为AI模型升级应该提高效率,减少资源消耗。但作者指出Claude Opus 4.7实际上会产生更多输出token,消耗更多计算资源。这种'效率降低'换取'可靠性提高'的权衡挑战了人们对AI发展必然带来效率提升的认知,表明在某些场景下,模型可能需要更多思考才能达到更好的结果。

    2. Our alignment assessment concluded that the model is 'largely well-aligned and trustworthy, though not fully ideal in its behavior'. Note that Mythos Preview remains the best-aligned model we've trained according to our evaluations.

      大多数人可能会认为最新、最强大的AI模型应该在对齐和安全性方面表现最好。但作者明确指出,虽然Claude Opus 4.7功能强大,但在对齐方面反而不如之前的Mythos Preview模型。这一反直觉的结论挑战了'能力越强,对齐越好'的普遍假设,暗示AI发展可能存在能力与对齐之间的权衡。

    3. On some measures, such as honesty and resistance to malicious 'prompt injection' attacks, Opus 4.7 is an improvement on Opus 4.6; in others (such as its tendency to give overly detailed harm-reduction advice on controlled substances), Opus 4.7 is modestly weaker.

      大多数人认为AI模型的每个新版本都应该在所有安全指标上都有进步。但作者明确指出Claude Opus 4.7在某些安全方面反而比前代模型表现更弱,这挑战了人们对AI安全线性进步的假设。这种非线性的安全表现表明,模型能力的提升可能伴随着某些方面的权衡,而非全面增强。

    4. Opus 4.7 is better at using file system-based memory. It remembers important notes across long, multi-session work, and uses them to move on to new tasks that, as a result, need less up-front context.

      大多数人认为AI模型在长对话中会逐渐'忘记'早期信息,需要不断重复上下文。但作者认为Claude Opus 4.7能够跨会话记忆重要信息,这挑战了人们对AI短期记忆局限的认知。这种持久记忆能力意味着AI可以真正进行长期项目,而不需要用户不断重复提供背景信息。

    5. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally.

      大多数人认为AI模型应该越来越能理解用户的意图,即使指令表达不够精确也能灵活处理。但作者认为Claude Opus 4.7反而更严格地遵循字面指令,这可能导致用户为旧模型编写的提示产生意外结果。这种'过度遵从'实际上是一种反直觉的进步,因为它减少了模型对用户意图的推测,增加了可预测性。