6 Matching Annotations
  1. Last 7 days
    1. After a first wave focused on rapid deployment, organizations now need to revisit those first-generation implementations, and redesign early agent architectures around workflow orchestration, observability, governance, and recovery

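      Most people assume AI agent development should keep pushing forward with new technology, but the author argues that enterprises actually need to go back and rebuild their early implementations, because the rapid-deployment phase neglected the reliability of the underlying architecture. This runs counter to the mainstream 'always move forward' view of AI progress and suggests that AI development may need to pass through a 'rebuilding phase' rather than evolve in a straight line.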

    1. Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

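      Most people assume AI models confidently output flawed code without realizing it, but the author reports that Opus 4.8 is markedly better at checking and flagging flaws in its own code. This challenges the widespread skepticism about models' ability to assess their own work and suggests AI may be more reliable on code quality than people expect.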

  2. May 2026
    1. 90.6% (1,587) have proved to be valid true positives, and 62.4% (1,094) were confirmed as either high- or critical-severity

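      These two figures (90.6% validated, 62.4% confirmed as high or critical severity) are key to judging how reliable the AI model is at detecting security vulnerabilities. A 90.6% validation rate implies a relatively low false-positive rate, which is a strong result in AI security. However, the 62.4% confirmed high-severity rate means that nearly 40% of the vulnerabilities the AI rated as high severity were in fact less severe, which shows that its severity assessment still has room to improve.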
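
As a quick sanity check on the arithmetic, the sketch below back-computes the denominator implied by each quoted figure. It assumes each percentage is simply its count divided by a total number of findings; that total, and the variable names, are not given in the quoted passage. The last line derives the share of valid findings that were high or critical, a number the source does not quote.

```python
# Counts and percentages exactly as quoted in the annotation.
valid_count, valid_pct = 1587, 90.6
severe_count, severe_pct = 1094, 62.4

# Implied totals, ASSUMING each percentage is count / total findings.
total_from_valid = valid_count / (valid_pct / 100)
total_from_severe = severe_count / (severe_pct / 100)

# Derived (not quoted): share of valid findings rated high or critical.
severe_share_of_valid = severe_count / valid_count

print(round(total_from_valid), round(total_from_severe))
print(f"{severe_share_of_valid:.1%}")
```

Both implied totals land within a couple of findings of each other, which suggests the two percentages share one denominator: 62.4% would then be a share of all findings in that total, not only of the valid ones. Whether every one of those findings was originally rated high severity by the model is not stated in the quoted passage.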

    1. Most passing SWE-Bench solutions are not accepted by maintainers.

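      Most people assume that AI systems that pass automated benchmarks such as SWE-Bench will also perform well in practice, but the author points out the opposite: most solutions that pass the tests are not actually accepted by maintainers. This challenges the validity of current AI evaluation and suggests that automated tests may not reflect real-world quality standards.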

  3. Apr 2026
    1. Luna could observe the shop through security camera screenshots, but still made basic mistakes, including selecting the wrong country when hiring a contractor and mismanaging staff schedules during opening weekend.

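      Although AI agents have shown impressive autonomy in real-world operations, they still have clear limitations. This is a reminder that current AI systems remain unreliable in complex real-world situations, especially where detailed judgment and execution are involved. It suggests that commercial deployment of AI agents still needs further technical breakthroughs and testing.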

    1. it almost always traces back to the interface rather than the language model

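      This is a deeply counterintuitive insight: when an AI product is unreliable, the cause is often the interface rather than the model. Where we tend to blame the algorithmic black box, the author argues that good interaction design, which builds in structure and guardrails, can effectively compensate for the model's uncertainty, and that this is the core design challenge today.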