4 Matching Annotations
  1. Last 7 days
    1. Btw, I think GLM-5.1 was trying to do something very ambitious here, and failed due to fumbling step size

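      Surprisingly, GLM-5.1, an advanced AI model, failed because of a technical detail like "mishandling the step size". This suggests that even top-tier AI can stumble at the level of basic execution, not only in conceptual design.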

    1. We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.

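      Surprisingly, the automated scanning tool built by the researchers found that all eight mainstream AI agent benchmarks have vulnerabilities that allow near-perfect scores without solving any task. This points to a systemic problem across AI evaluation: almost all benchmarks currently in use are unreliable.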

  2. Mar 2026