Btw, I think GLM-5.1 was trying to do something very ambitious here, and failed because it fumbled the step size.
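Surprising: GLM-5.1, an advanced AI model, failed over a technical detail as basic as mishandling the step size. This suggests that even top-tier AI can stumble at the level of basic execution, not only in conceptual design.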
We built an automated scanning agent that systematically audited eight of the most prominent AI agent benchmarks, including SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench, and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.
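Surprising: the automated scanning tool the researchers built found that all eight mainstream AI agent benchmarks are exploitable, so near-perfect scores can be obtained without solving any task. This points to a systemic problem in AI evaluation: nearly all benchmarks currently in use are unreliable.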
Decades spent educating researchers have had little or no influence on beliefs and practice (Schmidt and Hunter, 1997, pp. 20–22).
Calls for reform fall on deaf ears
NHST has been severely criticized for more than 50 years by end users to whom fair statistical communication matters.