LLM benchmarks are essential for tracking progress and ensuring safety in AI, but most benchmarks don't measure what matters. The paper concludes that most benchmarks used to establish LLM progress are mistargeted: they leave out the aspects that actually matter.
Paper: "Measuring what Matters: Construct Validity in Large Language Model Benchmarks", NeurIPS 2025 (https://neurips.cc/Conferences/2025). Saved to Zotero.
let's do some measurements using Google Chrome's audit tool (the Lighthouse panel in DevTools)
the benchmarks that Rich chose weren't even remotely good ones: they had obvious flaws that even their authors acknowledge, and Svelte's implementation actually cheats at what was being tested.
compared the speed of DeviceDetector with the two most popular user-agent parsers in the Ruby community, Browser and UserAgent.
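A minimal sketch of how such a comparison could be run with the benchmark-ips gem, assuming the device_detector, browser, and useragent gems are installed; the sample UA string and the attribute read from each parser are illustrative, not the original benchmark setup.

```ruby
require "benchmark/ips"
require "device_detector"
require "browser"
require "useragent"

# Illustrative user-agent string (assumed); a fairer test would iterate
# over a corpus of real UAs rather than a single one.
UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " \
     "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

Benchmark.ips do |x|
  # Each report parses the UA and reads the browser name,
  # so the work done per iteration is roughly comparable.
  x.report("device_detector") { DeviceDetector.new(UA).name }
  x.report("browser")         { Browser.new(UA).name }
  x.report("useragent")       { UserAgent.parse(UA).browser }
  x.compare!
end
```

benchmark-ips reports iterations per second for each block, and `compare!` prints a relative ranking at the end.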