7 Matching Annotations
  1. Apr 2026
    1. Tests reject correct solutions: We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions.

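      Most people assume that code tests are objective and fair and that they accurately measure a model's real ability. The authors found, however, that at least 59.4% of the problems they audited have flawed test cases that reject functionally correct solutions. This challenges the consensus in AI evaluation: widely used benchmarks may have systematic problems and may not accurately reflect how well models can actually program.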
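To make the failure mode concrete, here is a minimal sketch of how a test can reject a functionally correct submission. The task, the `dedupe` function, and both tests are hypothetical illustrations and are not taken from the audited dataset. The sketch assumes the task only asks for the distinct values of a list and says nothing about their order.

```python
def dedupe(items):
    """A correct submission: return the distinct values of `items`.

    The (hypothetical) task statement does not specify an ordering,
    so returning the values sorted is a valid answer.
    """
    return sorted(set(items))


def flawed_test(solution):
    # Over-specified: insists on first-seen order, which the task never required.
    return solution([3, 1, 3, 2]) == [3, 1, 2]


def sound_test(solution):
    # Checks only the stated contract: the right values, each exactly once.
    result = solution([3, 1, 3, 2])
    return sorted(result) == [1, 2, 3]


if __name__ == "__main__":
    print("flawed test accepts:", flawed_test(dedupe))
    print("sound test accepts:", sound_test(dedupe))
```

The flawed test encodes one reference implementation's incidental behavior rather than the task's requirements, so a benchmark built on such tests undercounts correct solutions.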

    1. A conftest.py file with 10 lines of Python 'resolves' every instance on SWE-bench Verified.

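      Surprisingly, a Python file of just 10 lines is enough to "resolve" every instance on SWE-bench Verified. This exposes a serious hole in the evaluation setup: a model can earn a perfect score through simple code injection without actually solving any problem.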

  2. Apr 2025
  3. Nov 2024
  4. Jul 2023
  5. Mar 2021
  6. Feb 2021
    1. You'll have to forgive me the dusty desk, I currently don't have a carpet in my office so it's almost entirely pointless dusting as it's back to this state within 2 days.

      It's easy to see flaws in yourself, but once you point them out, everyone who had not noticed them before can see them too.