2 Matching Annotations
  1. Last 7 days
    1. Models write sloppy code that works but isn't maintainable. Our eval is first to measure: would you actually merge this code?

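      Most people assume that AI code evaluation should focus on functional correctness, but the authors argue that we should instead evaluate whether the code is actually mergeable, which challenges the consensus behind traditional benchmarks. FrontierCode introduces "mergeability" as a new criterion, focusing on code quality rather than merely passing tests, which is a counterintuitive shift.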

  2. Jul 2020