Models write sloppy code that works but isn't maintainable. Our eval is first to measure: would you actually merge this code?
大多数人认为AI代码评估应该关注功能正确性,但作者认为我们应该评估代码是否真正可合并,这挑战了传统基准测试的共识。FrontierCode引入了'可合并性'这一新标准,关注代码质量而非仅通过测试,这是一个反直觉的转变。
Models write sloppy code that works but isn't maintainable. Our eval is first to measure: would you actually merge this code?
大多数人认为AI代码评估应该关注功能正确性,但作者认为我们应该评估代码是否真正可合并,这挑战了传统基准测试的共识。FrontierCode引入了'可合并性'这一新标准,关注代码质量而非仅通过测试,这是一个反直觉的转变。
Barton, C. M., Alberti, M., Ames, D., Atkinson, J.-A., Bales, J., Burke, E., Chen, M., Diallo, S. Y., Earn, D. J. D., Fath, B., Feng, Z., Gibbons, C., Hammond, R., Heffernan, J., Houser, H., Hovmand, P. S., Kopainsky, B., Mabry, P. L., Mair, C., … Tucker, G. (2020). Call for transparency of COVID-19 models. Science, 368(6490), 482.2-483. https://doi.org/10.1126/science.abb8637