1 Matching Annotations
  1. Last 7 days
    1. existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions.

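      Most people assume that existing evaluations of LLM code generation are already comprehensive enough, but the authors point out that current benchmarks ignore non-functional requirements and reward only solutions that are functionally correct but structurally arbitrary. This challenges the adequacy of current evaluation methods.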