WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.
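Most people might assume that all AI evaluation metrics reflect the same trend in progress, but the research found that the WeirdML V2 metric shows no acceleration, because it imposes a resource-constrained environment that recent reinforcement learning training has not focused on. This suggests that measured AI progress may depend on the evaluation method used.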