By handling the specific invalid behavior instead of rejecting the entire trajectory, this approach helps prevent the training instability and model collapse that can happen when rollouts are abruptly stopped.
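Most people assume that when undesirable behavior is detected during AI training, the entire training trajectory should be terminated immediately. The author argues instead for handling the specific invalid behavior rather than rejecting the whole trajectory. This view challenges the 'one-size-fits-all' approach to AI training and suggests that finer-grained behavior management can prevent training instability and model collapse, thereby improving training efficiency.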
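
The contrast between the two strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's actual method: the `Step` type, its `valid` flag, the function names, and the choice to "handle" an invalid step by assigning a fixed penalty reward and masking it out of the loss are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    action: str
    reward: float
    valid: bool  # hypothetical flag, assumed to come from some validity checker


Trajectory = List[Step]


def reject_invalid_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """All-or-nothing baseline: drop any trajectory that contains an invalid step."""
    return [traj for traj in batch if all(step.valid for step in traj)]


def handle_invalid_steps(
    batch: List[Trajectory], penalty: float = -1.0
) -> List[List[Tuple[float, float]]]:
    """Keep every trajectory and return (reward, loss_mask) pairs per step.

    Assumption: an invalid step gets a fixed penalty reward and a loss mask of 0.0,
    so it is excluded from the policy loss while the valid steps still train.
    """
    handled = []
    for traj in batch:
        handled.append(
            [(s.reward, 1.0) if s.valid else (penalty, 0.0) for s in traj]
        )
    return handled


# Toy batch: one clean trajectory, one with a single invalid step in the middle.
batch = [
    [Step("search", 0.5, True), Step("answer", 1.0, True)],
    [Step("search", 0.5, True), Step("bad_call", 0.0, False), Step("answer", 1.0, True)],
]

# Count how many steps remain usable for training under each strategy.
kept_by_rejection = sum(len(t) for t in reject_invalid_trajectories(batch))
kept_by_handling = sum(mask for traj in handle_invalid_steps(batch) for _, mask in traj)
print(kept_by_rejection, kept_by_handling)
```

On this toy batch, rejection keeps only the 2 steps of the clean trajectory, while step-level handling keeps 4 of the 5 steps. That preserved training signal is the kind of effect the passage points to, though whether it improves stability in practice depends on the actual training setup.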