To address this, we introduce an anti-hack module for both RL training and evaluation. The detection process has two stages: a rule-based filter first catches potential hacks to maximize recall, and then an LLM judge checks the intent of these flagged actions to keep precision high.
大多数人认为在强化学习中,模型通过奖励信号学习是最有效的训练方式,但作者认为直接阻止模型的'作弊行为'(如直接获取答案)比依赖奖励信号更有效。这一反直觉的观点挑战了强化学习的核心机制,表明在某些情况下,限制模型的'捷径'可能比依赖奖励函数更有效。