The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline.
This section details the methodological pipeline, emphasizing the transition from supervised fine-tuning (SFT) to reinforcement learning (RL) and the specific techniques used: a reverse-perplexity curriculum for SFT and a two-stage RL pipeline.
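The ordering step of a perplexity-based curriculum can be sketched as follows. This is a minimal illustration, not the recipe's actual implementation. The text does not define "reverse-perplexity", so the sketch assumes it means presenting SFT examples from highest to lowest perplexity under some reference model. The function names, the example ids, and the per-token log-probabilities are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one example: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(
    examples: List[Tuple[str, Sequence[float]]]
) -> List[str]:
    """Return example ids sorted from highest to lowest perplexity.

    ASSUMPTION: "reverse" is read here as descending perplexity; the source
    does not specify the direction or the reference model.
    """
    scored = [(perplexity(logprobs), ex_id) for ex_id, logprobs in examples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex_id for _, ex_id in scored]


# Hypothetical toy data: (example id, per-token natural-log probabilities
# that a reference model assigned to the target tokens).
toy_examples = [
    ("easy", [-0.1, -0.2, -0.1]),
    ("medium", [-1.0, -0.8, -1.2]),
    ("hard", [-2.5, -3.0, -2.0]),
]

curriculum = reverse_perplexity_order(toy_examples)
print(curriculum)
```

In practice the log-probabilities would come from scoring each SFT target with a reference model. If the recipe intends the opposite direction, only the `reverse=True` flag changes.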