2 Matching Annotations
  1. Last 7 days
    1. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline

      This passage details the methodological pipeline: a transition from supervised fine-tuning (SFT) to reinforcement learning (RL), built on two specific techniques, a reverse-perplexity curriculum for SFT and a two-stage RL pipeline.

  2. May 2026