tuning a standalone evaluator to be skeptical turns out to be far more tractable
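This sharply exposes the limits of LLM self-evaluation: a generator struggles to stay critical of its own work. Decoupling generation from evaluation, and deliberately tuning a standalone evaluator toward skepticism, effectively breaks the closed loop of AI self-congratulation. This adversarial architecture is a powerful lever for improving output quality.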
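A minimal sketch of what that decoupling could look like, assuming the generator and the evaluator are two separate callables (in practice, two separately prompted or tuned model calls). Every name here (`Verdict`, `refine`, `toy_generate`, `skeptical_evaluate`) is hypothetical and for illustration only; the toy functions stand in for real model calls and are not taken from the quoted source.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    critique: str


# Hypothetical signatures: each would wrap a separate model call in practice.
Generator = Callable[[str, Optional[str]], str]  # (task, critique) -> draft
Evaluator = Callable[[str, str], Verdict]        # (task, draft) -> verdict


def refine(task: str, generate: Generator, evaluate: Evaluator,
           max_rounds: int = 3) -> Tuple[str, List[Verdict]]:
    """Alternate generation and independent evaluation.

    Stops when the evaluator passes a draft or max_rounds is reached.
    The generator never grades its own work; it only sees the critique.
    """
    critique: Optional[str] = None
    draft = ""
    history: List[Verdict] = []
    for _ in range(max_rounds):
        draft = generate(task, critique)
        verdict = evaluate(task, draft)
        history.append(verdict)
        if verdict.passed:
            break
        critique = verdict.critique
    return draft, history


# Toy stand-ins (assumptions, not real models).
def toy_generate(task: str, critique: Optional[str]) -> str:
    if critique is None:
        return f"Answer to: {task}"
    return f"Answer to: {task} (with evidence)"


def skeptical_evaluate(task: str, draft: str) -> Verdict:
    # "Skeptical" here means: reject by default unless evidence is present.
    if "evidence" in draft:
        return Verdict(True, "")
    return Verdict(False, "No supporting evidence cited.")


draft, history = refine("summarize the result", toy_generate, skeptical_evaluate)
print(draft)
print([v.passed for v in history])
```

The design choice the sketch tries to show is that skepticism lives entirely in the evaluator: tightening `skeptical_evaluate` changes how hard drafts are pushed without touching the generator at all.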
Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
Them's fighten' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it's from their empirical results; very worth a look. I suspect that the amount of implicit knowledge in the papers, text, and DAG is helping to do this.
The Big Question: is their comparison to RL baselines fair? Are the baselines being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when it is compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch?
To be a successful physicist requires mastering how to make all 29 decisions, but the reflection decisions (decisions 23–26) are arguably the most difficult to learn.
Of the 29 problem-solving decisions identified as important, the "reflection decisions" (23–26 in the list) may be the most difficult to learn, as they require metacognition and self-evaluation.
Spreckelsen, P. von, Wessel, I., Glashouwer, K., & Jong, P. J. de. (2020). Averting Repulsion? Body-Directed Self-Disgust and Autobiographical Memory Retrieval [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/qhc35
Midgley, C., Thai, S., Lockwood, P., Kovacheff, C., & Page-Gould, E. (2020). When Every Day is a High School Reunion: Social Media Comparisons and Self-Esteem [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/zmy29