2 Matching Annotations
  1. Last 7 days
    1. The 100:1 loss trick. In a 33-token sequence, only 2 positions change per step, so a model that simply copies its input scores ~94% token accuracy while learning nothing. Without adjusting the loss accordingly (nothing exotic, just weighting output tokens differently), copying is the easy optimum; up-weighting the positions that actually change by 100× forces the model to learn the computation we want it to learn.

      Most people assume all output positions should be treated equally during training, but the author found that assigning 100× weight to the output positions that actually change forces the model to learn the computation rather than simply copy. This challenges the standard training recipe and suggests that loss-function design can matter more than architecture choice.

  2. May 2026
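
The 100:1 trick from the first annotation can be sketched as a per-position weighted cross-entropy. This is a minimal NumPy illustration, not the author's actual code; the function name, shapes, and the choice of normalizing by the weight sum are assumptions.

```python
import numpy as np

def weighted_ce_loss(logits, targets, changed_mask, weight=100.0):
    """Cross-entropy where positions flagged in `changed_mask` (the ones
    that actually change this step) get `weight`; all others get 1.0.
    Hypothetical sketch: logits is (seq_len, vocab), targets is (seq_len,).
    """
    # Numerically stable log-softmax over the vocab axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target token at each position.
    nll = -log_probs[np.arange(len(targets)), targets]
    # 100x weight on changed positions, 1x on copy positions.
    w = np.where(changed_mask, weight, 1.0)
    return (w * nll).sum() / w.sum()
```

With 2 changed positions out of 33, a copier that is right everywhere except the changed positions still gets 31/33 ≈ 94% token accuracy, but under this loss the 2 changed positions carry 2·100 / (2·100 + 31) ≈ 87% of the gradient signal, so copying no longer pays.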