Confidently wrong answers are penalized. So are unnecessarily uncertain correct ones.
By penalizing both, the RLCR method encourages the model to express its uncertainty.
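One reward with this property can be sketched as a correctness term minus a Brier-style squared penalty on the stated confidence. The text does not give a formula, so this exact form is an assumption, and the function name `calibration_reward` and its parameters are hypothetical, chosen only for illustration.

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """Illustrative reward: correctness minus a squared calibration penalty.

    Assumption: the reward form below is a sketch, not a formula stated in
    the text. `confidence` is the model's stated probability of being right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    # The penalty grows as the stated confidence moves away from the outcome:
    # large for a confident wrong answer, and also for an unsure correct one.
    return y - (confidence - y) ** 2
```

Under this sketch, a wrong answer at confidence 0.9 scores about -0.81 while the same wrong answer at 0.1 scores about -0.01, and a correct answer at 0.1 scores about 0.19 while the same correct answer at 0.9 scores about 0.99. Both kinds of miscalibration cost reward.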