Hypothesis

Historically, AI evaluation has leaned toward the forest approach. Most researchers settle for 1 to 5 raters per item, assuming this is enough to find a single 'correct' truth.

大多数人认为AI评估领域的现状是合理的，因为1-5名评估者足以找到单一'正确'真相，但作者指出这种假设忽视了人类评估中的自然分歧。这一批判挑战了AI评估领域的现状，暗示当前许多研究结论可能基于不充分的数据收集方法，需要重新审视评估方法的可靠性。

non-consensus ai-reproducibility status-quo

Tags

Annotators

URL