Hypothesis

1 Matching Annotations

May 2026
arxiv.org

https://arxiv.org/abs/2605.06445

1. fxp007 24 May 2026
  
  in Public
  
  Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero.
  
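  Most people would probably assume that more capable LLM configurations still perform relatively well under strict constraints, but the study shows that even the best configurations drop by 30 percentage points on average, which challenges our understanding of how well LLMs adapt.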
  
  non-consensus performance-decline llm-robustness
