OneRuler benchmark co-author: we did not show at all that Polish is the best language for prompting
- Media outlets circulated the claim that Polish is the best language for prompting, but this was not a conclusion of the OneRuler study.
- OneRuler is a multilingual benchmark testing how well language models process very long texts in 26 languages.
- On average, models performed best with Polish, but the differences relative to English were small and the study did not explain them.
- Polish media prematurely concluded that Polish is the best language for prompting, a claim the study's authors neither made nor investigated.
- The benchmark tested models on finding specific sentences hidden in long texts, much like using CTRL+F, a built-in capability that AI models lack (a simplified sketch of this kind of task appears after this list).
- Another task involved listing the most frequent words in a book; models also often failed when they were expected to acknowledge that an answer was not present.
- Performance likely dropped because the task required understanding the full context, not just searching the text.
- A different book was used for each language (e.g., "Noce i dnie" for Polish, "Little Women" for English), which affects the fairness of cross-language comparisons.
- The books were chosen because their copyrights had expired, and this choice influenced the results.
- Given these multiple confounding factors, the benchmark provides no conclusive evidence that Polish is superior for prompting.
- No model achieved 100% accuracy, a reminder of language models' limitations; their outputs should be verified.
- Researchers advise caution especially when using language models for sensitive or private documents.
- The OneRuler study was reviewed and presented at the CoLM 2025 conference.
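For readers curious how a CTRL+F-style retrieval task with a "not present" option can be set up and scored, below is a minimal Python sketch. It is an illustrative assumption, not the actual OneRuler implementation: the needle sentence, the `build_prompt` and `score` helpers, and the `query_model` call in the usage comment are all hypothetical.

```python
import random

# Minimal sketch of a needle-in-a-haystack style check (hypothetical, not the
# actual OneRuler code): a "needle" sentence is hidden inside a long text and
# the model is asked to retrieve it, with a variant where no needle exists and
# the expected answer is "none".

NEEDLE = "The special magic number mentioned in the text is 7421."

def build_prompt(book_text: str, insert_needle: bool) -> str:
    """Optionally insert the needle at a random position, then ask about it."""
    if insert_needle:
        words = book_text.split()
        pos = random.randint(0, len(words))
        haystack = " ".join(words[:pos] + [NEEDLE] + words[pos:])
    else:
        haystack = book_text
    return (
        f"{haystack}\n\n"
        "What is the special magic number mentioned in the text? "
        "If it is not mentioned, answer 'none'."
    )

def score(model_answer: str, needle_present: bool) -> bool:
    """Exact-match style scoring: the number when present, 'none' otherwise."""
    answer = model_answer.strip().lower()
    return ("7421" in answer) if needle_present else (answer == "none")

# Usage (query_model is a placeholder for whatever model API is being tested):
# prompt = build_prompt(open("book.txt", encoding="utf-8").read(), insert_needle=False)
# print(score(query_model(prompt), needle_present=False))
```

The "none" variant is the interesting part: a model that always guesses a plausible-looking answer scores well when the needle is present but fails precisely the cases where honesty about missing information is required.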