OneRuler benchmark co-author: we did not show at all that Polish is the best language for prompting
- Media outlets circulated the claim that Polish is the best language for prompting, but this was not a conclusion of the OneRuler study.
- OneRuler is a multilingual benchmark testing how well language models process very long texts in 26 languages.
- On average, models performed best with Polish, but the differences relative to English were small and the study did not explain them.
- Polish media prematurely concluded that Polish is the best language for prompting, a claim the study's authors neither made nor investigated.
- The benchmark tested models on finding specific sentences hidden in long texts, much like using CTRL+F, a built-in capability that AI models lack (a simplified sketch of this kind of task appears after this list).
- Another task involved listing the most frequent words in a book; models also often failed when they were expected to acknowledge that an answer was not present.
- Performance likely dropped because the task required understanding the full context, not just searching the text.
- A different book was used for each language (e.g., "Noce i dnie" for Polish, "Little Women" for English), which affects the fairness of cross-language comparisons.
- The books were chosen because their copyrights had expired, and this choice influenced the results.
- Given these multiple confounding factors, the benchmark provides no conclusive evidence that Polish is superior for prompting.
- No model achieved 100% accuracy, a reminder of language models' limitations; their outputs should be verified.
- Researchers advise caution especially when using language models for sensitive or private documents.
- The OneRuler study was reviewed and presented at the CoLM 2025 conference.
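For readers curious how a CTRL+F-style retrieval task with a "not present" option can be set up and scored, below is a minimal Python sketch. It is an illustrative assumption, not the actual OneRuler implementation: the needle sentence, the `build_prompt` and `score` helpers, and the `query_model` call in the usage comment are all hypothetical.

```python
import random

# Minimal sketch of a needle-in-a-haystack style check (hypothetical, not the
# actual OneRuler code): a "needle" sentence is hidden inside a long text and
# the model is asked to retrieve it, with a variant where no needle exists and
# the expected answer is "none".

NEEDLE = "The special magic number mentioned in the text is 7421."

def build_prompt(book_text: str, insert_needle: bool) -> str:
    """Optionally insert the needle at a random position, then ask about it."""
    if insert_needle:
        words = book_text.split()
        pos = random.randint(0, len(words))
        haystack = " ".join(words[:pos] + [NEEDLE] + words[pos:])
    else:
        haystack = book_text
    return (
        f"{haystack}\n\n"
        "What is the special magic number mentioned in the text? "
        "If it is not mentioned, answer 'none'."
    )

def score(model_answer: str, needle_present: bool) -> bool:
    """Exact-match style scoring: the number when present, 'none' otherwise."""
    answer = model_answer.strip().lower()
    return ("7421" in answer) if needle_present else (answer == "none")

# Usage (query_model is a placeholder for whatever model API is being tested):
# prompt = build_prompt(open("book.txt", encoding="utf-8").read(), insert_needle=False)
# print(score(query_model(prompt), needle_present=False))
```

The "none" variant is the interesting part: a model that always guesses a plausible-looking answer scores well when the needle is present but fails precisely the cases where honesty about missing information is required.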