1 Matching Annotations
  1. Last 7 days
    1. OneRuler benchmark co-author: we did not show at all that Polish is the best language for prompting
      • Media circulated the claim that Polish is the best language for prompting, but this was not a conclusion of the OneRuler study.
      • OneRuler is a multilingual benchmark testing how well language models process very long texts in 26 languages.
      • On average, models performed best with Polish, but the differences relative to English were small and left unexplained.
      • Polish media prematurely concluded that Polish is best for prompting, a claim the study's authors neither made nor investigated.
      • The benchmark tested models on finding specific sentences in long texts, akin to CTRL+F, a capability language models do not natively have.
      • Another task involved listing the most frequent words in a book; models often failed when the correct response was to acknowledge that no answer was present.
      • Performance likely dropped because this task required understanding the full context, not just searching the text.
      • Different books were used per language (e.g. "Noce i dnie" for Polish, "Little Women" for English), which undermines the fairness of cross-language comparisons.
      • Book selection was driven by expired copyrights (public-domain availability), which itself influenced the results.
      • Given these multiple confounding factors, the benchmark provides no conclusive evidence that Polish is superior for prompting.
      • No model achieved 100% accuracy, serving as a caution about language models' limitations; outputs should be verified.
      • Researchers advise caution especially when using language models for sensitive or private documents.
      • The OneRuler study was peer-reviewed and presented at the CoLM 2025 conference.
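The retrieval-style task described above (CTRL+F-like search plus the "answer 'none' if absent" variant) can be sketched as a minimal evaluation harness. This is a simplified illustration, not the actual OneRuler code; all function names and the prompt wording are hypothetical.

```python
import random

def build_needle_prompt(haystack_paragraphs, needle, seed=0):
    """Insert a 'needle' sentence at a random position in filler text and
    build a retrieval prompt, in the spirit of the CTRL+F-style task
    (a simplified sketch, not the actual OneRuler setup)."""
    rng = random.Random(seed)
    paragraphs = list(haystack_paragraphs)
    pos = rng.randrange(len(paragraphs) + 1)
    paragraphs.insert(pos, needle)
    context = "\n\n".join(paragraphs)
    question = ("A sentence about a 'magic number' is hidden in the text. "
                "Quote it exactly, or answer 'none' if it is absent.")
    return f"{context}\n\n{question}"

def score_answer(model_answer, needle, needle_present):
    """Exact-match scoring: the model must quote the needle when it is
    present, and must say 'none' when it is absent (the failure mode the
    study highlighted)."""
    if needle_present:
        return needle.strip() in model_answer
    return model_answer.strip().lower() == "none"
```

Scoring the "absent answer" case separately makes the reported failure mode measurable: a model that always quotes something scores zero whenever `needle_present` is false.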