7 Matching Annotations
  1. Apr 2026
    1. WeirdML V2 places models in an unusually resource-constrained environment: models get only five attempts to submit working code, with no access to external tools. This setup has not been the focus of recent RL training.

      Most people might assume every AI evaluation metric reflects the same trend of progress, but the study found that the WeirdML V2 metric shows no acceleration, because it sets up a resource-constrained environment that recent RL training has not targeted. This suggests that measured AI progress can depend heavily on the evaluation method.

    1. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories

      Most people assume AI research datasets are static, one-off collections, but the authors propose the concept of a 'living dataset', arguing that data must be continually updated to reflect real usage. This challenges the traditional reliance on static benchmarks in AI evaluation and makes the case for dynamic, continuous data collection.

    1. Add screenshot-based LLM judge evaluator, screenshot collector, and --parallelize flag

      Introducing a screenshot-based LLM judge and parallelization is a surprisingly useful innovation. Evaluating AI models through screenshots gives a more direct window into their visual understanding during automation, while the parallelization flag greatly speeds up benchmarking. Together these represent a notable advance in how AI systems are evaluated.

  2. Oct 2023
    1. Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.

      Them's fightin' words!

      I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it comes from their empirical results; very worth a look. I suspect the amount of implicit knowledge in the paper's text and the DAG is helping them do this.

      The Big Question: is their comparison to RL baselines fair? Are those baselines trained from scratch? And what does a fair comparison of any from-scratch model (RL or supervised) against an LLM approach (or any approach built on a foundation model) even mean, when that model is not really from scratch?

  3. Jun 2023
    1. The Bloom filters were constructed such that the false positive rate is upper bounded by 1/10^8. We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter

      Bloom filters are used to measure how much overlap there is between the train and test sets, so the authors can be more confident in their results.
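
      The quote doesn't show the construction, but the overlap check can be sketched with a standard Bloom filter. This is a minimal illustration, not the paper's implementation: the sizing formulas are the textbook ones, and the toy data and the 1e-8 false-positive target are placeholders.

      ```python
      import hashlib
      import math

      class BloomFilter:
          """Simple Bloom filter: k hash positions per item in an m-bit array,
          sized so the false-positive rate stays below a target fp_rate."""

          def __init__(self, n_items, fp_rate):
              # Standard sizing: m = -n*ln(eps)/(ln 2)^2, k = (m/n)*ln 2
              self.m = max(1, int(-n_items * math.log(fp_rate) / (math.log(2) ** 2)))
              self.k = max(1, round((self.m / n_items) * math.log(2)))
              self.bits = bytearray((self.m + 7) // 8)

          def _positions(self, item):
              # Double hashing: derive k bit positions from one SHA-256 digest
              h = hashlib.sha256(item.encode("utf-8")).digest()
              a = int.from_bytes(h[:8], "big")
              b = int.from_bytes(h[8:16], "big")
              for i in range(self.k):
                  yield (a + i * b) % self.m

          def add(self, item):
              for idx in self._positions(item):
                  self.bits[idx // 8] |= 1 << (idx % 8)

          def __contains__(self, item):
              return all(self.bits[idx // 8] & (1 << (idx % 8))
                         for idx in self._positions(item))

      # Build the filter over the training strings, then flag any test string
      # the filter claims to contain as potential train/test overlap.
      train = ["the quick brown fox", "jumps over", "the lazy dog"]
      test = ["jumps over", "a novel test string"]

      bf = BloomFilter(n_items=len(train), fp_rate=1e-8)
      for s in train:
          bf.add(s)

      overlap = [s for s in test if s in bf]
      ```

      A membership hit can be a false positive (never a false negative), which is why the paper sanity-checks the rate by probing with 1M fresh strings: with the rate bounded by 1/10^8, finding zero hits is the expected outcome.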