7 Matching Annotations
  1. Apr 2026
    1. We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.

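      The study uses four different AI capability metrics, which strengthens the reliability of its results. Each metric measures AI capability along a different dimension: general capability (ECI), time horizon (METR), mathematical ability (Combined Math), and performance in specific settings (WeirdML). Using multiple metrics reduces the risk of bias from any single one.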

    1. ADeLe scores tasks across 18 core abilities, such as attention, reasoning, domain knowledge, and assigns each task a value from 0 to 5 based on how much it requires each ability.

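      Surprising: the ADeLe framework evaluates tasks across 18 core abilities, including attention, reasoning, and domain knowledge, and scores each task from 0 to 5 on each ability. This multidimensional approach reveals details that traditional AI evaluation overlooks, letting researchers understand more precisely the complex relationship between task difficulty and model capability.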

    1. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along dual-axis: S-axis and V-axis.

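      Most people assume AI evaluation can be done with simple automated tests. The authors instead argue that it requires elaborate dual-axis (S-axis and V-axis) human reference trajectories and sandboxed environment support, which suggests that evaluating AI agent capability is far more complex than the industry commonly recognizes. This view challenges the reductionist tendency in AI evaluation and stresses that human involvement in evaluation is irreplaceable.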

  2. Jan 2023
  3. Sep 2021
    1. The first criterion of adequacy in this approach is that the active voice of the subject should be heard

      Is the interpretation adequate? Criteria for answering the question of adequacy are outlined: 1) the interpretation must not objectify the subject; 2) the theoretical underpinning must allow for interpretation of the social dynamic between observer and subject; 3) the theoretical reworking has to allow for the revelation of underlying social structures.

  4. May 2021
  5. Jul 2020