17 Matching Annotations
  1. Apr 2026
    1. METR pays human programmers a minimum of $50 per hour, so getting a baseline for a single 160-hour task would cost at least $8,000.

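      A human baseline for a single test task costs as much as $8,000. This figure exposes a badly underestimated physical constraint on AI evaluation: measuring AI capability takes a great deal of human labor, and as AI capability extends toward "month-scale tasks," the cost of establishing reliable baselines will grow superlinearly. The more fundamental problem is that it is hard to get a capable programmer to spend weeks on a "test task," even for generous pay. The availability of human evaluators will become the real ceiling on AI capability evaluation.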

    1. Our human task duration estimates likely overestimate how long a human expert takes to complete these tasks, as the humans (and AI agents!) have much less context for the task than professionals doing equivalent work in their day-to-day job.

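      METR openly acknowledges that its human baseline times may be overestimated, because the humans in the experiment, like the AI agents, work in a low-context "novice" state rather than as professionals familiar with the project. This means the human ability corresponding to a "2-hour time horizon" is closer to that of an outsourced contractor with no background knowledge than to that of an experienced full-time engineer. The real gap between AI and "professionals with context" is much larger than the time-horizon figures suggest.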

    1. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts.

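      [Insight] The "functional emotions" framing suggests a new way to look at AI product design: if emotions are real drivers of behavior, then designing an AI product's "personality" is not just writing a System Prompt but shaping an emotion regulation system. For designers of AI hardware and assistant products, this means a model's "emotional baseline" could in future be tuned like a mixing console: calmer for a meeting assistant, warmer for a study companion, more excited for a creative tool.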

  2. May 2025
    1. root@51a758d136a2:~/test/test-project# npx prisma migrate diff --from-empty --to-schema-datamodel prisma/schema.prisma --script > migration.sql
       root@51a758d136a2:~/test/test-project# cat migration.sql
       -- CreateTable
       CREATE TABLE "test" (
           "id" SERIAL NOT NULL,
           "val" INTEGER,
           CONSTRAINT "test_pkey" PRIMARY KEY ("id")
       );
       root@51a758d136a2:~/test/test-project# mkdir -p prisma/migrations/initial
       root@51a758d136a2:~/test/test-project# mv migration.sql prisma/migrations/initial/
  3. Feb 2025
  4. Feb 2022
    1. Deepti Gurdasani. (2022, January 30). Have tried to now visually illustrate an earlier thread I wrote about why prevalence estimates based on comparisons of “any symptom” between infected cases, and matched controls will yield underestimates for long COVID. I’ve done a toy example below here, to show this 🧵 [Tweet]. @dgurdasani1. https://twitter.com/dgurdasani1/status/1487578265187405828

  5. Dec 2021
  6. Oct 2021
  7. Jul 2020
    1. Seow, J., Graham, C., Merrick, B., Acors, S., Steel, K. J. A., Hemmings, O., O’Bryne, A., Kouphou, N., Pickering, S., Galao, R., Betancor, G., Wilson, H. D., Signell, A. W., Winstone, H., Kerridge, C., Temperton, N., Snell, L., Bisnauthsing, K., Moore, A., … Doores, K. (2020). Longitudinal evaluation and decline of antibody responses in SARS-CoV-2 infection. MedRxiv, 2020.07.09.20148429. https://doi.org/10.1101/2020.07.09.20148429

  8. Dec 2019
    1. While I wanted to do my best to not judge how I was spending my time during the experiment—to just track it as it is and analyze at the end—I did want to have a baseline to compare my results to. This wasn't a hypothesis of how I spend my time, but more of a vision for how I would like my time to be allocated.
  9. May 2018