Hypothesis

41 Matching Annotations

Jun 2026
www.qualcomm.com www.qualcomm.com

https://www.qualcomm.com/news/releases/2026/06/qualcomm-unveils-comprehensive-data-center-roadmap-for-the-agent

1
1. fxp007 26 Jun 2026
  
  in Public
  
  AI300 with HBC Gen 2 is designed to enable another stepwise improvement with a 54x increase over AI200
  
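  Most people assume AI chip performance improves incrementally, at roughly 20-30% per year, but Qualcomm claims its AI300 delivers a 54x memory-bandwidth increase over the preceding AI200. A jump of this magnitude runs counter to conventional industry expectations and suggests AI infrastructure may be undergoing a paradigm shift.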
  
  non-consensus ai-performance counterintuitive
Visit annotations in context

Tags

counterintuitive

ai-performance

non-consensus

Annotators

fxp007

URL

qualcomm.com/news/releases/2026/06/qualcomm-unveils-comprehensive-data-center-roadmap-for-the-agent
huggingface.co huggingface.co

https://huggingface.co/blog/zai-org/glm-52-blog

1
1. fxp007 17 Jun 2026
  
  in Public
  
  On Terminal-Bench 2.1 (81.0) it lands within a few points of Claude Opus 4.8 (85.0) — while staying ahead of Gemini 3.1 Pro.
  
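  Most people assume a wide gap separates open-source models from top closed-source ones, but the author argues that GLM-5.2 comes close to Claude Opus 4.8 on Terminal-Bench and even surpasses Gemini 3.1 Pro. This challenges the industry consensus that closed-source models are far ahead, and indicates that open-source models can already compete with top commercial models on specific coding tasks.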
  
  non-consensus ai-performance coding-benchmarks
Visit annotations in context

Tags

coding-benchmarks

ai-performance

non-consensus

Annotators

fxp007

URL

huggingface.co/blog/zai-org/glm-52-blog
www.tomtunguz.com www.tomtunguz.com

https://www.tomtunguz.com/inflation-deflation-ai/

1
1. fxp007 09 Jun 2026
  
  in Public
  
  Open-source models have crossed the good enough threshold for most use cases
  
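  The mainstream view holds that closed-source models always outperform open-source ones, but the author argues that open-source models have reached a 'good enough' level. This challenges the value proposition of commercial AI models and suggests open source could become the mainstream choice for enterprise applications.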
  
  non-consensus open-source ai-performance
Visit annotations in context

Tags

open-source

ai-performance

non-consensus

Annotators

fxp007

URL

tomtunguz.com/inflation-deflation-ai/
www.anthropic.com www.anthropic.com

https://www.anthropic.com/news/claude-fable-5-mythos-5

1
1. fxp007 09 Jun 2026
  
  in Public
  
  Claude Fable 5 is the first to break 90% on our core analytics benchmark of complex, long-running analytical tasks — a 10-point jump over Opus. On the hardest questions, it shows strong judgment and attention to nuance.
  
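  Most people expect performance gains on complex reasoning tasks to be incremental, but the author presents Fable 5 as a qualitative leap that breaks the key 90% threshold outright. This challenges linear expectations of AI progress and hints at a nonlinear pattern in which crossing a capability threshold yields a marked jump in performance.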
  
  non-consensus ai-performance breakthrough
Visit annotations in context

Tags

breakthrough

ai-performance

non-consensus

Annotators

fxp007

URL

anthropic.com/news/claude-fable-5-mythos-5
www.latent.space www.latent.space

https://www.latent.space/p/ainews-frontiercode-benchmarking

2
1. fxp007 09 Jun 2026
  
  in Public
  
  Even with extended thinking time (10,000 tokens), Python access, and the ability to run experiments, success rates remained below 2%—compared to over 90% on traditional benchmarks.
  
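  Most people assume advanced AI models already solve programming problems well, because traditional benchmarks show high success rates. FrontierCode reveals a surprising result: even with more resources and thinking time, models' success rates on genuinely hard programming tasks remain extremely low, which shows that programming is far from 'solved'.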
  
  counterintuitive ai-performance benchmarking
2. fxp007 09 Jun 2026
  
  in Public
  
  The headline result is that the best model, Opus 4.8, scores only about 13% on the hardest subset—far below the 50%+ regime common on SWE-Bench-style evals
  
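  Most people believe AI coding ability is near or beyond human level, but the author points out that even the most advanced model scores far lower on this code-quality evaluation than on traditional benchmarks, which suggests programming is far from solved. This finding challenges the common perception that AI coding ability has matured.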
  
  counterintuitive ai-capabilities coding-performance
Visit annotations in context

Tags

benchmarking

ai-capabilities

counterintuitive

coding-performance

ai-performance

Annotators

fxp007

URL

latent.space/p/ainews-frontiercode-benchmarking
www.anthropic.com www.anthropic.com

https://www.anthropic.com/research/making-claude-a-chemist

1
1. fxp007 06 Jun 2026
  
  in Public
  
  Opus 4.7 matched the experimentally reported splitting pattern more often than any other tool
  
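  Most people would expect specialized chemistry software to predict NMR peak splitting patterns more accurately than a general-purpose AI model, since that is its core function. But the author found that Claude Opus 4.7 outperformed every other tool, including specialized software, at predicting splitting patterns for proton NMR peaks. This suggests AI models may already have surpassed traditional specialized tools at understanding fine structural features in chemistry.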
  
  non-consensus pattern-recognition ai-performance
Visit annotations in context

Tags

ai-performance

pattern-recognition

non-consensus

Annotators

fxp007

URL

anthropic.com/research/making-claude-a-chemist
www.tomtunguz.com www.tomtunguz.com

https://www.tomtunguz.com/tokens-per-result/

1
1. fxp007 04 Jun 2026
  
  in Public
  
  Benchmarks are now measured on two different dimensions, the overall performance & the cost to achieve that intelligence.
  
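  Most people think AI evaluation is mainly about performance metrics, but the author argues that benchmarks are now two-dimensional: performance and cost. This challenges the industry's long-standing performance-only evaluation tradition and suggests cost efficiency will become as important a criterion as performance.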
  
  counterintuitive ai-benchmarking cost-performance
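
One way to read benchmarks along both dimensions at once is a Pareto frontier over (score, cost). The sketch below is illustrative only: the `Result` type, the model names, and every number are hypothetical and do not come from the annotated post.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str        # hypothetical model label
    score: float     # benchmark score, higher is better
    cost_usd: float  # cost to reach that score, lower is better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both score and cost."""
    frontier = []
    for r in results:
        dominated = any(
            o.score >= r.score
            and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Made-up data: model-c is both worse and pricier than model-b.
results = [
    Result("model-a", 80.0, 10.0),
    Result("model-b", 70.0, 2.0),
    Result("model-c", 65.0, 5.0),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

With these made-up numbers, model-a and model-b stay on the frontier because each wins on one dimension, while model-c drops out.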
Visit annotations in context

Tags

ai-benchmarking

cost-performance

counterintuitive

Annotators

fxp007

URL

tomtunguz.com/tokens-per-result/
May 2026
www.anthropic.com www.anthropic.com

Introducing Claude Opus 4.8

2
1. fxp007 29 May 2026
  
  in Public
  
  Opus 4.8 defaults to high effort, which we judge to be the best overall balance of quality and user experience.
  
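  Most people think AI models should pursue maximum efficiency and the fastest response, but the author holds that defaulting to 'high effort' (thinking more often and more deeply) is the best balance. This runs counter to the industry's prevailing 'speed first' mindset and implies that quality sometimes comes at the expense of efficiency.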
  
  non-consensus ai-performance counterintuitive
2. fxp007 29 May 2026
  
  in Public
  
  Opus 4.8 defaults to high effort, which we judge to be the best overall balance of quality and user experience.
  
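  Most people think AI models should pursue maximum efficiency or minimum cost, but the author holds that high effort is the best balance because it delivers better user experience and performance. This challenges the industry's mainstream emphasis on speed and efficiency and suggests the quality-versus-speed trade-off may matter more than people assume.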
  
  non-consensus ai-performance user-experience
Visit annotations in context

Tags

user-experience

counterintuitive

ai-performance

non-consensus

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-8
www.a16z.news www.a16z.news

https://www.a16z.news/p/avoiding-death-on-the-yellow-brick

1
1. fxp007 29 May 2026
  
  in Public
  
  The best agent businesses are going to need to execute like hedge funds — winning on alpha measured in customer P&L, not in benchmark scores.
  
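  This sentence uses hedge funds as a metaphor to describe what success looks like for strong AI application companies. The author argues that these companies need to earn excess returns (alpha) on customers' actual business outcomes (P&L) rather than high scores on generic benchmarks. The insight is that AI application companies should center on customers' real business value, not technical metrics.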
  
  insight ai-business-metrics performance
Visit annotations in context

Tags

performance

ai-business-metrics

insight

Annotators

fxp007

URL

a16z.news/p/avoiding-death-on-the-yellow-brick
deepmind.google deepmind.google

https://deepmind.google/blog/alphaevolve-impact/

1
1. fxp007 19 May 2026
  
  in Public
  
  achieving 10% accuracy gains over their competitive manual model optimizations
  
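  The 10% accuracy gain WPP achieved in advertising and marketing suggests that AlphaEvolve outperforms human experts at handling complex, high-dimensional marketing data. The gain could directly affect ad-campaign effectiveness and return on investment, and it illustrates AI's potential in the creative industries.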
  
  data-point marketing ai-performance
Visit annotations in context

Tags

ai-performance

marketing

data-point

Annotators

fxp007

URL

deepmind.google/blog/alphaevolve-impact/
www.thealgorithmicbridge.com www.thealgorithmicbridge.com

Weekly Top Picks #120 - The Algorithmic Bridge

1
1. fxp007 07 May 2026
  
  in Public
  
  The best AI models in the world score below 0.5% on ARC-AGI-3—is this what you call AGI, guys?
  
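  The sub-0.5% score exposes the huge capability gap between current AI models and artificial general intelligence (AGI). Such a low score indicates that, despite rapid progress, AI remains at a very early stage in genuinely understanding complex reasoning. The author uses a sarcastic tone to question the industry's overhyping of AGI progress.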
  
  data-point ai-performance agi
Visit annotations in context

Tags

agi

ai-performance

data-point

Annotators

fxp007

URL

thealgorithmicbridge.com/p/weekly-top-picks-120
simonwillison.net simonwillison.net

https://simonwillison.net/2026/Apr/30/zig-anti-ai/

1
1. fxp007 01 May 2026
  
  in Public
  
  Bun operates its own fork of Zig, and recently achieved a 4x performance improvement on Bun compile after adding 'parallel semantic analysis and multiple codegen units to the llvm backend'.
  
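  Although the Bun project has benefited from AI assistance, the Zig project maintains its anti-AI policy, which highlights the difference in values between the two projects.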
  
  performance-improvement project-values ai-assisted-programming
Visit annotations in context

Tags

project-values

performance-improvement

ai-assisted-programming

Annotators

fxp007

URL

simonwillison.net/2026/Apr/30/zig-anti-ai/
Apr 2026
www.kimi.com www.kimi.com

https://www.kimi.com/blog/kimi-k2-6

1
1. fxp007 26 Apr 2026
  
  in Public
  
  Kimi K2.6 demonstrates significant improvements over Kimi K2.5 in internal evaluations conducted by CodeBuddy: code generation accuracy increased by 12%, long-context stability improved by 18%, and tool invocation success rate reached 96.60%.
  
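  Most people expect AI model iterations to be incremental, with perhaps a 5-10% performance gain per release. But the data show Kimi K2.6 making a leap well beyond expectations, particularly with tool invocation success approaching 97%. This challenges conventional assumptions about how fast model capabilities improve and hints at some technical breakthrough or architectural innovation.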
  
  counterintuitive performance-leap ai-progress
Visit annotations in context

Tags

ai-progress

performance-leap

counterintuitive

Annotators

fxp007

URL

kimi.com/blog/kimi-k2-6
www.technologyreview.com www.technologyreview.com

https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/

1
1. fxp007 25 Apr 2026
  
  in Public
  
  DeepSeek V4 exceeds them all on coding, math, and STEM problems, making it one of the strongest open-source models ever released.
  
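  Most people believe open-source AI models cannot match closed-source commercial ones, but the author argues that DeepSeek V4 surpasses other open-source models in several key areas and is even on par with top closed-source models. This challenges the industry consensus that 'open source necessarily means a performance compromise' and suggests open-source models are rapidly closing the gap with commercial ones.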
  
  non-consensus open-source-ai performance
Visit annotations in context

Tags

performance

open-source-ai

non-consensus

Annotators

fxp007

URL

technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/
openai.com openai.com

https://openai.com/index/introducing-gpt-5-5/

1
1. fxp007 24 Apr 2026
  
  in Public
  
  GPT‑5.5 delivers this step up in intelligence without compromising on speed: larger, more capable models are often slower to serve, but GPT‑5.5 matches GPT‑5.4 per-token latency in real-world serving, while performing at a much higher level of intelligence.
  
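  Most people assume a more powerful AI model must sacrifice speed and efficiency, but the author argues that GPT-5.5 breaks this traditional trade-off, delivering higher intelligence at the same latency. This challenges the consensus that 'bigger models are necessarily slower' and suggests that architectural optimization may matter more than simply scaling up.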
  
  non-consensus ai-performance counterintuitive
Visit annotations in context

Tags

counterintuitive

ai-performance

non-consensus

Annotators

fxp007

URL

openai.com/index/introducing-gpt-5-5/
arxiv.org arxiv.org

https://arxiv.org/abs/2604.20779

1
1. fxp007 24 Apr 2026
  
  in Public
  
  despite rapidly improving capabilities, coding agents remain inefficient in natural settings
  
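  Most people assume coding assistants become more efficient as AI capabilities improve, but the study finds that AI coding agents remain inefficient in real development environments. This indicates that performance gains in lab settings do not necessarily translate into efficiency gains in actual workflows.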
  
  non-consensus ai-performance real-world-applications
Visit annotations in context

Tags

real-world-applications

ai-performance

non-consensus

Annotators

fxp007

URL

arxiv.org/abs/2604.20779
openai.com openai.com

https://openai.com/index/introducing-gpt-rosalind/

1
1. fxp007 17 Apr 2026
  
  in Public
  
  When evaluated directly in the Codex app, best-of-ten model submissions ranked above the 95th percentile of human experts on the prediction task and around the 84th percentile of human experts on the sequence generation task.
  
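  This performance figure is striking: it indicates that on certain tasks AI already outperforms 95% of human experts. It is not only a marker of technical progress but also prompts serious reflection on the role of professional scientists and the future job market.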
  
  ai-performance expertise-superiority
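
The quoted result is for "best-of-ten" submissions, so the percentile describes the strongest of ten attempts, not a typical single attempt. A minimal sketch of that idea follows. The helper names, the scoring function, and all numbers are hypothetical, and the source does not say how the best submission was selected.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the highest-scoring candidate among n attempts."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def percentile_rank(value: float, population: Sequence[float]) -> float:
    """Percentage of the population scoring strictly below `value`."""
    if not population:
        raise ValueError("population must not be empty")
    below = sum(1 for p in population if p < value)
    return 100.0 * below / len(population)


# Made-up scores: three model attempts and four human expert scores.
attempt_scores = [0.2, 0.9, 0.5]
expert_scores = [0.1, 0.5, 0.95, 0.3]

best = best_of_n(attempt_scores, score=lambda s: s)
rank = percentile_rank(best, expert_scores)
```

Here the best attempt (0.9) sits above three of four experts, while the median attempt (0.5) would rank much lower, which is why the "best-of-ten" qualifier matters when reading the headline percentile.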
Visit annotations in context

Tags

expertise-superiority

ai-performance

Annotators

fxp007

URL

openai.com/index/introducing-gpt-rosalind/
www.anthropic.com www.anthropic.com

Introducing Claude Opus 4.7

1
1. fxp007 17 Apr 2026
  
  in Public
  
  On our 93-task coding benchmark, Claude Opus 4.7 lifted resolution by 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve.
  
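  A 13% performance gain is a significant leap in AI, especially since it includes tasks the previous models could not solve at all. This suggests nonlinear growth in AI capability may have arrived, rather than simple linear progress.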
  
  performance-leap coding-ai
Visit annotations in context

Tags

coding-ai

performance-leap

Annotators

fxp007

URL

anthropic.com/news/claude-opus-4-7
every.to every.to

https://every.to/playtesting/the-market-for-making-ai-better

1
1. fxp007 17 Apr 2026
  
  in Public
  
  A small model trained on fewer than 2,000 examples from real lawyers, bankers, and consultants recently beat all but the best frontier models on corporate legal work, at a fraction of the price.
  
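  This finding challenges the 'scale and compute beat everything' paradigm of AI development. A small model trained on high-quality specialized data outperforms all but the best general-purpose frontier models in a specific domain, which suggests AI development may be shifting from 'bigger is better' to a new phase of 'more specialized, more efficient'.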
  
  ai-performance specialized-models data-quality
Visit annotations in context

Tags

specialized-models

ai-performance

data-quality

Annotators

fxp007

URL

every.to/playtesting/the-market-for-making-ai-better
epoch.ai epoch.ai

https://epoch.ai/blog/mirrorcode-preliminary-results

1
1. fxp007 17 Apr 2026
  
  in Public
  
  We see continued gains from inference scaling on larger projects, suggesting they may be solvable given enough tokens.
  
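  This finding shows a positive relationship between AI performance and inference compute, suggesting that larger compute budgets may solve more complex programming tasks. It offers an important clue about the boundaries of AI capability and raises deeper questions about how compute investment relates to capability gains.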
  
  inference-scaling compute-budget ai-performance
Visit annotations in context

Tags

compute-budget

inference-scaling

ai-performance

Annotators

fxp007

URL

epoch.ai/blog/mirrorcode-preliminary-results
blog.skypilot.co blog.skypilot.co

https://blog.skypilot.co/research-driven-agents/

1
1. fxp007 17 Apr 2026
  
  in Public
  
  Coding agents working from code alone generate shallow hypotheses. Adding a research phase — arxiv papers, competing forks, other backends — produced 5 kernel fusions that made llama.cpp CPU inference 15% faster.
  
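  This statement exposes a key limitation of AI agents in code optimization: optimizing from code alone produces shallow hypotheses. By adding a research phase, which covers reading academic papers and studying competing projects and other backend implementations, the agent could find deeper optimization opportunities and achieve a significant performance gain. This suggests AI agents need broader context to produce meaningful innovations.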
  
  ai-optimization research-phase performance-gain
Visit annotations in context

Tags

research-phase

performance-gain

ai-optimization

Annotators

fxp007

URL

blog.skypilot.co/research-driven-agents/
aphyr.com aphyr.com

https://aphyr.com/posts/419-the-future-of-everything-is-lies-i-guess-new-jobs

1
1. fxp007 17 Apr 2026
  
  in Public
  
  A healthcare LLM might be highly accurate for queries in English, but perform abominably when those same questions are presented in Spanish.
  
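  This example reveals how sensitive AI system performance is to culture and language, a surprising but important observation. It shows that an AI system's 'accuracy' can depend heavily on context, which challenges our assumption that AI applies universally. Such disparities could reinforce the existing digital divide and call for more culturally sensitive AI evaluation frameworks.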
  
  ai-bias cultural-sensitivity performance-variability
Visit annotations in context

Tags

cultural-sensitivity

performance-variability

ai-bias

Annotators

fxp007

URL

aphyr.com/posts/419-the-future-of-everything-is-lies-i-guess-new-jobs
x.com x.com

https://x.com/billtheinvestor/status/2043706042828394747

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Performance: dev-browser: 3m53s, $0.88, 100% success rate — beats MCP configs, Chrome extensions, 'browser skill' stacks.
  
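  Surprisingly, this new approach not only surpasses traditional methods in functionality but also shows a clear advantage on performance metrics. The 100% success rate and relatively low cost indicate technical maturity and practicality, and this could quickly make existing browser-automation solutions obsolete.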
  
  surprising performance cost-efficiency ai-automation
Visit annotations in context

Tags

cost-efficiency

ai-automation

performance

surprising

Annotators

fxp007

URL

x.com/billtheinvestor/status/2043706042828394747
www.xiaohu.ai www.xiaohu.ai

https://www.xiaohu.ai/c/xiaohu-ai/glm-5v-turbo

1
1. fxp007 16 Apr 2026
  
  in Public
  
  GLM-5V-Turbo scored 94.8; Claude Opus 4.6 scored 77.3. The gap is not small.
  
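  Surprisingly, in a test of turning UI design mockups into code, GLM-5V-Turbo's score (94.8) is well ahead of Claude Opus 4.6's (77.3). This indicates a striking advantage in visual coding, a lead of about 17.5 points, and a gap this large is very rare in AI model comparisons.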
  
  surprising ai-performance coding-models
Visit annotations in context

Tags

coding-models

ai-performance

surprising

Annotators

fxp007

URL

xiaohu.ai/c/xiaohu-ai/glm-5v-turbo
www.technologyreview.com www.technologyreview.com

https://www.technologyreview.com/2026/04/08/1135398/mustafa-suleyman-ai-future/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Where training a language model took 167 minutes on eight GPUs in 2020, it now takes under four minutes on equivalent modern hardware. To put this in perspective: Moore's Law would predict only about a 5x improvement over this period. We saw 50x.
  
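  Surprisingly, AI model training speed improved roughly 50x in six years, far beyond the 5x Moore's Law would predict. The gain comes not only from hardware improvements but also from software optimization and algorithmic innovation. This upends conventional views of the pace of technological progress and shows the AI field's distinctive pattern of accelerating development.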
  
  surprising ai-performance hardware-improvement
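
The arithmetic behind the quoted figures can be checked with a short sketch. It assumes a six-year window (2020 to 2026) and treats "under four minutes" as an upper bound of exactly four minutes; the function names are illustrative and not from the article.

```python
import math


def speedup(old_minutes: float, new_minutes: float) -> float:
    """Ratio of the old training time to the new training time."""
    return old_minutes / new_minutes


def doubling_time_years(factor: float, years: float) -> float:
    """Doubling period implied by an overall `factor` gain over `years`."""
    return years * math.log(2) / math.log(factor)


# 167 minutes down to 4 minutes is a lower bound on the speedup,
# because the quote says "under four minutes".
lower_bound = speedup(167, 4)

# Training time that would correspond to exactly 50x.
minutes_for_50x = 167 / 50

# Implied doubling periods over the assumed six-year window.
moore_like = doubling_time_years(5, 6)   # the quoted ~5x baseline
observed = doubling_time_years(50, 6)    # the quoted 50x
```

Under these assumptions the four-minute bound gives at least about 42x, and 50x corresponds to a run of roughly 3.3 minutes. A 5x gain over six years implies doubling about every 2.6 years, while 50x implies doubling roughly every 13 months.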
Visit annotations in context

Tags

hardware-improvement

ai-performance

surprising

Annotators

fxp007

URL

technologyreview.com/2026/04/08/1135398/mustafa-suleyman-ai-future/
www.relvy.ai www.relvy.ai

https://www.relvy.ai

1
1. fxp007 16 Apr 2026
  
  in Public
  
  We improved Claude's RCA accuracy by 12pp on OpenRCA
  
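  Surprisingly, Relvy claims to have improved Claude's root cause analysis (RCA) accuracy on the OpenRCA benchmark by 12 percentage points. This is a fairly significant improvement and suggests AI may have reached a level close to that of human experts in system fault diagnosis.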
  
  surprising ai-performance benchmark
Visit annotations in context

Tags

benchmark

ai-performance

surprising

Annotators

fxp007

URL

relvy.ai
www.microsoft.com www.microsoft.com

https://www.microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  
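  Surprisingly, the ADeLe method can predict AI models' performance on new tasks with about 88% accuracy, including for advanced large models such as GPT-4o and Llama-3.1. This predictive power far exceeds traditional evaluation methods and represents a revolutionary breakthrough for AI performance evaluation, letting researchers anticipate more reliably how a model will perform on unseen tasks.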
  
  surprising ai-performance prediction-accuracy
Visit annotations in context

Tags

prediction-accuracy

ai-performance

surprising

Annotators

fxp007

URL

microsoft.com/en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
developer.nvidia.com developer.nvidia.com

https://developer.nvidia.com/blog/nvidia-ising-introduces-ai-powered-workflows-to-build-fault-tolerant-quantum-systems/

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Ising-Calibration-1 repeatedly outperforms state-of-the-art open and closed models of a range of parameters. As shown in Figure 1, Ising Calibration 1 scores 3.27% better on average than Gemini 3.1 Pro, 9.68% better than Claude Opus 4.6, and 14.5% better than GPT 5.4.
  
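  Surprisingly, Ising-Calibration-1, an AI model built specifically for quantum calibration, outperforms state-of-the-art general-purpose models including GPT-5.4 and Gemini 3.1 Pro on quantum calibration tasks. This suggests specialized AI models may outperform general-purpose ones on specific scientific tasks, overturning the conventional notion that general-purpose AI can do everything.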
  
  surprising ai-performance quantum-ai
Visit annotations in context

Tags

quantum-ai

ai-performance

surprising

Annotators

fxp007

URL

developer.nvidia.com/blog/nvidia-ising-introduces-ai-powered-workflows-to-build-fault-tolerant-quantum-systems/
blog.google blog.google

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts

1
1. fxp007 16 Apr 2026
  
  in Public
  
  Artificial Analysis has also positioned Gemini 3.1 Flash TTS within its 'most attractive quadrant' for its ideal blend of high-quality speech generation and low cost.
  
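  Surprisingly, this model is not only high quality but also highly cost-effective, earning a place in the 'most attractive quadrant'. This suggests Google has made a significant breakthrough in balancing AI performance with commercial viability, which most users would not have expected.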
  
  surprising cost-performance ai-optimization
Visit annotations in context

Tags

cost-performance

ai-optimization

surprising

Annotators

fxp007

URL

blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts
lumalabs.ai lumalabs.ai

UNI-1 | Less Artificial. More Intelligent. | Luma

1
1. fxp007 09 Apr 2026
  
  in Public
  
  Uni-1 ranks first in human preference Elo for Overall, Style & Editing, and Reference-Based Generation, and second in Text-to-Image.
  
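  Surprisingly, UNI-1 performs this well in human preference evaluations: it ranks first in Overall, Style & Editing, and Reference-Based Generation, and even ranks second on a basic task like Text-to-Image. This suggests it is a genuinely versatile AI model rather than one that excels only in a specific area.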
  
  surprising ai-performance human-preference
Visit annotations in context

Tags

human-preference

ai-performance

surprising

Annotators

fxp007

URL

lumalabs.ai/uni-1
glassmanlab.seas.harvard.edu glassmanlab.seas.harvard.edu

Intro_to_HCI_20_Automation.pdf

1
1. elglassman 08 Apr 2026
  
  in Public
  
  Cai et al. [117] interviewed 21 pathologists who used a deep neural network to aid in the diagnosis of prostate cancer. The interviews showed that pathologists needed to learn more about the network’s strengths and limitations to use it effectively. They also wanted to know the design objective of the network and the kind of data on which it was trained.
  
  concept: ai-assisted decision making factors influencing human-AI team performance user needs user knowledge desires
Visit annotations in context

Tags

concept: ai-assisted decision making

user needs

factors influencing human-AI team performance

user knowledge desires

Annotators

elglassman

URL

glassmanlab.seas.harvard.edu/annotated_works/Intro_to_HCI_20_Automation.pdf
reducto.ai reducto.ai

https://reducto.ai/blog/reducto-deep-extract-agent

2
1. fxp007 08 Apr 2026
  
  in Public
  
  We've seen customers go from 10-20% field accuracy with a frontier model to 99-100% just by switching to using Reducto's Deep Extract.
  
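  Most people assume that going from a frontier model to near-perfect accuracy requires a fundamental technical breakthrough or large amounts of training data. But the author claims that simply switching to Deep Extract raises accuracy from 10-20% to 99-100%. A gain of this size runs counter to the improvement curve the industry usually expects and suggests existing approaches may have fundamental flaws.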
  
  non-consensus performance-improvement ai-accuracy
2. fxp007 08 Apr 2026
  
  in Public
  
  For the documents that matter most, it gets to 99–100% field accuracy, even out-performing expert human labelers on extraction tasks.
  
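  Most people assume AI systems will always lag behind human experts on document extraction, especially for complex documents. But the author claims Deep Extract can match or even exceed expert human accuracy (99-100%). This is a fairly bold assertion that challenges the consensus that AI cannot surpass human ability in document processing.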
  
  non-consensus ai-performance document-extraction
Visit annotations in context

Tags

ai-accuracy

performance-improvement

document-extraction

ai-performance

non-consensus

Annotators

fxp007

URL

reducto.ai/blog/reducto-deep-extract-agent
arxiv.org arxiv.org

https://arxiv.org/abs/2604.03016

1
1. fxp007 08 Apr 2026
  
  in Public
  
  Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks
  
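  Most people believe today's most advanced multimodal large models are near or beyond human performance on complex tasks. However, the author's data show that even the best model falls well short of expectations on complex real-world tasks, with accuracy dropping sharply from 56.3% overall to 23.0% on Level-3 tasks. This finding challenges optimistic assessments of current capabilities in the AI field and reveals the extreme complexity of real-world multimodal agent tasks.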
  
  counterintuitive performance-gap ai-capabilities
Visit annotations in context

Tags

performance-gap

ai-capabilities

counterintuitive

Annotators

fxp007

URL

arxiv.org/abs/2604.03016
Mar 2026
www.mcgill.ca www.mcgill.ca

Untitled document

5
1. maxhenry 27 Mar 2026
  
  in Public
  
  When the sudden drop to a pianissimo occurred towards the ending of the piece, the perceived arousal responses of CHM and WM dropped slightly but rose again immediately to end on a high arousal. These two groups of listeners appear to have anticipated a return to a loud and majestic close and therefore kept their arousal responses higher than those of the NM.
  
  please highlight anything related to music performance practice
  
  music performance ai-user-approved
2. maxhenry 27 Mar 2026
  
  in Public
  
  CHM, who are more experienced with the instruments and compositional techniques used in Chinese orchestral music, might have had an idea of which features figure more prominently in the communication of particular intentions, and therefore would have more information available for their judgments.
  
  please highlight anything related to music performance practice
  
  music performance ai-user-approved
3. maxhenry 27 Mar 2026
  
  in Public
  
  The perception of affective intentions in music is influenced by the degree of familiarity listeners have with a musical tradition, the content implicated in the music, and the complex sonic environment created by the composer's creation and the musicians' interpretation.
  
  please highlight anything related to music performance practice
  
  music performance ai-user-approved
4. maxhenry 27 Mar 2026
  
  in Public
  
  The version that participants heard was a premier of the work by the Taipei Chinese orchestra.
  
  please highlight anything related to music performance practice
  
  music performance ai-user-approved
5. maxhenry 27 Mar 2026
  
  in Public
  
  The communication of emotions or affect takes place when listeners perceive emotional meaning that is expressed by performers in music (Juslin, 2013a, 2013b).
  
  please highlight anything related to music performance practice
  
  music performance ai-user-approved
Visit annotations in context

Tags

music performance

ai-user-approved

Annotators

maxhenry

URL

mcgill.ca/mpcl/files/mpcl/heng_2026_muspercept.pdf
