Hypothesis

8 Matching Annotations

Last 7 days
a16z.com

https://a16z.com/avoiding-death-on-the-yellow-brick-road/

1
1. fxp007 27 May 2026
  
  in Public
  
  The labs are already routing internally — different model classes for different requests, ensembles under the hood. What they can't do is route across vendors, or evaluate a competitor's model for a specific sub-task, or use an open-source fine-tune for the narrow piece where it's actually best.
  
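  Most people assume the large-model labs hold an absolute advantage and can solve every AI problem. The author argues instead that the labs face a structural constraint in model selection: they cannot evaluate models across vendors or use an open-source fine-tune for a specific sub-task. That leaves an opening for companies focused on a particular domain, which can pick the best model for each sub-task rather than being limited to one lab's own models.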
  
  non-consensus model-selection ai-limitations
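
The cross-vendor selection described in this annotation can be sketched as a small lookup: for each sub-task, pick whichever candidate scored best on that sub-task's evaluation. This is only an illustration under stated assumptions. The vendor names, model names, sub-task labels, and scores below are hypothetical placeholders, not figures from the article.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    vendor: str               # hypothetical vendor label
    model: str                # hypothetical model label
    scores: dict[str, float]  # sub-task name -> evaluation score in [0, 1]


def route(subtask: str, candidates: list[Candidate]) -> Candidate:
    """Return the candidate with the highest score on the given sub-task."""
    evaluated = [c for c in candidates if subtask in c.scores]
    if not evaluated:
        raise ValueError(f"no candidate has been evaluated on {subtask!r}")
    return max(evaluated, key=lambda c: c.scores[subtask])


# Placeholder scores for illustration only; real values would come from
# running each candidate on an evaluation set for each sub-task.
CANDIDATES = [
    Candidate("vendor-a", "general-large", {"summarize": 0.82, "extract-tables": 0.61}),
    Candidate("vendor-b", "general-large", {"summarize": 0.79, "extract-tables": 0.70}),
    Candidate("open-source", "narrow-finetune", {"extract-tables": 0.88}),
]

for task in ("summarize", "extract-tables"):
    best = route(task, CANDIDATES)
    print(f"{task}: {best.vendor}/{best.model}")
```

With these placeholder scores, the general-purpose model wins the broad sub-task and the narrow fine-tune wins the one piece it specializes in, which is the situation the annotation describes.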

Tags

model-selection

non-consensus

ai-limitations

Annotators

fxp007

URL

a16z.com/avoiding-death-on-the-yellow-brick-road/
Apr 2026
artificialanalysis.ai

APEX-Agents-AA Benchmark Leaderboard | Artificial Analysis

2
1. fxp007 10 Apr 2026
  
  in Public
  
  gpt-oss-20B (high): 0.7%
  
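  gpt-oss-20B scored 0.7%: of 452 professional tasks, fewer than 4 passed the evaluation. Against the top model's 33.3%, that is a gap of nearly 50x. It suggests that professional-services agent capability does not improve gradually but follows a distinct "capability ladder": below a certain scale, models fail almost completely on this kind of task. The implication for enterprise AI selection: in professional-services settings, a "good-enough small model" may not exist at all; there are only "usable large models" and "entirely unusable models".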
  
  0.7-percent capability-cliff model-size enterprise-selection
2. fxp007 10 Apr 2026
  
  in Public
  
  Cost (USD) to run the evaluation: GPT-5.4 (xhigh): $1,110, Claude Opus 4.6 (max): $1,055
  
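  A single run of the 452-task evaluation cost $1,110 for GPT-5.4 and $1,055 for Claude Opus 4.6, roughly $2.3 to $2.5 per task. Gemini 3 Flash needed only $596 and scored 27.7% (vs. 33.3% for the top model). This cost-performance data is critical for AI selection decisions: if a use case can accept a success rate of about 27% rather than 33%, Gemini 3 Flash cuts costs by nearly half. In large-scale financial-services deployments, that difference is multiplied thousands of times over.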
  
  cost-analysis 2-dollars-per-task cost-performance model-selection
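
The arithmetic behind the two annotations above can be checked with a short script. It uses only the figures quoted in the annotations (452 tasks, the three run costs, and the 0.7%, 27.7%, and 33.3% scores); the function names are illustrative, not part of any benchmark tooling.

```python
TASKS = 452  # number of tasks in the evaluation, as quoted in the annotations


def cost_per_task(total_cost_usd: float, tasks: int = TASKS) -> float:
    """Average cost of one task given the cost of a full run."""
    return total_cost_usd / tasks


def implied_passes(score_pct: float, tasks: int = TASKS) -> float:
    """Number of passed tasks implied by a percentage score."""
    return score_pct / 100 * tasks


def savings_fraction(cheaper_usd: float, baseline_usd: float) -> float:
    """Fraction of the baseline cost saved by the cheaper run."""
    return 1 - cheaper_usd / baseline_usd


# Run costs in USD, as quoted in the annotations.
RUN_COSTS = {
    "GPT-5.4 (xhigh)": 1110,
    "Claude Opus 4.6 (max)": 1055,
    "Gemini 3 Flash": 596,
}

for name, cost in RUN_COSTS.items():
    print(f"{name}: ${cost_per_task(cost):.2f} per task")

print(f"gpt-oss-20B implied passes: {implied_passes(0.7):.1f} of {TASKS}")
print(f"Top score vs gpt-oss-20B: {33.3 / 0.7:.1f}x")
print(f"Gemini 3 Flash saving vs GPT-5.4: {savings_fraction(596, 1110):.0%}")
```

The per-task averages come out at about $2.46, $2.33, and $1.32, the 0.7% score corresponds to roughly 3 passed tasks, and the Gemini 3 Flash run costs about 46% less than the GPT-5.4 run, consistent with the "nearly 50x" and "nearly half" claims.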

Tags

cost-analysis

enterprise-selection

2-dollars-per-task

0.7-percent

capability-cliff

model-size

model-selection

cost-performance

Annotators

fxp007

URL

artificialanalysis.ai/evaluations/apex-agents-aa
transformer-circuits.pub

Emotion Concepts and their Function in a Large Language Model

1
1. fxp007 09 Apr 2026
  
  in Public
  
  we studied emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation.
  
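  [Inspiration] This paper studies only one model, Claude Sonnet 4.5, but its methodology applies to all large models. That suggests an urgent research agenda: compare emotion vectors side by side across architectures (GPT, Gemini, Qwen, DeepSeek) and ask whether systematic emotional biases appear, for example whether some models are inherently more "anxious" and others more "indifferent". This is not only an academic question but also a practical need for product selection and safety evaluation.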
  
  inspiration cross-model-comparison emotion-audit model-selection

Tags

model-selection

inspiration

emotion-audit

cross-model-comparison

Annotators

fxp007

URL

transformer-circuits.pub/2026/emotions/index.html
Aug 2020
www.nber.org

Measuring Employer-to-Employer Reallocation

1
1. katietaylor_99 11 Aug 2020
  
  in BehSci
  
  Fujita, Shigeru, Giuseppe Moscarini, and Fabien Postel-Vinay. ‘Measuring Employer-to-Employer Reallocation’. Working Paper. Working Paper Series. National Bureau of Economic Research, July 2020. https://doi.org/10.3386/w27525.
  
  is:report lang:en employer-to-employer reallocation EE CPS Current Population Survey survey methodology RIP Respondent Identificaion Policy selection model great recession recovery market COVID-19

Tags

lang:en

reallocation

recovery

Respondent Identificaion Policy

Current Population Survey

CPS

EE

survey methodology

market

COVID-19

selection model

is:report

employer-to-employer

RIP

great recession

Annotators

katietaylor_99

URL

nber.org/papers/w27525
Jun 2020
arxiv.org

Statistical inference of assortative community structures

1
1. ErikStuchly 26 Jun 2020
  
  in BehSci
  
  Zhang, L., & Peixoto, T. P. (2020). Statistical inference of assortative community structures. ArXiv:2006.14493 [Cond-Mat, Physics:Physics, Stat]. http://arxiv.org/abs/2006.14493
  
  is:article lang:en statistical inference assortative community structure network partition modeling model selection assortavity significance

Tags

is:article

lang:en

assortative community structure

modeling

significance

network partition

assortavity

model selection

statistical inference

Annotators

ErikStuchly

URL

arxiv.org/abs/2006.14493
arxiv.org

Clustering - What Both Theoreticians and Practitioners are Doing Wrong

1
1. ErikStuchly 25 Jun 2020
  
  in BehSci
  
  Ben-David, S. (2018). Clustering—What Both Theoreticians and Practitioners are Doing Wrong. ArXiv:1805.08838 [Cs, Stat]. http://arxiv.org/abs/1805.08838
  
  is:article lang:en cluster clustering tool machine learning unsupervised learning theory practice algorithm parameter computational task optimization model selection

Tags

is:article

lang:en

machine learning

clustering tool

model selection

unsupervised learning

parameter

optimization

algorithm

computational task

theory

cluster

practice

Annotators

ErikStuchly

URL

arxiv.org/abs/1805.08838
psyarxiv.com

All About AIC

1
1. Marlene_Wulf 08 Jun 2020
  
  in BehSci
  
  Del Giudice, M. (2020). All About AIC [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/7hmgz
  
  is:preprint lang:en AIC Bayes factor model selection

Tags

lang:en

Bayes factor

model selection

is:preprint

AIC

Annotators

Marlene_Wulf

URL

psyarxiv.com/7hmgz/
