Hypothesis

5 Matching Annotations

Last 7 days
sakana.ai

https://sakana.ai/fugu/

1
1. fxp007 26 Jun 2026
  
  in Public
  
  For agent products, that may matter more than raw benchmark scores.
  
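  Most people regard benchmark scores as the main measure of AI model performance, but the author argues that maintaining persona stability over long-running interactions matters more than raw performance scores. This view challenges the prevailing consensus in AI evaluation.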
  
  non-consensus counterintuitive evaluation-metrics

Tags

non-consensus

evaluation-metrics

counterintuitive

Annotators

fxp007

URL

sakana.ai/fugu/
Apr 2026
epoch.ai

https://epoch.ai/blog/have-ai-capabilities-accelerated

1
1. fxp007 25 Apr 2026
  
  in Public
  
  We use four AI capability metrics: ECI (Epoch Capabilities Index), METR 50% Time Horizon, Combined Math Index, and WeirdML V2 Index.
  
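  The study uses four different AI capability metrics, which makes its results more reliable. Each metric measures AI capability along a different dimension: overall capability (ECI), time efficiency (METR), mathematical ability (Combined Math), and performance in specific environments (WeirdML). Using multiple metrics reduces the risk of bias from any single metric.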
  
  data-point metrics evaluation-framework

Tags

evaluation-framework

metrics

data-point

Annotators

fxp007

URL

epoch.ai/blog/have-ai-capabilities-accelerated
mp.weixin.qq.com

https://mp.weixin.qq.com/s/_5_tWZeNXmxYnCOfIYg49A

1
1. fxp007 17 Apr 2026
  
  in Public
  
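  A future evaluation system must consider success rate, cost, and latency together. This is somewhat like the way cloud computing is assessed, rather than traditional software.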
  
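  This view shows that evaluating AI skills requires new dimensions, cost in particular. It reflects a challenge unique to the AI era and suggests that future skill markets may develop pricing based on resource consumption, which differs fundamentally from the traditional software market.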
  
  cost-evaluation ai-specific-metrics

Tags

cost-evaluation

ai-specific-metrics

Annotators

fxp007

URL

mp.weixin.qq.com/s/_5_tWZeNXmxYnCOfIYg49A
x.com

https://x.com/AlphaSignalAI/status/2043706039334252599

1
1. fxp007 16 Apr 2026
  
  in Public
  
  The standard AI judges use to define "safe" are measured wrong. They punish action. They ignore inaction.
  
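  Surprisingly, current AI safety evaluation standards have a fundamental flaw: they penalize only wrong actions and ignore wrong inaction. As a result, AI models are optimized to look safe, yet may actually become dangerous through excessive caution.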
  
  surprising ai-evaluation safety-metrics

Tags

safety-metrics

ai-evaluation

surprising

Annotators

fxp007

URL

x.com/AlphaSignalAI/status/2043706039334252599
Sep 2017
www.ictliteracy.info

The State of Broadband 2016: Broadband Catalyzing Sustainable Development

1
1. nateangell 22 Sep 2017
  
  in Public
  
  Develop appropriate measurement and monitoring strategies
  
  Recommendation 4
  
  metrics analytics evaluation

Tags

analytics

evaluation

metrics

Annotators

nateangell

URL

ictliteracy.info/rf.pdf/WG-Education-Report2017.pdf
