Hypothesis

3 Matching Annotations

Jun 2026
cognition.ai

https://cognition.ai/blog/frontier-code

1
1. fxp007 08 Jun 2026
  
  in Public
  
  FrontierCode produces 81% less misclassification errors than other leading benchmarks.
  
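  Compared with existing benchmarks, an 81% reduction in misclassification errors is a strong data point for the accuracy and reliability of FrontierCode's evaluation method. It suggests the benchmark is closer to how human developers actually judge code, but the claim comes without a detailed breakdown of the misclassification types.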
  
  data-point statistics benchmark-accuracy

Tags

statistics

benchmark-accuracy

data-point

Annotators

fxp007

URL

cognition.ai/blog/frontier-code
May 2026
huggingface.co

https://huggingface.co/papers/2605.13301

1
1. fxp007 19 May 2026
  
  in Public
  
  achieving gold-medal-level performance on mathematical and physics competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025.
  
  Directly states the model's top-tier performance on prestigious, human-competitive olympiad benchmarks (IMO, USAMO, IPhO), establishing a high bar for success in AI reasoning.
  
  benchmark accuracy performance

Tags

benchmark

performance

accuracy

Annotators

fxp007

URL

huggingface.co/papers/2605.13301
subq.ai

https://subq.ai/introducing-subq

1
1. fxp007 07 May 2026
  
  in Public
  
  SubQ 1M-Preview scores 95% accuracy, compared to 94.8% for Claude Opus 4.6
  
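  On the RULER 128K benchmark, SubQ 1M-Preview reaches 95% accuracy, slightly above the 94.8% of Claude Opus 4.6. This data point suggests that SubQ has reached frontier-level long-context understanding while breaking through the performance bottleneck of traditional quadratic-scaling models.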
  
  data-point benchmark accuracy

Tags

benchmark

accuracy

data-point

Annotators

fxp007

URL

subq.ai/introducing-subq
