Hypothesis

6 Matching Annotations

Jun 2026
cognition.ai

https://cognition.ai/blog/frontier-code

1
1. fxp007 08 Jun 2026
  
  in Public
  
  We achieve an 81% lower false positive rate compared to SWE-Bench Pro.
  
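  An 81% reduction in the false positive rate is a significant, quantified improvement, indicating that FrontierCode evaluates code quality more accurately than existing benchmarks. The data point is persuasive because it compares directly against an existing benchmark and shows the advantage of the evaluation method.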
  
  data-point statistics benchmark-comparison

Tags

statistics

benchmark-comparison

data-point

Annotators

fxp007

URL

cognition.ai/blog/frontier-code
May 2026
subq.ai

https://subq.ai/introducing-subq

1
1. fxp007 07 May 2026
  
  in Public
  
  Research result of 83 and a production model, third-party verified score of 65.9, SubQ 1M-Preview compares favorably with other SOTA models like Claude Opus 4.7 (32.2), GPT 5.5 (74), and Gemini 3.1 Pro (26.3).
  
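  On the MRCR v2 test, the SubQ 1M-Preview production model scores 65.9, well ahead of Claude Opus 4.7 (32.2) and Gemini 3.1 Pro (26.3), though below GPT 5.5 (74). This data point supports SubQ's strength in multi-information retrieval and reasoning, and the production score moves toward the research model's 83.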
  
  data-point benchmark comparison

Tags

benchmark

comparison

data-point

Annotators

fxp007

URL

subq.ai/introducing-subq
nlp.elvissaravia.com

https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f

1
1. fxp007 01 May 2026
  
  in Public
  
  DeepSeek-V4-Pro-Max beats GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks and lands just behind GPT-5.4 and Gemini 3.1-Pro
  
  DeepSeek-V4-Pro-Max outperforms GPT-5.2 and Gemini 3.0-Pro on standard reasoning benchmarks, which indicates a major performance gain for open-source models.
  
  performance-comparison benchmark open-source-model

Tags

performance-comparison

benchmark

open-source-model

Annotators

fxp007

URL

nlp.elvissaravia.com/p/top-ai-papers-of-the-week-f2f
Apr 2026
sakana.ai

https://sakana.ai/fugu-beta/

1
1. fxp007 30 Apr 2026
  
  in Public
  
  GPQAD | 94.4 | 90.9 | 92.7 | 92.4 | **95.1**
  LCBv6 | 90.3 | 92.1 | 92.4 | 90.4 | **93.2**
  SWEPro | 48.4 | 51.2 | _53.4_ | 51.3 | **54.2**
  
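  The performance comparison table shows Sakana Fugu Ultra ahead of its competitors on all three benchmarks: 95.1% on GPQAD (ahead of Gemini 3.1 at 94.4%), 93.2% on LCBv6 (ahead of GPT 5.4 at 92.1%), and 54.2% on SWEPro (ahead of Opus 4.6 at 53.4%). These figures suggest that its multi-model coordination strategy does deliver a performance gain, with a particularly clear advantage on scientific reasoning tasks.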
  
  data-point performance-benchmark model-comparison

Tags

performance-benchmark

model-comparison

data-point

Annotators

fxp007

URL

sakana.ai/fugu-beta/
Feb 2020
github.com

denji/awesome-http-benchmark

1
1. TylerRick 19 Feb 2020
  
  in Public
  
  benchmark tools load testing comparison

Tags

benchmark tools

load testing

comparison

Annotators

TylerRick

URL

github.com/denji/awesome-http-benchmark
Apr 2017
rust-leipzig.github.io

A comparison of regex engines

1
1. lyuha 01 Apr 2017
  
  in Public
  
  rust comparison benchmark regex develop

Tags

rust

benchmark

develop

comparison

regex

Annotators

lyuha

URL

rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/
