Hypothesis

3 Matching Annotations

Jun 2026
deepmind.google

Untitled document

1
1. fxp007 12 Jun 2026
  
  in Public
  
  Building realistic, reproducible environments to evaluate, compare and accelerate progress across all areas of multi-agent safety. This includes virtual marketplaces, simulated ecosystems and multi-organisation workflows
  
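  Sandboxes and testbeds are listed first among the four priority areas, which hints at the underlying predicament: we do not yet have standard, reproducible environments for testing multi-agent behavior. This contrasts with single-model safety research, which has standardized benchmarks such as MMLU and TruthfulQA. Multi-agent safety research today resembles deep learning research before ImageNet: everyone knows the problems exist, but there is no way to compare progress or to accumulate knowledge on a common foundation.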
  
  Tags: testbeds, missing benchmarks, research infrastructure

Tags

research infrastructure

testbeds

missing benchmarks

Annotators

fxp007

URL

deepmind.google/blog/investing-in-multi-agent-ai-safety-research/
sakana.ai

Untitled document

1
1. fxp007 12 Jun 2026
  
  in Public
  
  produces a lineage of warriors, each adapted to a changing environment defined by all of its predecessors
  
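  DRQ's environment definition is dynamic: the "test set" for the generation-N warrior is all of its predecessors. This addresses a fundamental problem with traditional benchmarks, because adversarial evolution automatically generates a curriculum that never saturates. Mapped onto LLM training: if a model's evaluation opponents also keep evolving, the problem of gaming the leaderboard goes away. This is a self-updating framework for measuring capability.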
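
  The idea in this note can be illustrated with a minimal sketch. It is not DRQ's actual implementation: the integer "warriors", the cyclic win rule in `score`, and the names `fitness` and `evolve_lineage` are all hypothetical stand-ins, chosen only to show how each generation's evaluation set grows to include every predecessor.

```python
import random


def score(candidate, opponent):
    """Toy stand-in for one match: 1 if the candidate wins, else 0.

    Assumption: warriors are integers 0..4 with cyclic dominance,
    so no single warrior beats every possible opponent.
    """
    return 1 if (candidate - opponent) % 5 in (1, 2) else 0


def fitness(candidate, lineage):
    """Fraction of ALL predecessors in the lineage that the candidate beats."""
    return sum(score(candidate, p) for p in lineage) / len(lineage)


def evolve_lineage(generations, candidates_per_gen=20, seed=0):
    """Grow a lineage where generation N is evaluated against generations 0..N-1."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    lineage = [0]  # arbitrary initial warrior
    for _ in range(generations):
        # Hypothetical "mutation" step: sample random candidates.
        pool = [rng.randrange(5) for _ in range(candidates_per_gen)]
        # The evaluation set is the whole lineage so far, and it
        # grows by one opponent every generation.
        best = max(pool, key=lambda c: fitness(c, lineage))
        lineage.append(best)
    return lineage
```

  In this toy setup the "benchmark" is never a fixed list: `fitness` is always computed against the current lineage, so a candidate tuned to beat only the most recent warrior can still score poorly against older ones.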
  
  Tags: curriculum learning, evolution, benchmarking

Tags

curriculum learning

benchmarking

evolution

Annotators

fxp007

URL

sakana.ai/drq/
May 2026
huggingface.co

https://huggingface.co/papers/2604.20987

1
1. fxp007 01 May 2026
  
  in Public
  
  Experiments across six game environments show that COSPLAY with an 8B base model achieves over 25.1 percent average reward improvement against four frontier LLM baselines on single player game benchmarks while remaining competitive on multi player social reasoning games.
  
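  Experiments in six game environments show that the COSPLAY framework improves average reward by more than 25.1% over four frontier LLM baselines on single-player game benchmarks, while staying competitive on multi-player social reasoning games.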
  
  Tags: experimental results, performance improvement, benchmarking

Tags

experimental results

benchmarking

performance improvement

Annotators

fxp007

URL

huggingface.co/papers/2604.20987
