TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet.
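The authors claim that a coordinator with only 20K parameters can outperform top-tier large models such as GPT-5. This conclusion runs counter to the industry's prevailing view of the relationship between model scale and capability, and it puts forward a highly challenging, counterintuitive claim.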
