Another secondary summary gives Humanity’s Last Exam: 64.7% vs 53.1%, possibly under different setup/effort/tool conditions.
This is a classic example of cherry-picking data to create a narrative of superiority. By presenting a potentially non-comparable benchmark result right after a definitive one, the author casts doubt on the entire benchmarking exercise, allowing them to pick and choose the numbers that best support the 'Mythos is vastly superior' story while ignoring context.