Will the gap between open-weights and frontier models on GPQA Diamond be at most 7%?

2026

60% chance

At the end of 2026, there will be a model that performs best on GPQA Diamond. There will also be an open-weights model that performs best on GPQA Diamond among open-weights models.

The question resolves positively if and only if the score of the best open-weights model on 0-shot CoT GPQA is at most 7% (i.e. 7 percentage points) less than the score of the best-performing model on 0-shot CoT GPQA.

As of the time of writing, the model that performs best on GPQA Diamond is Claude 3.5 Sonnet, with a score of 59.4. The best-performing open-weights model is Llama 3.1-405B, with a score of 51.1. This would not be sufficient for a positive resolution, as the gap is 59.4 − 51.1 = 8.3 percentage points. If the gap is exactly 7%, the question still resolves positively, but if it is 7.1%, it resolves negatively. The question also resolves positively if open-weights models are at the frontier on GPQA (i.e. if they match or beat closed-weights models).
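
For anyone who wants to check a pair of scores against the resolution rule, here is a minimal sketch. It assumes the gap is measured in percentage points, as in the 59.4 vs. 51.1 example above. The function name `resolves_yes` and the choice to pass scores as strings are illustrative assumptions, not part of the question text.

```python
from decimal import Decimal

# Maximum allowed gap in percentage points (taken from the question text).
THRESHOLD = Decimal("7")


def resolves_yes(best_overall: str, best_open: str) -> bool:
    """Return True if the gap between the two scores is at most 7 points.

    Scores are passed as strings (e.g. "59.4") so that Decimal parses them
    exactly; plain floats could misjudge a gap of exactly 7.0.
    Hypothetical helper, not an official resolution tool.
    """
    gap = Decimal(best_overall) - Decimal(best_open)
    # A zero or negative gap (open weights at the frontier) also resolves YES.
    return gap <= THRESHOLD


if __name__ == "__main__":
    # Scores quoted in the question description: gap of 8.3 points.
    print(resolves_yes("59.4", "51.1"))
```

With the scores quoted in the description the gap is 8.3 points, so the check fails. A gap of exactly 7.0 passes and 7.1 does not, matching the boundary cases stated above.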

Technical AI Timelines

Math

LLMs

Open Source

Currently a 4.2% difference on GPQA Diamond.

@manic_pixie_agi @ArielG @PhilosophyBear @acertain @DimlakGorkehgz would you be against changing the benchmark to GPQA Diamond?

@NiplavYushtun no objections from me

bought Ṁ25 YES

Interesting question! I won't be surprised if the benchmark saturates by the end of 2026.

Maybe worth a question as well 🤔

If GPQA saturates by 2026, then this question would almost always resolve yes. It might be worth asking the question in a more general way, like "will an open-weights model get within 7% of the average performance on leading benchmarks in 2026?"

Makes sense, I guess I wanted to have something concrete, still being influenced by the Metaculus question formulation.

I don't like saying "leading benchmark" because that's quite ambiguous. SWE-bench seems difficult for anything at the moment, but it's not limited to single models.