Will an AI score over 80% on the FrontierMath benchmark in 2025?


Dec 31

21% chance


"Today we're launching FrontierMath, a benchmark for evaluating advanced mathematical reasoning in AI. We collaborated with 60+ leading mathematicians to create hundreds of original, exceptionally challenging math problems, of which current AI systems solve less than 2%.
Existing math benchmarks like GSM8K and MATH are approaching saturation, with AI models scoring over 90%—partly due to data contamination. FrontierMath significantly raises the bar. Our problems often require hours or even days of effort from expert mathematicians.
We evaluated six leading models, including Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro. Even with extended thinking time (10,000 tokens), Python access, and the ability to run experiments, success rates remained below 2%—compared to over 90% on traditional benchmarks."

Technology

Technical AI Timelines

OpenAI

IMO Grand Challenge



As a Math PhD student who uses LLMs for math every day, I can say that there is still a long way to go before LLMs can effectively solve these problems. The only scenario where I see this happening soon is if models like AlphaProof scale significantly more than expected within the next year.

@Grothenfla have you tried using our lord and savior o1-pro?

@MalachiteEagle Not yet... But o1-pro's AIME benchmark is similar to this new model Kimina (86%). I believe these formal theorem provers are the real contenders.

sold Ṁ12 YES

o4 is coming

@MalachiteEagle by end of year we might already have o6 (built into GPT-5 or whatever they will call it)

@sponge can you clarify whether you will use official Epoch results or outside claims?

fk it i'll buy more at 35%

@Bayesian Did something change your mind? It seems like Grok 3 isn't quite at the level of o3 on math and coding benchmarks, even with reasoning enabled.

@TimothyJohnson5c16 polymarket has a market that has given me confidence that I am right

https://polymarket.com/event/ai-model-scores-90-on-frontiermath-benchmark-in-2025?tid=1739933819530

(90% is much harder than 80% bc there's around a 10% error rate, ie ~10% of problems can't be solved correctly according to the benchmark)

@Bayesian Yeah, if I were on Polymarket, I would just call that free money.

@Bayesian Though of course, 10% error rate is also an estimate, right? In the world where an AI model reaches 90%, the error rate is probably a lot lower.

@TimothyJohnson5c16 yeah 10% error is an estimate, the errors they found were like 6% iirc, and they estimated # of errors they would have missed

@Bayesian that's a big deal if there's a 10% error rate though, right?

@NebulaByte I don’t understand your comment, could you rephrase?

@Bayesian what did you mean by 10% error in your comment above?

@NebulaByte i think i mean that about 10% of (question, answer) pairs in the benchmark are erroneous in some way: either the question has some hidden assumptions that make it not necessarily solvable, or the answer is incorrect (the person solving the problem to add it to the benchmark made a mistake in some step), or something like that
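
To put rough numbers on the error-rate argument in this thread, here is a minimal Python sketch. It assumes that an erroneous (question, answer) pair can never be scored as correct, so the maximum achievable score is 1 minus the error rate. The function name `required_accuracy_on_valid` is hypothetical, and the 10% and 6% figures are the estimates quoted in the comments above, not official numbers.

```python
from typing import Optional


def required_accuracy_on_valid(target_score: float, error_rate: float) -> Optional[float]:
    """Fraction of the well-posed problems a model must solve to reach
    `target_score` on the whole benchmark.

    Assumption (from the thread, not an official figure): a fraction
    `error_rate` of problems is erroneous and is always graded as wrong.
    Returns None when the target is above the achievable ceiling.
    """
    ceiling = 1.0 - error_rate  # best possible score under the assumption
    if target_score > ceiling:
        return None
    return target_score / ceiling


if __name__ == "__main__":
    # Compare the 80% and 90% thresholds under the two error estimates
    # mentioned in the comments (about 10% estimated, about 6% found).
    for error_rate in (0.10, 0.06):
        for target in (0.80, 0.90):
            needed = required_accuracy_on_valid(target, error_rate)
            if needed is None:
                print(f"error={error_rate:.0%} target={target:.0%}: unreachable")
            else:
                print(f"error={error_rate:.0%} target={target:.0%}: "
                      f"needs {needed:.1%} of valid problems")
```

Under these assumptions, a 10% error rate means an 80% score needs roughly 89% of the well-posed problems solved, while a 90% score needs essentially all of them. If the true error rate is lower, as suggested above for worlds where a model does reach 90%, the gap between the two thresholds shrinks.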