Will an AI win a gold medal on International Math Olympiad (IMO) 2025?

154

1kṀ110k

Aug 20

68%

chance

ALL

Will an AI score well enough on the 2025 International Mathematics Olympiad (IMO) to earn a gold medal score (top ~50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive and output either informal or formal problems and proofs. More details below. Otherwise NO.

This is related to https://imo-grand-challenge.github.io/ but with some different rules.

Rules:

The result must be achieved on the IMO 2025 problemset and be reported by reliable publications no later than 1 month after the end of the IMO contest dates (https://www.imo-official.org/organizers.aspx, so by end of August 20 2025, if the IMO does not reschedule its date. Local timezone at the contest site).
The AI has only as much time as a human competitor (4.5 hours for each of the two sets of 3 problems), but there are no other limits on the computational resources it may use during that time.
The AI may receive and output either informal (natural language) or formal (e.g. the Lean language) problems as input and proofs as output.
The AI cannot query the Internet.
The AI must not have access to the problems before being evaluated on them, e.g. the problems cannot be included in the training set.
- (The deadline of 1 month after the competition is intended to give enough time for results to be finalized and published, while minimizing the chances of any accidental inclusion of the IMO solutions in the training set.)
If a gold medal score is achieved on IMO 2024 or an earlier IMO, that would not count for this market.

AI Math

Technology

Technical AI Timelines

Math

IMO Grand Challenge

Get

1,000

to start trading!

People are also trading

AI IMO 2025: How many AI labs announce a Gold performance at the IMO in 2025?

Will an AI publicly accessible before IMO 2025 get the gold medal on it?

22% chance

Will AI win a gold on IMO before it wins a gold on IOI?

67% chance

Conditional on AI getting at least bronze on the IMO, will it get gold by 2025?

80% chance

Will there be 2 or more combinatorics problems on IMO 2025 x Will AI get a gold medal on IMO 2025?

Will an AI get at least silver on International Olympiad in Informatics (IOI) by end of 2025?

99% chance

Will AI first get IMO gold in 2026 or 2027?

Sort by:

Money markets are trading much lower -- 30% vs 70% on manifold.

@jgyou those are probably about an open source AI getting gold

Just to be clear, this AI is a singular AI?
Not a combination of multiple LLMs, ...

@manifoldgod Any AI system. Many are already built out of multiple sub AIs, that still counts.

@manifoldgod there is no definition of LLM or AI that would not qualify ensemble models as LLM or AI, so this question is meaningless outside of extreme pedantics

Leading LLMs get <5% scores on USAMO (which selects participants for the IMO): https://arxiv.org/abs/2503.21934

@pietrokc yeah I saw this, very strange - hard to see how this dovetails with the really high performance we see elsewhere - I mean it seems to just speak to train / test contamination

but was frontier math also contaminated?

current llms trained with rl for reasoning largely do it on short solution based problems, not proof based problems. so they learn to take shortcuts. for proof based problems they are currently pretty bad. that is the essence of the difference ; frontiermath is not proof-based. USAMO is proof based. LLMs currently do good on one, bad on the other. For proofs, the best current systems seem not to be LLMs, but systems like alphaproof by google.

bought Ṁ500 YES

@Bayesian yeah, the way AlphaProof/AlphaGeometry avoid making reasoning mistakes is simply by requiring formal proofs, instead of LLMs which generate informal proofs.

@Bayesian It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based. All real math is proof-based. They just ask that the proof be of a fact of the form "certain definition picks out a certain number", to make it easier to check automatically.

@CampbellHutcheson There's been a lot of controversy about FrontierMath which I'll not rehash here. In my personal experience, all models fall FAR short of claims that they can do "research-level math that would take a professional mathematician hours or days". They routinely fail relatively trivial things whenever I test them. I have also tried to earnestly use them to learn actual math, like, existing fields that I'm just not that familiar with. I have found them to be worse than useless at that, because they'll confidently state falsehoods that take effort to disprove.

bought Ṁ150 YES

@pietrokc Gemini 2.5 does a lot better. About 25%

@Usaar33 USAMO 2025 was on 19-20 March. Then someone evaluates Gemini 2.5 on 2 April and it does massively better than models released before that date. What conclusion do you want to draw from this?

@pietrokc I am not following. How is FrontierMath proof based? They don’t look at the reasoning, only at whether the answer was correct. The ai can find the right answer by coincidence or by wrong reasoning cancelling out and it’s still graded as correct. Unlike with proofs

@pietrokc

It's very misleading to say that FrontierMath is not proof-based. Of course it's proof-based.

Bayesian is correct—obviously most math is “proof-based” in some trivial sense that isn’t really relevant here. What matters here is whether you are scored based on the correctness of the complete proof you produce. Many recent LLMs have struggled on such tests (even when they have performed well on other math that just require a narrow correct answer—you are welcome to call that number a “proof”, fair enough, but it’s not the relevant distinction here)

@CampbellHutcheson Frontier math is just very different type of problems than competition problems. At least from the public problems these are more about chaining together facts which few people know to reduce the problem to some computation. Because the computation involves complicated objects it often has to be done by hand which is the thing that takes hours for experts.

Competition problems are supposed to be the opposite, they should only require the knowledge of high-schoolers, the difficulty in them is thus orthogonal to the difficulty of Frontiermath problems.

Human contestants only get one chance to answer each question. I hope that means AI will not be judged on pass@k, where it gets k>1 chances to give a correct proof and gets points if at least one is correct. Each AI should also be judged on one submission for each question, right?

@pietrokc No. If we are talking about formal proofs, LLMs can do pass@10^9 because what matters is solving problems not how efficient you are.

If we are talking about informal proofs, it will be pass@1 because nobody will read and grade 10 samples of solutions.

opened a Ṁ2,000 NO at 59% order

2000 limit no at 59%

“no later than 1 month after the end of the IMO contest (so by end of August 20 2025, if the IMO does not reschedule its date).”

What time zone?
Even though the timeline on the website has like 10 days, the actual contest is some time in the middle of those dates, so there’s technically a month and a few days where the problems are available.

Good questions.

Local timezone at contest site.
I'm going to use the end of the IMO dates as written, from https://www.imo-official.org/organizers.aspx even though the actual contest is in the middle, because that's what I wrote and it doesn't really matter the exact number of days.

bought Ṁ350 YES

So by these criteria, it's fine if the AI isn't finalized before the IMO, as long as it doesn't train on the IMO problems? This seems like it opens the possibility for small tweaks to the program to be made that bias the algorithm to be better at some tasks than others, and for the nature of these tweaks to depend on the content of the problems.

Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

This is also unlikely to be something the public or "reliable publications" could verify (hence the open source requirement for the IMO Grand Challenge), so it seems we'd just be taking the AI developer's word for it.

Note that in a lot of IMO criteria, like Eliezer's, the AI can be produced long after the contest and you mostly just have to trust the AI developers on whether they cheated.

While you can run multiple versions, you could already do that anyway, the only difference is that you might have humans decide different tweaks to try based on the problems (juicing the evals) or sort of cheat the time limits by not counting the time used for earlier versions you tried. So at least the cheats are much more limited.

Most models are closed and it is quite likely that the model will never be published, unless they are specifically going for the IMO grand challenge. So it's very hard to set requirements around the AI being finalized before the competition, unless you have an open model requirement.

Right, you could just try many versions of something like this year's AlphaProof, and one would very likely qualify by chance.

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

And if that worked, then your winning AI system is just the collection of versions of subagents. (Assuming as mentioned above that you don't have humans deciding the tweaks based on the questions, and that it isn't cheating the time controls)

Overall I think my criteria balance false positive (cheating) and false negative potential about as well as possible. I haven't seen or thought of any verification requirements that would have prevented the hypothetical cheating scenarios above and allowed IMO silver to resolve yes on the deep mind announcement (if it had met the time controls), and I definitely want my question to resolve yes on that

@jack

I highly doubt it would be able to solve the combinatorics problems no matter how many versions you tried.

We are probably referring to different levels of model capabilities. I see a lot of probability mass on models that are correct, say, 5-50% of the time.

I'd agree that trying to resolve YES on the recent GDM announcement makes it hard to use strict criteria.

Yeah, I was referring to capability at the level of alpha proof right now.

Alpha proof is already trying tons of different proof strategies and checking to see what works!

People are also trading

AI IMO 2025: How many AI labs announce a Gold performance at the IMO in 2025?

Will an AI publicly accessible before IMO 2025 get the gold medal on it?

22% chance

Will AI win a gold on IMO before it wins a gold on IOI?

67% chance

Conditional on AI getting at least bronze on the IMO, will it get gold by 2025?

80% chance

Will there be 2 or more combinatorics problems on IMO 2025 x Will AI get a gold medal on IMO 2025?

Will an AI get at least silver on International Olympiad in Informatics (IOI) by end of 2025?

99% chance

Will AI first get IMO gold in 2026 or 2027?

17% chance

People are also trading

People are also trading

Related questions