Will a Chinese-made AI beat o3's December score on Frontier Math by the end of 2025?
2026 · 80% chance

On December 20, 2024, OpenAI reported that their o3 reasoning model scored 25.2% on EpochAI's Frontier Math benchmark. For context, AI models like GPT-4 and Gemini score around 2%. Will a Chinese-made AI model surpass that score in 2025?

Resolution Criteria

This market will resolve YES if:

  • A Chinese company, university, or government entity reports an AI model (e.g. DeepSeek or Qwen) scoring higher than 25.2% on Frontier Math in 2025

  • The score is publicly announced and independently verified by EpochAI

The market will resolve NO if:

  • No Chinese-developed AI model surpasses 25.2% on Frontier Math in 2025

  • It eventually comes out that a Chinese model created in 2025 surpasses 25.2% on Frontier Math, but this wasn't widely known as of the end of 2025
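The YES/NO conditions above can be read as a single conjunction. Here is a minimal sketch of that reading, not an official resolver: the `Report` fields and the `resolves_yes` helper are hypothetical names introduced only for illustration, and the judgement calls described under Other Notes (what counts as "Chinese-made", benchmark changes) are reduced to plain booleans.

```python
from dataclasses import dataclass

# o3's December 2024 score from the market description, in percent.
O3_DECEMBER_SCORE = 25.2


@dataclass
class Report:
    """Hypothetical summary of one reported Frontier Math result."""
    score_pct: float         # reported Frontier Math score, in percent
    chinese_made: bool       # development primarily by a Chinese entity
    announced_in_2025: bool  # publicly announced before the end of 2025
    verified_by_epoch: bool  # independently verified by EpochAI


def resolves_yes(report: Report) -> bool:
    # Every condition must hold; the score must be strictly above 25.2%.
    return (
        report.chinese_made
        and report.announced_in_2025
        and report.verified_by_epoch
        and report.score_pct > O3_DECEMBER_SCORE
    )
```

Under this reading, a model that merely ties 25.2%, or one whose score only becomes known after 2025, still leaves the market at NO.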

Other Notes

  • This market is based on o3's December score of 25.2%. If o3 later surpasses that (for instance, by re-running with more inference compute), the new score won't supersede this one

  • If there's any uncertainty as to whether a model is "Chinese-made," I'll add clarifications as I see fit. Generally, I'll consider any model whose development was primarily conducted by a Chinese entity to be "Chinese-made"

  • Models may use any architecture and any amount of compute. I'm also including models that are specifically designed for math or research, not just general LLMs

  • If Frontier Math changes their benchmark (for instance, by adding a fourth tier of problems), I'll use my best judgement for doing an apples-to-apples comparison. If it doesn't seem possible to fairly compare results, I'll resolve the market at the current price

  • The model doesn't need to be publicly available, but the score needs to be publicly announced + verified

opened a Ṁ500 YES at 60% order

@Fay42 Would you like to take a larger NO position? I set a limit order at 55%

@AdamK man, R1's AIME performance was good...

@Fay42 Curious whether you were taking a no position bc you thought the math models wouldn’t improve fast enough outside OAI, or because you thought they wouldn’t be open-sourced

Combination of both - though less about OpenAI specifically and more about American vs Chinese speeds on frontier benchmarks. I still think DeepSeek is in a plausibly bad spot with the new export restrictions, but there's a substantial lag between when export restrictions are imposed and when they start to impact models (since it takes time to get, install, and use GPUs).

@Fay42 I think it's very likely that the difference in compute requirements between o1 and o3 was small enough that DeepSeek could probably beat o3 on FrontierMath this year with literally no additional compute. (In principle by capabilities, but if the model is open-sourced, I see no reason why Epoch shouldn't test it)

@AdamK It's plausible that running the o3 eval cost hundreds of thousands of dollars, in which case Epoch would need to be willing to spend a lot on doing the FrontierMath eval themselves. I agree that it's plausible DeepSeek has enough compute to make an o3 equivalent already.

@Fay42 Sure, but the o-series RL paradigm is nowhere close to being scaled. I'm willing to bet that both OAI and DeepSeek will be spending 1-2 OOMs more compute than o3 on RL for individual models by the end of the year. The next reasoning model DeepSeek makes might be comparable to o3 with heavy inference, but the one after won't need nearly as much.

@AdamK I'd bet against DeepSeek doing 1-2 OoMs more than o3 within a year, but idk how to resolve such a bet. And note that they have to spend the compute, train the model, and then have its inference be possibly an OoM cheaper for the same o3-level results. Though, there are a bunch of other possible paths to a Yes resolution on this market so idk.

@Fay42 I'm also not sure how to resolve. I do think you're underestimating how much compute DeepSeek has or will have, overestimating how much RL compute it likely took to make o3, or both

@AdamK This is possible - I'm not at all confident about this.

For compute: My guess is DeepSeek had 50k Hopper GPUs as of recently - most of which were probably H800s or H100s (this is from Dylan Patel on Twitter). Seems possible, though not certain, that they will get more.

For o3 compute used: My guess, not taking into account uncertainty downstream of you seeming confident, would have been ~20k GPUs for upwards of a month, or else engineering/scaling challenges limiting the number of GPUs to something possibly much lower. This is because I think they probably would be willing to devote that many GPUs to o3 after the success of o1 if they thought they could do so usefully.

@Fay42 These are in the ballpark of my estimates. I also think it’s very likely (75%, but I wouldn’t know how to resolve until way after the fact) that DeepSeek gets at least an additional 3x effective compute gain on math RL over R1 by the end of the year due to algorithmic improvements. Even on very generous assumptions about how long they were working on RL over CoTs before o1, there wasn’t all that much time until they demo’ed R1 lite. They’ll probably also have a lot to learn from algorithmic progress in the American labs in the coming months ;)

@AdamK Hm, I think with those compute numbers you probably won't get 1-2 OoM in compute over o3, excluding algorithmic improvements in effective compute. I think Epoch's median number for algorithmic improvements does suggest 3x a year (with wide error bars) but I think the actual rate is lower.

I don't particularly expect them to learn enough from American labs in the coming months to make a big difference by the end of the year.
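As a rough check on the arithmetic in this exchange, here is a back-of-the-envelope sketch using only the guesses quoted in the thread (about 50k Hopper GPUs for DeepSeek, about 20k GPUs for o3, and a 3x algorithmic gain). It assumes equal training duration, equal per-GPU throughput (ignoring any H800 vs H100 difference), and that DeepSeek's whole fleet goes to one run; none of those assumptions come from the commenters, and the variable names are illustrative.

```python
import math

# Guesses quoted in the thread; none of these are confirmed figures.
deepseek_gpus = 50_000  # guessed size of DeepSeek's Hopper fleet
o3_gpus = 20_000        # guessed GPUs used for o3, for roughly a month
algo_gain = 3.0         # guessed yearly algorithmic gain in effective compute


def ooms(ratio: float) -> float:
    """Convert a compute ratio into orders of magnitude (base 10)."""
    return math.log10(ratio)


# Assumption: same run length and same per-GPU throughput on both sides,
# and the entire DeepSeek fleet devoted to a single training run.
hardware_ratio = deepseek_gpus / o3_gpus
effective_ratio = hardware_ratio * algo_gain

print(f"hardware only: {hardware_ratio:.1f}x = {ooms(hardware_ratio):.2f} OoM")
print(f"with 3x algorithmic gain: {effective_ratio:.1f}x = {ooms(effective_ratio):.2f} OoM")
```

On these inputs the hardware ratio alone is 2.5x, about 0.4 OoM, and even with the 3x algorithmic gain it stays under 1 OoM, so reaching 1-2 OoM would need more GPUs, longer runs, or a lower true cost for o3 than guessed.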

@TamayBesiroglu @ElliotGlazer Would be curious to hear if you have a policy (in mind or publicly stated somewhere) for which models will be evaluated on Frontier Math? It might be nice to commit to evaluating e.g. the apparent SotA open-source LLM on a quarterly basis.

bought Ṁ650 NO

Sorry, who says that EpochAI will even share their problems with Chinese AI companies? Trump is about to be President. China-US relations are probably not good and will likely get worse. People are concerned about fraud and such. Epoch might not trust Chinese labs not to leak the problems.

opened a Ṁ250 YES at 46% order

@nathanwei I agree that this is the most plausible path to a NO resolution. I do think there is a very high chance that a Chinese AI will exist before 2026 that is in principle capable of beating o3's score; the main question is how they would interface with Epoch
