If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves YES.
I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically to do so (just as humans aren't chess-playing machines). Some previous comments I made:
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature the creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move, but it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market, and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
Related questions
@MP Would you resolve YES if someone made an LLM to be a gaming master? That is, it's made and marketed to be a master at all board games, and typically outputs game states instead of conversation. Similar to how code LLMs are still LLMs, where they are expected to output code instead of conversational language.
Does the market resolve NO or N/A if a super GM doesn't play one of the latter models?
i think the “chess engine” part is difficult here. a fine-tuned llm should be okay here, since i think the main issue is you need a powerful model (e.g. llama3 or gpt-4) fine-tuned so it knows to play “good chess”. otherwise, i’m concerned the rlhf / instruct step just won’t let you easily access the behavior you want.
so, i’m at 90% if fine-tuning is allowed without counting as an engine, and 20%-30% if it is not.
@LiamThomas Someone got GPT-3.5-instruct to play at an 1800 level, do you think it's that much harder for one to get to 2800?
@LiamThomas The thing I am referring to is this: https://x.com/grantslatton/status/1703913578036904431?s=46&t=b2S4mwbwk2fTmKFQAmEEsg
where he says it never made illegal moves. Yes 2800 is much tougher than 1800, but 2028 is pretty far away and this is GPT-3.5.
@LiamThomas Sure, it's a big difference, a good club player would never beat Magnus Carlsen. But you also can't expand the brain of the good club player 100x, or train them on billions of chess games, the way you can with LLMs. The illegal moves thing I expect to be solved with smarter models and I don't think represents a fundamental limitation (I would find it pretty hard to not make illegal moves personally if all I was given was a list of moves and couldn't see a board).
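For scale, the standard Elo expected-score formula (the usual rating model, not anything specific to this market) quantifies how large the 1800-to-2800 gap actually is:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 1000-point gap is enormous: an 1800 against a 2800 is expected to
# score about 0.003 points per game (roughly one draw every ~160 games).
print(round(expected_score(1800, 2800), 5))  # -> 0.00315
```

So "1800 to 2800" isn't one more step of the same size as "novice to 1800"; under the rating model it's the difference between occasionally drawing and essentially never scoring.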
@dominic I don't think playing chess should be particularly easier for an LLM than any other intellectual task. If an LLM can surpass peak human chess playing, it can likely surpass peak human ability at any intellectual task, and we'll have reached the singularity at that point. Which may be true. But I think it'd be very unlikely.
@dominic i did the same and had it playing against bestfish on lichess and it held its own for 28 moves.
@dominic Don’t forget the question is not “is this possible?” but “will this happen?” One of a relatively small pool of people has to agree to a game meeting very specific conditions.
@LiamThomas Comparing humans as if they occupy a meaningful range on the universal scale feels silly. The gap in speed between me and Usain Bolt is dwarfed by the jet engine.
@JimHays That's a good point re: will it happen, but I'd only expect it to be an issue if it turns out that by 2028 the only program that can do it is some obscure fine-tune from an anonymous Twitter user. If GPT-7 or whatever can do it, probably someone will play it at some point. Even in the worst case, there's probably a way to pay, like, $50 and get a game set up against one of the streamer players.
@JimHays quite honestly it’s hard to interpret exactly what this means in the context of an llm other than something about how the multiturn is supposed to work
@JimHays oh i assumed it only applies to the model, and basically was suggesting it needs to be multi-turn, one move at a time, rather than the whole position in each pass
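As a sketch of what that multi-turn, one-move-per-pass setup could look like, here is a minimal harness. Everything in it is an assumption for illustration: the `ask_model`/`ask_human` callables, the SAN regex (which checks only the *shape* of a move, not its legality — a real harness would still need a rules engine such as python-chess to adjudicate), and the forfeit-on-unparseable-reply rule.

```python
import re
from typing import Callable, List, Optional

# SAN-shaped move token, e.g. "e4", "Nf3", "O-O", "exd5", "e8=Q".
# This is a shape check only; it does NOT verify legality.
SAN_RE = re.compile(r"\b(O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)")

def extract_move(reply: str) -> Optional[str]:
    """Pull the first SAN-shaped token out of a free-form reply."""
    m = SAN_RE.search(reply)
    return m.group(1) if m else None

def play(ask_model: Callable[[List[str]], str],
         ask_human: Callable[[List[str]], str],
         max_moves: int = 200) -> List[str]:
    """Multi-turn loop: each pass sends the full move history so far and
    asks for exactly one reply move.  An unparseable reply ends the game."""
    history: List[str] = []
    for _ in range(max_moves):
        for side in (ask_human, ask_model):
            move = extract_move(side(history))
            if move is None:      # no move found: treat as a forfeit
                return history
            history.append(move)
    return history

# Example with stub players (no real LLM attached):
stub_human = lambda hist: "e4"
stub_model = lambda hist: "I resign."   # no SAN token -> game ends
print(play(stub_model, stub_human))     # -> ['e4']
```

The point of the one-move-at-a-time design is that the model sees the full game history each pass but commits to exactly one move, which is the closest analogue to how a human plays blind chess.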