By when will Kenshin9000 (or anyone else) “defeat all chess bots” using LLMs? (Permanent)
16%
2024, by Election Day
12%
2025 or earlier
19%
2026 or earlier
24%
2027 or earlier
26%
2028 or earlier
26%
2029 or earlier
26%
2030 or earlier
42%
2040 or earlier
46%
2050 or earlier
50%
2100 or earlier

This market resolves each option as NO if its date passes and Kenshin9000 (or anyone else) has not defeated Stockfish with an LLM-based chess engine.

All remaining options resolve YES once an LLM-based engine defeats Stockfish (or whatever the top engine is at that time).

My resolution criteria are more strict than Mira’s:

  1. The LLM engine must have a higher Elo rating than the latest Stockfish (or whatever the top engine is at resolution time) at blitz time controls with 99.9% confidence, and the result must be reproduced by 3+ people.

  2. The LLM engine must not use another chess engine at runtime.
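
One way to read the 99.9% confidence requirement in criterion 1 is as a one-sided test on the match score. This is a minimal sketch under that assumption, not the market's official procedure. The function names `elo_diff` and `is_stronger` are hypothetical, and games are treated as independent, which paired-opening matches would violate.

```python
import math
from statistics import NormalDist


def elo_diff(score: float) -> float:
    """Elo difference implied by a mean score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def is_stronger(wins: int, draws: int, losses: int, confidence: float = 0.999):
    """Return (verdict, lower bound on mean score, Elo estimate).

    Assumption: a one-sided normal approximation over independent games.
    """
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score, which takes values 1, 0.5, or 0.
    var = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * mean ** 2
    ) / n
    se = math.sqrt(var / n)
    z = NormalDist().inv_cdf(confidence)  # about 3.09 for 99.9%
    lower = mean - z * se
    if 0.0 < mean < 1.0:
        estimate = elo_diff(mean)
    else:
        # A perfect or zero score has no finite Elo estimate.
        estimate = math.copysign(math.inf, mean - 0.5)
    # The engine counts as stronger only if even the lower bound beats 50%.
    return lower > 0.5, lower, estimate
```

For example, `is_stronger(600, 300, 100)` gives a mean score of 0.75 (roughly +191 Elo) and passes, while a 52-48 result over 100 games does not come close to clearing the bar.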

For the purposes of this market, Large Language Models are 100M+ parameter general-purpose generative text models. A fine-tune of an LLM is ok, but the model cannot be solely trained on chess data. An LLM-based engine may use search, but node evaluation must be performed by invoking the LLM on each node (similar to AlphaZero, which is a DNN+search).
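
The engine shape described above (search plus LLM node evaluation) can be sketched as follows. To stay self-contained, this uses a toy subtraction game instead of chess: players alternately take 1 to 3 stones, and whoever takes the last stone wins. `stub_llm_evaluate` is a hypothetical stand-in for a real model call; everything here is illustrative, not any actual engine's design.

```python
from typing import Callable, List

Evaluator = Callable[[int], float]


def legal_moves(pile: int) -> List[int]:
    """Moves in the toy game: take 1, 2, or 3 stones, if available."""
    return [k for k in (1, 2, 3) if k <= pile]


def stub_llm_evaluate(pile: int) -> float:
    """Hypothetical stand-in for an LLM call; scores the side to move.

    A real engine would serialize the position into a prompt and parse
    a score from the model output. This stub only knows that facing an
    empty pile means the side to move has already lost.
    """
    return -1.0 if pile == 0 else 0.0


def negamax(pile: int, depth: int, evaluate: Evaluator) -> float:
    """Plain negamax; every leaf score comes from `evaluate`."""
    moves = legal_moves(pile)
    if depth == 0 or not moves:
        return evaluate(pile)
    return max(-negamax(pile - m, depth - 1, evaluate) for m in moves)


def best_move(pile: int, depth: int, evaluate: Evaluator) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    return max(
        legal_moves(pile),
        key=lambda m: -negamax(pile - m, depth - 1, evaluate),
    )
```

The search itself contains no game knowledge beyond the rules; all judgement about positions comes from the evaluator, which is the part the description requires to be an LLM. Swapping `stub_llm_evaluate` for a handcrafted or chess-specific evaluation function would be the kind of engine the market excludes.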

The LLM engine and Stockfish will run on the same hardware with the same time controls. The testing hardware should be either a commodity desktop or equivalent to the TCEC or other popular chess software tournament standards.

opened a Ṁ550 2024, by Election Day YES at 11% order

sell order at 11%

opened a Ṁ25,000 2030 or earlier NO at 26% order

Stockfish already uses a deep neural net to evaluate terminal nodes in the search. But it is specialized for chess. A net trained on internet text will never be competitive with the specialized one.

Actually, Stockfish doesn’t have a deep NN, it uses NNUE, I think with like four layers these days. LeelaChessZero does have a deep net so your point still stands (and it prolly would even if Leela didn’t have a deep net)

But, concept pinning!

Interesting error, maybe gone after my next trade. 😁

bought Ṁ25 2100 or earlier NO

Doing my bit to stop this absurdity

Maybe the question should say "Stockfish", not "all chess bots"

@Daniel_MC Nope, see the description. Each option resolves against the top engine at the time.

Ah my bad

opened a Ṁ500 2024, by Election Day NO at 10% order

@someonec5dd very confident in Kenshin...

No incentive for long term bets anymore, but I think this market should be much lower

@someonec5dd the jig is up, no more free Mana 📉

@jgyou what?

@jgyou lol, that's not the end of free mana. More like the end of loans.

What is the definition of an LLM engine? A model trained on generic textual data? Can it have other components, or must it be a pure LLM?

@Weezing An LLM engine is a chess engine which uses an LLM for node evaluation. It may still use search, but can’t use a non-LLM evaluation function.

@Paul And what is LLM in this context? Can it be trained just on chess specific text (for example chess notation)? Or just generic text?

@Weezing Great question. I’d say that it has to be a /language/ model, meaning general purpose, not chess-only training. A fine tune of a general purpose language model is fine, but a chess-only transformer model is not.

@Paul Wait a sec, this is completely different than what I thought the market was about when I bet! I thought we were betting on whether an LLM by itself could defeat Stockfish, not a search engine that uses an LLM just for node eval. I wouldn't think of that as an LLM engine.

Like, taking AlphaGo as an example, it uses a neural net to direct the Monte Carlo tree search, so it's like half a neural net engine - the other half being the Monte Carlo tree search of course, which is also crucial to its success. I think calling AlphaGo a "neural network engine" would still be misleading. But using an LLM just for node eval is far less an LLM engine than AlphaGo is a neural network engine.

Also, what's stopping someone from just running the LLM engine with a ton more compute (and so more depth) than Stockfish and "winning" that way? Are you requiring that they use the same amount of compute?

Btw I think the question of whether LLM+search can beat stockfish is much more interesting (because it's more plausible to actually happen), I just think it's extremely unclear from the question description.

@jack thanks for the feedback. I have updated the description to clarify the engine definitions and hardware/timing constraints.
