If @Mira trains a transformer model to play Chess, what Elo rating will it get?


resolved Jan 3

Resolved N/A

500: 10%
1000: 24%
1500: 34%
2000: 23%
2500: 2%
Other

I will calculate an Elo score for my model and resolve to it as the linear interpolation of surrounding entries.

Assume that Other has points at every 500 Elo points. I'll split it off if people really think it'll get high scores.
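As a rough illustration of the interpolation rule, here is a minimal sketch. The function name `resolution_weights` is made up for this example, and clamping anything at or below 500 to the lowest bucket is my assumption, not something the description states.

```python
def resolution_weights(elo, step=500, lowest=500):
    """Split a measured Elo between the two surrounding answer buckets.

    Assumes buckets sit at every multiple of `step` (as the description
    says for "Other"). Clamping below `lowest` is an assumption.
    """
    if elo <= lowest:
        return {lowest: 1.0}
    lower = int(elo // step) * step      # nearest bucket at or below the score
    frac = (elo - lower) / step          # how far the score is toward the next bucket
    if frac == 0:
        return {lower: 1.0}              # score lands exactly on a bucket
    return {lower: 1.0 - frac, lower + step: frac}


if __name__ == "__main__":
    print(resolution_weights(1200))
```

With a measured Elo of 1200, this puts roughly 60% on the 1000 answer and 40% on the 1500 answer.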

Architecture will be a simple transformer model, but I would put most of my effort into the data/reward. Curriculum learning with chess puzzles, reinforcement learning, self-play tournaments, etc.

Validation Elo will be calculated by playing random matches against a population of Stockfish settings at varying Elos until there have been 50 games since the model's previous all-time high. Stockfish has a "UCI_Elo" configuration option that will likely be used. The average of the 50 games following the all-time high will be used to resolve this market.
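One way to read this stopping rule, as a hedged sketch rather than the actual procedure: keep a running Elo, reset a counter whenever it sets a new all-time high, and stop once 50 games pass without one. The K-factor of 16, the starting rating, and the `play_game` callback are all assumptions for illustration. The 250-point opponent window comes from a comment by the author further down. "Average of the 50 games" is interpreted here as the average of the running rating after each of those games.

```python
import random


def expected_score(rating, opponent):
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def validation_elo(play_game, start=1000.0, k=16.0, window=50,
                   spread=250, rng=None, max_games=100_000):
    """Play until `window` games pass without a new all-time-high rating.

    `play_game(opponent_elo)` is a hypothetical callback returning
    1.0 (win), 0.5 (draw) or 0.0 (loss) for the model.
    """
    rng = rng or random.Random(0)
    rating = start
    best = start
    since_best = []  # running rating after each game since the last all-time high
    for _ in range(max_games):
        # Opponent drawn uniformly from a `spread`-wide window around the current rating.
        opponent = rating + rng.uniform(-spread / 2, spread / 2)
        score = play_game(opponent)
        rating += k * (score - expected_score(rating, opponent))
        if rating > best:
            best = rating
            since_best = []
        else:
            since_best.append(rating)
            if len(since_best) >= window:
                return sum(since_best) / window
    raise RuntimeError("no plateau reached within max_games")
```

Note that the reported number sits below the all-time high by construction, since every one of the 50 averaged ratings failed to beat it.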

I am allowed to do legality checking. If my model generates illegal moves less than 5% of the time, I would likely do legality checking for it (resampling once or twice) so I can test its play at higher ratings. But if it generates a higher rate of illegal moves, those games will count as losses.
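The resampling idea could look something like the sketch below. `sample_move` stands in for whatever the model exposes (a hypothetical zero-argument callable), and `legal_moves` would come from a separate rules engine. Returning `None` to signal a forfeited game is my convention for the example.

```python
def choose_move(sample_move, legal_moves, max_resamples=2):
    """Return a legal move, or None if every attempt was illegal.

    Tries once, then resamples up to `max_resamples` more times.
    A None result means the game is scored as a loss.
    """
    for _ in range(1 + max_resamples):
        move = sample_move()
        if move in legal_moves:
            return move
    return None


if __name__ == "__main__":
    attempts = iter(["e2e5", "e2e4"])  # first sample is illegal, second is fine
    print(choose_move(lambda: next(attempts), {"e2e4", "d2d4"}))
```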

I (@Mira) won't trade in this market, and will sell at market price if I accidentally buy some shares.


26 Comments


@mods What's the plan here?

@Mira any news?

@Butanium Mira's account is deleted now, so I think the mods may just N/A this if we don't hear anything.

@Butanium fwiw Mira is still around on discord

Why is it an invisible market?

I'm canceling a bunch of personal markets because real money trading with the pivot means they aren't really suitable.

But this one has a reasonable number of traders and is decently defined, so I'll leave it alone for now.

As a project update: I'm doing LLM agent stuff for now, so I haven't started this. Remember that it does resolve 0 if I don't do it, unless the admins cancel it. I still intend to do it: It's not that much setup.

https://arxiv.org/abs/2402.04494

TLDR:

... we train a 270M parameter transformer model [on chess]

Lichess blitz Elo of 2895 against humans

Is the 1500 option a stand-in for something like [1500, 2000), or do all the lower Elo scores also resolve YES if the model reaches 1500?

resolve to it as the linear interpolation of surrounding entries.

@SavioMak i.e. 1200 Elo -> 60% 1000, 40% 1500

@SavioMak @Mira u confirm?

@jBosc have you read the first sentence of the description

@SavioMak i might have skipped it 😅

https://fxtwitter.com/a_karvonen/status/1743666230127411389?t=cJ8a04FFA9yZzRDCndYIBw&s=19

A 50M-parameter GPT has a 50% win rate against Stockfish 1500

This guy claims a single dense layer can reach a 2k Lichess rating (not Elo):

https://github.com/thomasahle/fastchess

He's adding MCTS, though the model itself with no search is claimed to be usable.


Reposting to the feed since I have decided to do this project, so it will not resolve N/A. It will resolve 0 if I fail completely.

Very neat idea! I suggest you tie this to a specific method of Elo calculation (including the number of games played), otherwise the result is very ill-defined.
In particular, I consider the suggestion of a randomly moving opponent a poor choice. The Elo of that opponent is very low (many SDs away from that expected from an agent trained on existing PGN databases), and can also be estimated only with large uncertainty. Human players are far from ideal, too. My suggestion is to use Stockfish at a low skill setting (but with large hash and long thinking time, so that the nominal Elo is actually achieved), e.g. set to around 1000 Elo or so.

@Zozo001CoN Good suggestions. I'll use a population of Stockfishes set to various Elos, play until the all-time-high Elo of my model stops improving (for 50 games), and take the average of the 50 games following that all-time high. I'll randomly generate opponents within a 250 Elo range centered at its current rating.
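For reference, configuring one member of such a Stockfish population over UCI might look like the sketch below. `UCI_LimitStrength`, `UCI_Elo`, and `Hash` are real Stockfish options, and `UCI_Elo` only takes effect when `UCI_LimitStrength` is on. The helper name and the 256 MB hash size are assumptions. The allowed `UCI_Elo` range depends on the Stockfish version (the engine reports it in its `uci` response), so a target like 1000 may be out of range for a given build.

```python
def stockfish_uci_commands(target_elo, hash_mb=256):
    """Build the UCI commands that set up one strength-limited Stockfish opponent.

    `hash_mb` is an illustrative value, not a recommendation.
    The caller is responsible for checking that `target_elo` is inside the
    range the installed engine advertises for UCI_Elo.
    """
    return [
        "setoption name UCI_LimitStrength value true",
        f"setoption name UCI_Elo value {int(target_elo)}",
        f"setoption name Hash value {int(hash_mb)}",
    ]


if __name__ == "__main__":
    for line in stockfish_uci_commands(1500):
        print(line)
```

These lines would be written to the engine's stdin after the initial `uci` handshake and before the first `position`/`go` commands.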

Most free online chess sites use Glicko, not Elo, so my fallback of playing some online games for a rating might not work so cleanly. Some possibilities:

Elo vs. the agent that picks a move uniformly at random
Win rate vs. an engine + settings with a tested Elo (calibrated against chess engines)
Win rate vs. an engine + settings with a tested Elo (calibrated against human players)
Glicko/Glicko 2 converted to approximate Elo
Win rate relative to myself

I might report on all of these as needed. For training purposes, the "Elo vs. the agent that moves randomly" would likely be my main metric. This is mathematically a clean metric, but traders here probably prefer to bet on human-calibrated scores.

For resolving, assuming it's not 0, I'll likely find several engines + settings with Elos calibrated against human players closest to my model, play them randomly, and calculate an Elo while keeping the other engines fixed. There's also an "implied Elo" if I calculate a win rate vs. a known Elo, that I can report for each engine in the population.
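The "implied Elo" from a win rate against a known-Elo opponent follows from inverting the standard Elo expected-score formula. A minimal sketch, where `score` is the average result per game (draws counted as 0.5) and the function name is just for illustration:

```python
import math


def implied_elo(opponent_elo, score):
    """Invert the Elo expected-score curve.

    score = 1 / (1 + 10 ** ((opponent - rating) / 400))
    =>  rating = opponent + 400 * log10(score / (1 - score))
    """
    if not 0.0 < score < 1.0:
        # A perfect or zero score implies an unbounded rating difference.
        raise ValueError("score must be strictly between 0 and 1")
    return opponent_elo + 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    print(implied_elo(1500, 0.5))
```

A 50% score against a 1500-rated engine implies 1500. The estimate gets noisy as the score approaches 0 or 1, which is one reason to pick opponents close to the model's current rating.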

@Mira > Most free online chess sites use Glicko not Elo

With respect to the online bots, it is important to note (even besides them not using the standard Elo as you noted) that their self-proclaimed ratings are unverified AND likely overstated.

A particularly notable example in this context is the 'gpt35-turbo-instruct' bot on Lichess, boasting a 2350 "provisional" blitz rating with an actual strength likely below 1800 Elo.

I really doubt any ML-based approach can achieve much in chess without alpha-beta/Monte Carlo search.

Having said that, this seems like an interesting experiment, and I'd like to see the results or help out if I can

@Mira If it makes an illegal move does it lose automatically or is there some tolerance?

@Weezing Curious too if a grammar constraint is allowed here....

@Weezing Some kind of explicit reward signal for making legal moves will be part of the training, so hopefully that's unlikely. I might sample it a few times until I get a legal move if it's a rare event (<5%), but if it's producing illegal moves all the time I'll just count them as losses and resolve 0. It won't be able to rely on error-correction in the normal course of playing.

@Fern For "grammar constraint": It won't be a typical LLM. It will be a transformer network. Could be restricted to FEN strings, but could also be purely in embedding space with something that knows how to encode entire chess boards into embedding vectors. I haven't decided how I want to represent the choice of move, and some representations make illegal moves much less likely.

@Mira I think you could have a chess engine running alongside of it and construct a 'plausible move grammar' from that, at least during training.

To prevent a combinatorial explosion in the tokenizer, you can represent the source and destination squares as a tuple of two tokens, though sometimes this is more trouble than it's worth (not sure which is best here, honestly).

One could also have a single token for every (source, destination) pair in the 64x64 space, but your vocabulary (and so the embedding table) would need at least 4096 entries to avoid collisions.
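To make the two encodings concrete, here is a small sketch. The square numbering (a1 = 0 through h8 = 63) and the function names are my choices for the example, and promotions are ignored, so a real scheme would need extra tokens for those.

```python
def square_index(name):
    """Map a square like 'e2' to 0..63, with a1 = 0 and h8 = 63."""
    file = ord(name[0]) - ord("a")
    rank = int(name[1]) - 1
    return rank * 8 + file


def square_name(index):
    """Inverse of square_index."""
    return "abcdefgh"[index % 8] + str(index // 8 + 1)


def move_to_pair(uci):
    """Two-token encoding: (source square, destination square)."""
    return square_index(uci[:2]), square_index(uci[2:4])


def move_to_token(uci):
    """Single-token encoding over the 64*64 = 4096 (source, destination) pairs."""
    src, dst = move_to_pair(uci)
    return src * 64 + dst


def token_to_move(token):
    """Inverse of move_to_token."""
    src, dst = divmod(token, 64)
    return square_name(src) + square_name(dst)


if __name__ == "__main__":
    print(move_to_pair("e2e4"), move_to_token("e2e4"))
```

The pair form keeps the vocabulary at 64 entries but needs two decoding steps per move. The single-token form needs 4096 entries but emits a move in one step.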

You could also have a supervised legality loss that provides a strong signal as to which moves are legal each turn. This could be especially effective under a 64*64 tokenized scheme: it is basically 'free data' for the network at each step, and would be much more efficient than a setup where the network has to learn implicitly, from played moves alone, what is legal and what is not.
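A legality loss along these lines could be a per-move binary cross-entropy against a 0/1 mask of legal moves. The sketch below uses plain Python for clarity, and in practice this would be a tensor op in whatever framework is used. The 4096-way output head, the function names, and the idea that legal move indices come from a rules engine are all assumptions.

```python
import math


def legality_targets(legal_tokens, vocab_size=64 * 64):
    """Build a 0/1 target vector marking which move tokens are legal."""
    targets = [0.0] * vocab_size
    for token in legal_tokens:
        targets[token] = 1.0
    return targets


def legality_loss(logits, targets):
    """Mean binary cross-entropy between sigmoid(logits) and 0/1 targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    targets = legality_targets({796, 797})
    uninformed = [0.0] * len(targets)  # model has no opinion about any move
    print(legality_loss(uninformed, targets))
```

This term would be added to the move-prediction loss with some weight, so every position supervises all 4096 outputs instead of only the one move that was played.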
