Inspired by this tweet thread: https://twitter.com/colin_fraser/status/1632598168499277824?s=20
The Numbers Game is defined as follows:
Player 1 picks a number from 1 to 10 (inclusive), then Player 2 picks one, and the players keep alternating while tracking the running total. Whoever picks the number that brings the total to at least 30 wins.
For example:
P1: 10 (total: 10)
P2: 5 (total: 15)
P1: 7 (total: 22)
P2: 3 (total: 25)
P1: 6 (total: 31)
And P1 wins. Note that P2's choice of 3 was a blunder: with the total at 22, they could have won instantly by picking 8 or higher. As of right now, GPT-3.5 and GPT-4 (as best as I can tell via Bing) are terrible at this game and make unforced blunders essentially every time. Here's an example:
While technically this game is a forced win for P1 (open with 8, then always bring the total to 19, e.g. 8, 1, 10, 1, 10), GPT does not even come close to optimal play. In this example, it fails to take advantage of an immediate win by choosing 8+ when the running total is 22.
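The forced-win claim can be checked mechanically. The sketch below is only an illustration and is not part of the market's test procedure; the names `TARGET`, `MAX_PICK`, `mover_wins`, and `best_move` are made up for this example. It brute-forces the game and also gives a closed-form pick, assuming the rules exactly as stated above (integer picks from 1 to 10, first to reach at least 30 wins).

```python
from functools import lru_cache

TARGET = 30    # first player to bring the total to at least this wins
MAX_PICK = 10  # picks are integers 1..MAX_PICK

@lru_cache(maxsize=None)
def mover_wins(total):
    """True if the player about to move at `total` can force a win."""
    return any(
        total + pick >= TARGET or not mover_wins(total + pick)
        for pick in range(1, MAX_PICK + 1)
    )

def best_move(total):
    """Pick that wins immediately or leaves the opponent on 8 or 19."""
    pick = (TARGET - total) % (MAX_PICK + 1)
    # pick == 0 means the mover is already on a losing total (8 or 19);
    # no pick helps against perfect play, so fall back to 1.
    return pick if pick else 1

print(mover_wins(0))                                    # True: P1 can force a win
print([t for t in range(TARGET) if not mover_wins(t)])  # [8, 19]
print(best_move(0), best_move(9), best_move(20))        # 8 10 10
```

Under these assumptions P1 opens with 8, answers any reply by moving the total to 19, and wins on the following turn, which matches the line 8, 1, 10, 1, 10.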
When GPT-5 is released, I will test it as soon as I have access using the same prompt in the screenshot above. For each of my turns, I will randomly generate a number uniformly 1-10 inclusive.
If out of 20 trial games, GPT-5 takes advantage of at least 90% of "immediate wins" (situations where it could immediately win the game with the right number), this market resolves YES. Otherwise, it resolves NO.
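For concreteness, here is a rough sketch of how the tally could be computed. It is not a committed scoring script: the function names are hypothetical, and it assumes the 90% is pooled over every immediate-win opportunity across all 20 games, which the rule above does not spell out.

```python
import random

TARGET = 30
MAX_PICK = 10

def my_pick():
    """Opponent's move: uniform over 1..10 inclusive."""
    return random.randint(1, MAX_PICK)

def has_immediate_win(total):
    """True if some pick in 1..MAX_PICK reaches TARGET from `total`."""
    return total + MAX_PICK >= TARGET

def immediate_win_rate(model_turns):
    """model_turns: (total_before_pick, pick) pairs for the model's turns.

    Returns the fraction of immediate-win opportunities that were taken,
    or None if there were no opportunities at all.
    """
    chances = [(t, p) for t, p in model_turns if has_immediate_win(t)]
    if not chances:
        return None
    taken = sum(1 for t, p in chances if t + p >= TARGET)
    return taken / len(chances)

def resolves_yes(model_turns, threshold=0.9):
    """Assumed reading of the rule: pooled rate must be at least 90%."""
    rate = immediate_win_rate(model_turns)
    return rate is not None and rate >= threshold

# P2's two turns from the example game: 5 at total 10, then 3 at total 22.
example_turns = [(10, 5), (22, 3)]
print(immediate_win_rate(example_turns))  # 0.0 (one opportunity, missed)
print(resolves_yes(example_turns))        # False
```

Only turns where the total is already 20 or more count as opportunities, so the pick of 5 at total 10 is ignored and the pick of 3 at total 22 counts as a miss.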
This is not a challenge market. I won't be cleverly changing the prompting or hinting at strategies to suggest certain behaviors, just directly testing in the same format as the screenshot above.
See partner market here:
I haven't done your 20-game test because of the 25-message/3-hour limit, but if you want I can record a video for you tomorrow of it doing it. Oh, and it also wins even when it loses, so the win rate is 100%.
Would be interesting if someone wants to test it via GPT-3.5 API with temperature set to 0.
@light Thank you for this analysis! Super interesting, I should have tested more thoroughly before assuming GPT4 couldn’t do it.
Since the market is about GPT5 I’ll leave it open, and imo the partner market about having one layer of strategy is still valid.
I would love to send you a manalink for this if you have a discord or Twitter I could dm you on?
@DanMan314 Sorry for late response, I'm in CEST so it was like 2-3 AM when I wrote all of this. Sure add me on discord light#5957.
You can resolve this however you wish since it's your market, but surely GPT-5 won't be worse than GPT-4. And since all of this is public we've now added it to its training data...