If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.
I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)
Some clarification (28th Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess, without being created specifically for doing so (like humans aren't chess-playing machines). Some of my previous comments follow.
1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM or some related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.
I won't bet on this market and I will refund anyone who feels betrayed by this new description and had open bets by 28th Mar 2023. This market will require judgement.
Update 2025-01-21 (PST) (AI summary of creator comment): - LLM identification: A program must be recognized by reputable media outlets (e.g., The Verge) as a Large Language Model (LLM) to qualify for this market.
- Self-designation insufficient: Simply labeling a program as an LLM without external media recognition does not qualify it as an LLM for resolution purposes.
Update 2025-06-14 (PST) (AI summary of creator comment): The creator has clarified their definition of "blind chess". The game must be played with the grandmaster and the LLM communicating their respective moves using standard notation.
Update 2025-09-06 (PST) (AI summary of creator comment): - Time control: No constraints. Blitz, rapid, classical, or casual online games all count if other criteria are met.
- "Fun game" clause: Still applies, but the bar to exclude a game as "for fun" is high; unusual openings or quick, unpretentious play alone don't make it a "fun" game.
- Super grandmaster: The opponent must have the GM title and a classical Elo rating of 2700 or higher.
Update 2025-09-11 (PST) (AI summary of creator comment): - Reasoning models are fair game (subject to all other criteria).
Update 2025-09-13 (PST) (AI summary of creator comment): Sub-agents/parallel self-calls
An LLM may spawn and coordinate multiple parallel instances of itself (same model/weights) to evaluate candidate moves or perform tree search, including recursively. This is considered internal reasoning and is allowed.
Using non-LLM tools or external resources (e.g., chess engines like Stockfish, databases) remains disallowed.
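To make the sub-agent allowance concrete, here is a minimal sketch of what the permitted pattern could look like, assuming a hypothetical query_model helper standing in for however the LLM is actually invoked. The same weights score each candidate move in parallel, and nothing outside the model (no engine, no database) is consulted.

```python
# Minimal sketch of the allowed "parallel self-calls" pattern: the same
# model is queried once per candidate move and the best-scored move is
# played. query_model is a hypothetical placeholder for however the LLM
# is actually invoked; no chess engine or database is consulted.
from concurrent.futures import ThreadPoolExecutor

def query_model(prompt: str) -> float:
    """Placeholder for a call into the same LLM. It returns a dummy score
    so the sketch runs end to end; a real call would parse the model's
    textual evaluation into a number."""
    return float(len(prompt) % 7)

def pick_move(position_fen: str, candidate_moves: list[str]) -> str:
    prompts = [
        f"Position (FEN): {position_fen}\n"
        f"Evaluate the move {move} for the side to move, from -10 to 10."
        for move in candidate_moves
    ]
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(query_model, prompts))
    # One parallel instance of the model per candidate; highest score wins.
    return max(zip(candidate_moves, scores), key=lambda pair: pair[1])[0]

start_fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(pick_move(start_fen, ["e4", "d4", "Nf3", "c4"]))
```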
Someone below mentioned that the criteria are really specific. This is true, but I also think the conjunction of the specific things is even less likely than the individual things happening on their own. Why would the super GM play blind if the LLM were good enough to make it a challenging game? At the moment it makes for good content because the LLMs just randomly play illegal moves, but if at some point they're actually good, I would expect the standard interface not to be blind chess any more.
In most worlds where LLMs are better than super GMs, I still don't think they ever publicly win a blindfold game against one.
[The] idea is to check whether a general intelligence can play chess, without being created specifically for doing so (like humans aren't chess playing machines).
a model that markets itself as a chess engine (or is called one by the mainstream media) is unlikely to qualify as a large language model.
Using non-LLM tools or external resources (e.g., chess engines like Stockfish, databases) remains disallowed
@ShitakiIntaki the LLM wouldn’t be using Stockfish though, it’d be using a set of weights that just mysteriously happens to act like Stockfish 🤷♀️
such a thing by design would be as generally applicable as an LLM, as long as you encode questions to it as chess puzzles where the answer corresponds to the correct move :P
(lastly, I think such a thing would be hard to market to anyone, so I can’t imagine it being professionally billed as a chess engine)
I am not familiar with the nuances but the proposal doesn't feel like it passes the "smell test" of not being "created specifically" to play chess. 😉
@ShitakiIntaki fair.
counterpoint: you are quoting rules about someone else’s fake money market you don’t control at a cartoon horse on the internet
@ShitakiIntaki There was a quote somewhere about how modern grandmasters have their fathers whisper them opening lines while they are still in the womb.
When most of a GM's childhood is spent studying board evals, it's hard to claim that they aren't in some way specifically designed for the game.
@Lilemont GPT-3 was intentionally trained on Stockfish data in the hope of boosting total IQ, which they later found was of no use.
https://dubesor.de/chess/chess-leaderboard
This leaderboard paints a MUUUCH direr outlook. It says GPT-3.5-Turbo, a 2022 model, plays at the 1200 level, while GPT-5 plays at 1,500. Superhuman chess by generalist LLMs only in 2040.
@MP A 2400 Elo has a 4% chance of winning against a 2700 Elo, and this does happen in tournament chess. So the more accurate linear fit is at 2037.
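For reference, the standard Elo expected-score formula (which counts a draw as half a point) gives roughly 0.15 for a 300-point rating gap; the ~4% outright win figure quoted above is plausible once draws are factored in. A minimal sketch of that calculation:

```python
# Standard Elo expected-score formula. A draw counts as half a point, so
# the weaker side's outright win probability is lower than this number.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

print(f"{expected_score(2400, 2700):.3f}")  # ~0.151 for a 300-point gap
```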
Realistically, I would bet no later than 2035 for the fulfillment of these criteria, assuming no inexplicable (or otherwise) slowdown. Why is no one pumping this market with NO shares if you think what you said is credible? Instead, it's at 52%. FWIW I was a NO holder but I sold for now.
Also, updates: GPT-5 Codex is now at 1596 Elo in mixed mode. GPT-5 was at 1485 Elo on Sep 15.
It should be noted that GPT-5 Codex is 1836 in reasoning (but 1284 in continuation).
@MP It's even worse than that. The ratings there are not standardized against human ratings, and I believe they're vastly inflated wrt FIDE ratings.
@MP Are you saying that the evaluation of playing skill by the best chess player in the world, Stockfish 17.1, is inflated? That's the base rating. It's extraordinarily unlikely that they're off by more than like 30 points.
@Lilemont Stockfish gives a reasonable accuracy score. However, accuracy score does not cleanly convert to rating in general. See here: https://lichess.org/page/accuracy
Additionally, the author converts it using a formula given in citation one here: https://dubesor.de/chess/chess-leaderboard. The formula is:
Initial_Elo = 400 + 200 × (2^((Accuracy-30)/20) - 1)
Where:
- Accuracy = Average accuracy across first 10 non-self-play games (%)
- Accuracy is constrained between 10% and 90%
- Human players start at 1500 Elo regardless of accuracy
- Default fallback: 1000 Elo if no accuracy data available
I have no idea where they got this formula from. It's probably fine for creating a leaderboard where ratings are self-consistent, but unless the author provides data to the contrary, I don't think there's any reason to think this conversion from accuracy to Elo remotely corresponds to the rough correlation that would be found between human accuracy and FIDE Elo.
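Read literally, the conversion quoted above works out to the following sketch (my transcription of the stated formula and constraints, not code from the leaderboard itself):

```python
# Transcription of the leaderboard's stated accuracy -> Elo conversion,
# as quoted above. Illustration only; not the leaderboard's source code.
from typing import Optional

def initial_elo(accuracy: Optional[float], is_human: bool = False) -> float:
    if is_human:
        return 1500.0                      # humans start at 1500 regardless of accuracy
    if accuracy is None:
        return 1000.0                      # default fallback with no accuracy data
    acc = min(max(accuracy, 10.0), 90.0)   # accuracy constrained to 10%..90%
    return 400.0 + 200.0 * (2.0 ** ((acc - 30.0) / 20.0) - 1.0)

# Example: 60% average accuracy over the first 10 non-self-play games
print(round(initial_elo(60.0)))  # ~766
```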
@Lilemont To your edit, I believe they are not off by just 30 points but many, many hundreds of points. Are you a chess player who has played the models? I'm a National Master and I believe there is absolutely no way the models are that strong.
@Lilemont This leaderboard seems to me to be much more accurate: https://maxim-saplin.github.io/llm_chess/
Unfortunately it doesn't have some of the latest models. But, they fix the exact problem that I was talking about: "We've added the Komodo Dragon Chess Engine as a more capable opponent, which is also Elo-rated on chess.com. This allowed us to anchor the results to a real-world rating scale and compute an Elo rating for each model."
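For intuition, the anchoring works roughly like this: once the models play a fixed opponent whose chess.com rating is known, their score against it can be pushed back through the Elo expected-score formula to get a performance rating on that same scale. A minimal sketch of the idea (not the llm_chess project's actual computation; the numbers are hypothetical):

```python
import math

def performance_rating(opponent_rating: float, score: float) -> float:
    """Invert the Elo expected-score formula: given an average score
    (strictly between 0 and 1) against a single rated opponent, return
    the rating that would be expected to produce that score."""
    return opponent_rating - 400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical example: scoring 25% against an engine rated 1600
print(round(performance_rating(1600, 0.25)))  # ~1409
```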
@DanielJohnston This leaderboard assigns negative Elo, which is usually not done in chess.
No, the conversion based on accuracy is not perfect, but it is approximated from the performance of human players and the correlation between accuracy and rating.
@Lilemont You're right that negative Elo is usually not done. However, that's for practical rather than theoretical reasons. The USCF actually used to have negative ratings a few decades ago, but removed them (I'm guessing for psychological reasons lol) and instead instituted a rating floor at 100. Having negative ratings rather than a rating floor actually makes the rating system more accurate and consistent, as a rating floor creates inflation. FIDE long avoided this problem by simply bumping anyone below their minimum rating off the list entirely.
Yes, a conversion based on accuracy could be relatively accurate, particularly if it takes into account position complexity, whether the opening is already known, etc. My understanding is that Ken Regan has a way to measure rating quite accurately based on games. However, as I said, I don't know the source of the conversion formula used in the leaderboard we're talking about and I don't know of any reason to think it's an accurate one.
@MP If this leaderboard is any guide, there was a 600-point gap between o1 (Dec/24) and GPT-5. Let's say AI improves 600 points per year. You can then imagine an LLM playing at super-GM level in 2028, and superhuman chess in 2029.