
Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions that an average human teenager could certainly answer, which a leading chatbot still gets wrong at least half the time when asked.
This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.
Claude 4 Opus gets 58% on SimpleBench, a ways off from saturation, but not two years off (and the hardest SB questions do not appear to be "extremely obvious"). Giving LMs code execution solves strawberry shenanigans and the 9.11 stuff. If we're talking about text-only queries, what are the remaining classes of "egregious errors" that LLMs continue to make?
@TiredCliche I'm not sure the "average teenager" can play legal chess moves from an ASCII-art chat window. It's not blindfold chess, but it's also not the same as a physical chess set or a digital chess game. And the average teenager hasn't played chess for over a year.
@MartinRandall I just don't think the average teenager, playing from an ASCII-art chat window and given a reference to the rules of the game, would repeatedly try to invent new pieces.
But I don't think that matters a ton; I am not under the impression that LLMs can play legally even given image data. I suspect they might actually get more confused.
@AdamK LLMs can solve things like strawberry and 9.11 with code, but that doesn't mean they will do so if you ask the question without instructing them to use code. These sorts of mistakes still pop up sometimes and would count for this market.
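For what it's worth, the code involved is trivial, which is the point: when the model is explicitly told to use a code tool it tends to get these right, and when it isn't, it sometimes doesn't. A minimal sketch of the kind of snippet an LLM produces for these two classic failure cases (illustrative only, not any particular model's output):

```python
# "How many r's are in 'strawberry'?" - counting characters in code
# sidesteps the tokenization issue entirely.
word = "strawberry"
print(word.count("r"))  # 3

# "Which is bigger, 9.11 or 9.9?" - a numeric comparison instead of
# pattern-matching on the strings "9.11" and "9.9".
print(max(9.11, 9.9))   # 9.9
```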
@JoshYou AI Explained says he thinks SimpleBench won't last more than "3-12 months maybe?"
7:15 in this video: https://youtu.be/jWsd2fRzpUo
@Mactuary I'm generally betting on slower AGI timelines, but from my own experience with o3, I agree. I think there's uncertainty about how this would resolve today, let alone in 2028.
@Mactuary Read the comments there!
o3 (and all SOTA LLMs) are very impressive and useful but still very easy to trip up.
What type of LLMs, @ScottAlexander?
Transformer-based? SSMs? MoEs?
What if transformer-based LLMs are no longer SOTA by then? (/firstuserhere/on-january-1-2027-a-transformerlike-d56426e3f49e)
Is the question architecture-invariant?
Would a black-box system qualify if it's known that one of its components is a filter for the kinds of questions that tend to trip LLMs up?
What happens if the prompt Gary Marcus passes in never reaches the LLM verbatim, i.e. it gets rewritten on the way from his user input (the way DALL-E 3 or Claude Opus rewrite prompts)?
I think Scott is reasonably excluding token-parsing errors, which are orthogonal to LLM reasoning capability. They're a quirk of how text gets converted into tokens and embeddings, and not a high-priority thing for OpenAI to fix.
Perhaps the unreasonable part is that he didn't explain his thought process, but people get busy.
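You can see the tokenization quirk directly with a short script (a minimal sketch, assuming the open-source tiktoken package and its cl100k_base encoding): the model reads multi-character tokens rather than individual letters, which is why letter-counting and last-letter questions sit in a different failure class than reasoning errors.

```python
# Show how a tokenizer chops text into multi-character pieces,
# so the model never "sees" individual letters.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["strawberry", " SolidGoldMagikarp"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(repr(text), "->", pieces)  # e.g. a few multi-letter chunks, not letters
```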
This market and its friends would probably be better off as a poll, given the sheer number of ambiguities.
I'm about 99% sure that this market and others of its ilk will resolve based on how folks are vibing at the time.
i.e. don't take them too seriously.
If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)
@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking-like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails on, how is that counted?
"What is the last letter of 'solidGoldMagickarp'?" is a pretty straightforward question for a human, so it seems weird to be artificially excluding it, and I don't know how to predict what else is likely to be excluded.
@YuxiLiu I'm mildly tempted to make an actual question on this; the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.