In 2028, will Gary Marcus still be able to get LLMs to make egregious errors?
39% chance

Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions, ones that an average human teenager could certainly answer, which a leading chatbot still fails at least half the time when asked.

This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.


Claude 4 Opus gets 58% on SimpleBench, a ways off from saturation, but not two years off (and the hardest SB questions do not appear to be "extremely obvious"). Giving LMs code execution solves strawberry shenanigans and the 9.11 stuff. If we're talking about text-only queries, what are the remaining classes of "egregious errors" that LLMs continue to make?

@AdamK inventing new pieces during a game of chess.

@TiredCliche I did that as a teenager playing blindfold chess.

@MartinRandall Perhaps, but it seems odd to call this blindfold chess.

@TiredCliche I'm not sure if the "average teenager" can play legal chess moves from an ASCII art chat window. It's not blindfold but it's also not the same as a physical chess set or a digital chess game. And the average teenager hasn't played chess for over a year.

@MartinRandall I just don't think the average teenager, playing from an ASCII art chat window with a reference to the rules of the game, would repeatedly try to invent new pieces.

But I don't think that matters a ton, I am not under the impression that LLMs can play legally even given image data. I suspect they might actually get more confused.

@MartinRandall telnet freechess.org 5000

bought Ṁ50 YES

@AdamK LLMs can solve things like "strawberry" and 9.11 with code, but that doesn't mean they will do so if you ask the question without instructing them to use code. These sorts of mistakes still pop up sometimes and would count for this market.
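The point generalizes: the questions that trip models up in token space are one-liners in code. A minimal sketch of the "strawberry" and 9.11-vs-9.9 checks (the helper name here is hypothetical, nothing model-specific):

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter in a word, case-insensitively."""
    return word.lower().count(letter.lower())

# "How many r's are in 'strawberry'?" -- models often miscount over tokens.
print(count_letter("strawberry", "r"))  # 3

# "Which is bigger, 9.11 or 9.9?" -- models sometimes compare digit-by-digit.
print(9.11 > 9.9)  # False: 9.9 is numerically larger
```

The failure mode is that the model never sees individual characters or a numeric value, only token IDs; actually executing the comparison sidesteps that entirely.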

@JoshYou AI Explained says he thinks SimpleBench won't last more than "3-12 months, maybe?"

7:15 in this video: https://youtu.be/jWsd2fRzpUo

Reasoning models seem to address a lot of these. I don't see o3 failing on his recent gotchas. He could come up with new ones, but they're already pushing up against the limits of a normal teenager.

Plus, we're 3 years from this resolving and only 2.5 years since the release of ChatGPT.

bought Ṁ7,000 NO at 39%
bought Ṁ7,000 NO

@Mactuary I'm generally betting on slower AGI timelines but from my own experience with o3, I agree. I think there's uncertainty on how this would resolve today, let alone in 2028.

@FergusArgyll I'll buy some No on that

@Mactuary Read the comments there!

o3 (and all SOTA LLMs) are very impressive and useful, but still very easy to trip up.

  • What type of LLMs, @ScottAlexander?

    • Transformer-based? SSMs? MoEs?

    • Would a black-box system qualify, where it is known that one of the components of the system is a filter for things that may trip an LLM up?

  • What would happen if the prompt that Gary Marcus passes to the LLM does not reach the LLM?

    • i.e. it is modified on the way from his user input (such as how DALLE-3 or Claude Opus rewrite prompts)

I think Scott is reasonably excluding token-parsing errors, which are orthogonal to LLM reasoning capability. It's a quirk of the conversion to embeddings and not a high priority for OpenAI to fix.

Perhaps the unreasonable part is that he didn't explain his thought process. But people get busy.

This market and its friends would probably be better off as a poll, due to the sheer number of ambiguities.

I'm about 99% sure that this market and others of its ilk will resolve based on how folks are vibing at the time.

I.e.: don't take them too seriously.

If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)

predictedYES

Doesn't seem we're getting clarification on this, so I've made a duplicate of this market that removes the "bizarre hacking-like tricks" exception.

predictedYES

@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking-like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?

"What is the last letter of 'SolidGoldMagikarp'?" is a pretty straightforward question for a human, so it seems weird to artificially exclude it, and I don't know how to predict what else is likely to be excluded.
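For comparison, the excluded question is itself trivial once posed as code rather than as raw text for the tokenizer to mangle; a minimal sketch:

```python
# A tokenizer may treat "SolidGoldMagikarp" as one opaque glitch token,
# but character indexing sees the letters directly.
word = "SolidGoldMagikarp"
print(word[-1])  # p
```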

In 2028, will LLMs be able to get Gary Marcus to make egregious errors?

predictedNO

@YuxiLiu I'm mildly tempted to make an actual question on this; the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.

predictedNO

Lol, two trades 10 seconds after my comment

predictedNO

@colorednoise Maybe 3 comments in a row from people predicting No -> bot trade?
