In what year will AI achieve a score of 95% or higher on the GPQA benchmark?

Background

GPQA is a graduate-level, 448-question multiple-choice benchmark covering biology, chemistry, and physics. It was designed to be “Google-proof”: the questions resist web-search shortcuts and challenge even PhD holders, with human experts averaging roughly 65% accuracy and skilled non-experts only about 34%.

Because the GPQA questions are publicly released, leaderboard results come from independent community test harnesses (e.g., Vellum AI, LLM-Stats, xAI livestreams). The best publicly reported AI score as of July 2025 is 88.4%.

Resolution criteria

The market resolves to the first year in which ALL of the following conditions hold:

  1. Score threshold – A single, fully autonomous system achieves ≥ 95% average accuracy on the standard 448-question GPQA set (a worked example of this threshold follows the list).

  2. Verification – The result is confirmed by either

    • a) a peer-reviewed or widely cited paper (e.g., on arXiv or at NeurIPS) that includes full supporting evidence, or

    • b) an official public leaderboard entry (e.g., Vellum AI, LLM-Stats, or a GPQA maintainer-run board).

  3. Autonomy – After evaluation starts, no human may alter answers; chain-of-thought may be hidden, but any tool use (e.g., Python, calculators) must be invoked by the system autonomously.

  4. Expiry – If no qualifying run is verified by January 1, 2030, the market resolves “Not Applicable.”
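
For concreteness, here is a minimal Python sketch of the arithmetic behind the 95% threshold in criterion 1. The 448-question total and the 88.4% current best are taken from the background above; everything else is illustrative.

```python
import math

TOTAL_QUESTIONS = 448    # standard GPQA main set (from the background above)
THRESHOLD = 0.95         # accuracy required by criterion 1
CURRENT_BEST = 0.884     # best publicly reported score as of July 2025

# Minimum number of correct answers needed to reach >= 95% accuracy.
min_correct = math.ceil(THRESHOLD * TOTAL_QUESTIONS)      # 426
# Approximate correct-answer count implied by the current best score.
current_correct = round(CURRENT_BEST * TOTAL_QUESTIONS)   # ~396

print(f"Qualifying run: at least {min_correct}/{TOTAL_QUESTIONS} correct.")
print(f"Current best (~88.4%): roughly {current_correct} correct, "
      f"a gap of about {min_correct - current_correct} questions.")
```

In other words, a qualifying system may miss no more than 22 of the 448 questions.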
