In what year will AI achieve a score of 95% or higher on the GPQA benchmark?

Background

GPQA is a graduate-level, 448-question multiple-choice benchmark covering biology, chemistry, and physics. It was designed to be “Google-proof”: the questions resist web-search shortcuts and challenge even PhD holders, with human experts averaging roughly 65% accuracy and skilled non-experts only about 34%.

Because the GPQA questions are publicly released, leaderboard results come from independent community test harnesses (e.g., Vellum AI, LLM-Stats, xAI livestreams). The best publicly reported AI score as of July 2025 is 88.4%.

Resolution criteria

The market resolves to the first year in which ALL of the following conditions hold:

  1. Score threshold – A single, fully autonomous system achieves ≥ 95% average accuracy on the standard 448-question GPQA set (a worked example of this threshold follows the list).

  2. Verification – The result is confirmed by either

    • a) a peer-reviewed or widely cited paper (e.g., on arXiv or at NeurIPS) that includes full supporting evidence, or

    • b) an official public leaderboard entry (e.g., Vellum AI, LLM-Stats, or a GPQA maintainer-run board).

  3. Autonomy – After evaluation starts, no human may alter answers; chain-of-thought may be hidden, but any tool use (e.g., Python, calculators) must be invoked by the system autonomously.

  4. Expiry – If no qualifying run is verified by January 1, 2030, the market resolves “Not Applicable.”
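
For concreteness, here is a minimal Python sketch of the arithmetic behind the 95% threshold in criterion 1. The 448-question total and the 88.4% current best are taken from the background above; everything else is illustrative.

```python
import math

TOTAL_QUESTIONS = 448    # standard GPQA main set (from the background above)
THRESHOLD = 0.95         # accuracy required by criterion 1
CURRENT_BEST = 0.884     # best publicly reported score as of July 2025

# Minimum number of correct answers needed to reach >= 95% accuracy.
min_correct = math.ceil(THRESHOLD * TOTAL_QUESTIONS)      # 426
# Approximate correct-answer count implied by the current best score.
current_correct = round(CURRENT_BEST * TOTAL_QUESTIONS)   # ~396

print(f"Qualifying run: at least {min_correct}/{TOTAL_QUESTIONS} correct.")
print(f"Current best (~88.4%): roughly {current_correct} correct, "
      f"a gap of about {min_correct - current_correct} questions.")
```

In other words, a qualifying system may miss no more than 22 of the 448 questions.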
