Will any plain transformer model achieve 60% or more on ARC-AGI-2 by 2030?
The inference cost to achieve this result does not matter.
The model that achieves this result must use the same "transformer recipe" that was common between 2023 and 2025: techniques like RLHF, RLAIF, CoT, RAG, and vision encoders are allowed, but any specialized components must themselves be built from vanilla transformer blocks (see the sketch after these criteria). New inductive biases, such as tree search or neurosymbolic logic, would not qualify.
The result must be verified by at least one reputable, unaffiliated organization (ARC, Epoch, OpenAI Evals, an academic lab, etc.) or be publicly re-runnable (e.g., a notebook on Kaggle).
Resolution uses the ARC-AGI-2 evaluation set and scoring script as published on arcprize.org on the day this market opens. Later revisions are ignored.
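For concreteness, here is a minimal sketch of what "vanilla transformer block" means for this question: a standard pre-norm block of self-attention plus an MLP, each wrapped in a residual connection. The hyperparameters and layer choices below (LayerNorm, GELU, the dimensions) are illustrative assumptions, not a spec taken from any particular model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: self-attention + MLP, each with a residual connection."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sublayer with residual connection.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise feed-forward sublayer with residual connection.
        x = x + self.mlp(self.ln2(x))
        return x
```

A model qualifies if it is a stack of blocks like this (plus embeddings, a vision encoder also made of such blocks, etc.); bolting on a search procedure or a symbolic solver does not.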
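The script published on arcprize.org is the authoritative one for resolution. Purely as an illustration of ARC-style exact-match scoring, here is a rough sketch; the data layout and the idea that any one of a small number of attempts can count are my assumptions, not the published script:

```python
def task_solved(attempts: list[list[list[int]]], solution: list[list[int]]) -> bool:
    # A task counts as solved only if some attempt reproduces the target grid
    # exactly: every cell, every row, no partial credit. (Assumed convention,
    # not the official script.)
    return any(attempt == solution for attempt in attempts)

def eval_score(predictions: dict[str, list], solutions: dict[str, list]) -> float:
    # Overall score is the fraction of evaluation tasks solved. A value >= 0.60
    # on the ARC-AGI-2 evaluation set would resolve this market YES, subject to
    # the other criteria above.
    solved = sum(task_solved(predictions[task_id], solutions[task_id])
                 for task_id in solutions)
    return solved / len(solutions)
```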
Honestly, I don't believe that 60% or more on ARC-AGI-2 amounts to AGI in any meaningful sense:
Humans can score 100%, not just 60%.
It's a single benchmark that doesn't test the full breadth of capabilities; it's entirely possible for a system to be good at this benchmark while being useless at other tasks.
I propose renaming the question.