o1-style models like OpenAI's o1 (https://openai.com/o1/) or DeepSeek's R1 (https://x.com/deepseek_ai) are generally much better at math problems, one-off coding problems, and so on than models not trained with RL over chain-of-thought.
However, such models generally seem much worse at other things -- they seem to work less well with pre-existing code, with verbal non-math tasks, with some translation tasks, and so on. This is easily confirmed through vibes-based tests.
This question resolves positively if, on January 1st 2027, this tradeoff has been resolved -- that is, if there's a single model that can handle all the AIME and MATH questions you like, and which you would also be totally happy to have handle any other kind of question.
The question resolves negatively if, like today, on January 1st 2027 you'd prefer to use o1-like models for some large class of tasks, but non-o1-like models like Sonnet or Llama 405B for some other class of tasks -- that is, if it's basically the consensus that there is a large tradeoff between the different classes of models.
Basically:
- Positive -- the RL vs. non-RL tradeoff has been eliminated (even if Anthropic's models are still better at, say, scholastic philosophy than OpenAI's).
- Negative -- the RL vs. non-RL tradeoff has not been eliminated.
(This is one way of trying to operationalize future predictions of "superintelligence")
Update 2025-10-01 (PST) -- Resolution Criteria Clarification:
The model must be the best across all domains at the time of resolution (January 1, 2027), not relative to models at the time of question creation.
For example, it must match the general abilities of a hypothetical Sonnet 4.5, not just the current Sonnet 3.5. (AI summary of creator comment)
What if, in the future, a model called o5 is able to perform even better on the AIME and math benchmarks than o3, but is also widely agreed to be more creative and better in every single way than Claude 3.5 Sonnet?
Would that count? Or would it have to match specifically the general abilities of the non-reasoning models that exist in the same time period?
@LDJ Good question.
I think for this to resolve positively, it needs to be (mostly) the best across all domains at the time of resolution, not relative to the time of question creation. That is, it would need to match the general abilities of a hypothetical Sonnet 4.5, not just the current Sonnet 3.5.
@1a3orn okay, one last clarifying question then:
What if the already-announced o3 model ends up releasing soon for general access and is better on every single metric than Claude 3.5 Sonnet and GPT-4o, and no newer Claude model or GPT model is out by then? Then I guess that would count even if a better Claude model comes out a few days after o3 is made available, right? Because it's what's available at the time of release that matters, right?
Or rather, if it's the announcement time that matters, then we may even have a Claude 4 model be announced and released tomorrow. But as long as the o3 model turns out to be better than Claude 3.5 Sonnet, I think that's all that would matter, since that was the only other best general model available when o3 was actually announced on December 20th.