As judged by me following general expert consensus (not just claims by OpenAI)
@Philip3773733 well, they did some kind of new chain-of-thought training paradigm, but it is true that, as far as I can see, we don't have much information on what exactly it is. You could characterise "breakthrough" either as "how large is the performance gain" or as "how clever/novel is the method".
@Donald This benchmark shows it only marginally improves the score. I mean, sure, it is better, but it also thinks way longer. Comparing to traditional benchmarks is also misleading, because it uses multi-step thinking, which could trivially be added to e.g. Claude as well using AutoGPT or similar; it would be interesting to see a comparison then.
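For context, the kind of multi-step wrapper I mean is just a loop that feeds the model's own intermediate reasoning back in before asking for a final answer. A minimal sketch in Python, where `ask_model` is a hypothetical helper standing in for any chat-completion API (Anthropic, OpenAI, etc.):

```python
# Minimal sketch of a multi-step "think, then answer" wrapper around any
# chat model. `ask_model` is a hypothetical stand-in for a real
# chat-completion call; everything else is plain Python.

def ask_model(prompt: str) -> str:
    """Hypothetical: send `prompt` to a chat model, return its text reply."""
    raise NotImplementedError("wire up your provider's API here")

def solve_with_steps(question: str, max_steps: int = 3) -> str:
    notes: list[str] = []
    for step in range(max_steps):
        # Ask the model to extend its own scratchpad rather than answer outright.
        scratchpad = "\n".join(notes) or "(none yet)"
        thought = ask_model(
            f"Question: {question}\n"
            f"Reasoning so far:\n{scratchpad}\n"
            "Think out loud about the next step only. "
            "If you are confident in the final answer, start your reply with DONE:"
        )
        if thought.startswith("DONE:"):
            return thought.removeprefix("DONE:").strip()
        notes.append(f"Step {step + 1}: {thought}")
    # Step budget spent: force a final answer from the accumulated reasoning.
    return ask_model(
        f"Question: {question}\nReasoning:\n" + "\n".join(notes)
        + "\nGive the final answer only."
    )
```

Whether a loop like this actually closes the gap with o1 is exactly the comparison that would be interesting to see.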
@Philip3773733 I'll take a more comprehensive look at the different benchmarks a bit closer to the resolution date. For example, the academic benchmarks that were provided by OpenAI showed accuracy increases of around 30%, which is pretty huge, although of course relying on OpenAI to benchmark their own product is not how I will resolve this market.
@Donald Do you agree that this resolves YES? Have you seen the new o1 family of models (which seems to be the new name for Q*)?
@Bayesian From the information we have so far, I'd say YES seems likely. I'm not sure we will get more specific information on the o1 model architecture and Q*, which would be nice for a clean decision.