Resolves to the opinion of the AI safety community, whenever information that could resolve this comes out and consensus is reached. Resolution is based on my judgement, or on the judgement of a moderator resolution council if a single person disputes it.
This benchmark shows it only marginally improves the score. I mean, sure, it is better, but it also thinks far longer. Comparing it to traditional benchmarks is also misleading, because it uses multi-step thinking, which could be trivially added to e.g. Claude as well using AutoGPT or similar. It would be interesting to see a comparison then.
@jacksonpolack for the purpose of this question, you count o1 as being Q*, right? OpenAI doesn't need to explicitly mention the old name?
Hm.
In spirit, the idea is whether it's something worth being interested in or nervous about in terms of AI capabilities. So the question is: if I'm thinking about AI safety, or the general rate of AI advancement, should I pay any attention to what Q* is? This is obviously pretty fuzzy, but I don't think there's a less soft way to make a market on the topic, considering I don't know much about what the thing is or what it accomplished.
Some potentially useful clarifications :)
- Does it need to have ultimately been related to the firing?
- Some anchors: e.g., would Transformers, GPT-3, GPT-4, RLHF, AlphaGo, AlphaZero, OpenAI Five, etc. be counted as 'significant capabilities advances'?
- If Q* is a model, does it matter if the underlying approach needs to be subsequently scaled up?