The AI may do retrieval over the internet for the Miklos case, but may not do retrieval for the IMO case. A combination of two models, e.g. a formalizer and a solver, would also qualify.
Any amount of prompt engineering, rejection sampling, or access to auxiliary systems is fine. A prompt that includes, e.g., related IMO problems would, however, be disqualifying.
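For concreteness, a qualifying two-model pipeline with rejection sampling might look roughly like the sketch below. This is purely illustrative; the function names (`formalize`, `solve_formal`, `verify`) are placeholders, not any particular system's API.

```python
# Hypothetical sketch of a qualifying setup: a formalizer model translates the
# natural-language problem into a formal statement, a solver model proposes
# proofs, and rejection sampling keeps only attempts that a checker accepts.
# All functions below are placeholders, not a real API.

def formalize(problem_text: str) -> str:
    """Call a 'formalizer' model to produce, e.g., a Lean statement."""
    raise NotImplementedError  # stand-in for a model call

def solve_formal(formal_statement: str) -> str:
    """Call a 'solver' model to produce a candidate proof."""
    raise NotImplementedError  # stand-in for a model call

def verify(formal_statement: str, candidate_proof: str) -> bool:
    """Check the candidate proof, e.g., with a proof assistant."""
    raise NotImplementedError  # stand-in for a proof checker

def attempt_problem(problem_text: str, num_samples: int = 100) -> str | None:
    """Rejection sampling: draw proof attempts until one verifies."""
    statement = formalize(problem_text)
    for _ in range(num_samples):
        proof = solve_formal(statement)
        if verify(statement, proof):
            return proof
    return None  # no verified proof found within the sample budget
```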
Edit: starting the count at July 25, 2024 (this is slightly arbitrary, since the DM system is not a unified system, but it follows the precedent of other Manifold questions' resolutions).
My rationale for HIGHER is https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer/comment/aA9mewsxZHuWfei7q, although I can't really guarantee it moves the gap to 4 years. It is also unclear how much retrieval will help -- maybe future models will rely heavily on querying embeddings of Internet content.
@dp Interesting, do you think there's an easy-to-describe benchmark which dodges this failure mode?
Separately, how likely do you think it is that this sort of direct 'bashing' setup will actually be constructed? There's a rather short time window for that to happen before a more general LM-like solver is made.
@JacobPfau I don't have an inside view on whether any lab wants to do this. The AlphaCode team made a large, non-generalizable technical effort to beat some benchmarks; if there is a similar drive simply to improve the numbers on math benchmarks, I'd say the probability is >10% that the solution gives the language model query access to some "bashing" solvers.
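To illustrate what I mean by query access to "bashing" solvers, here's a rough sketch. The harness and tool names are made up, and the "solver" is just a narrow sympy call standing in for whatever specialized machinery a lab might actually wire up.

```python
# Hypothetical sketch: a language model that can call out to a narrow
# "bashing" tool (here, a computer-algebra system) while writing its solution.
# The LM emits tool queries; the harness runs them and feeds results back.

import sympy as sp

def bash_tool(query: str) -> str:
    """Very narrow 'bashing' helper: try to solve an equation in x symbolically."""
    x = sp.symbols("x")
    try:
        solutions = sp.solve(sp.sympify(query), x)
        return str(solutions)
    except (sp.SympifyError, NotImplementedError):
        return "no result"

def solve_with_tools(problem_text: str, lm_step) -> str:
    """Alternate between LM reasoning steps and tool calls.

    `lm_step` is a placeholder for a call to the language model; assume it
    returns either a final answer or a tool query prefixed with 'QUERY:'.
    """
    transcript = problem_text
    while True:
        step = lm_step(transcript)
        if step.startswith("QUERY:"):
            result = bash_tool(step[len("QUERY:"):].strip())
            transcript += f"\n{step}\nRESULT: {result}"
        else:
            return step  # the model's final write-up
```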
As for the easy-to-describe benchmark you want, and the related issues with both calculator (or Python) and no-calculator math benchmarks, maybe I'll have some researchy stuff to say about it in the future... but my ideas are too unpolished to communicate right now.