https://arcprize.org/competition
>=85% performance on the private set of Chollet's Abstraction and Reasoning Corpus (ARC), as judged by Chollet et al.
2025 version: https://manifold.markets/JacobPfau/will-the-arcagi-grand-prize-be-clai-srb6t2awj1
The current leaderboard likely underrepresents the progress being made towards achieving the 85% threshold.
I think any winning approach will likely involve synthesizing many new puzzles for training/eval, and it will be both easier and strategically advantageous to test systems on synthesized puzzle sets rather than going through the submission process.
One might only make a couple of submissions in order to model the relationship between scores on one's synthesized dataset vs the official one, and then "go stealth" until one has achieved a system whose score on the synthesized dataset would indicate a score >85% on the official dataset.
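A rough sketch of that calibration idea, assuming the relationship between synthesized-set scores and official scores is roughly linear (an assumption, not something established here). The function names and all scores in the example are hypothetical, made up purely for illustration:

```python
def fit_line(synthetic_scores, official_scores):
    """Ordinary least-squares fit: official ~= slope * synthetic + intercept."""
    n = len(synthetic_scores)
    if n < 2 or n != len(official_scores):
        raise ValueError("need at least two paired scores")
    mean_x = sum(synthetic_scores) / n
    mean_y = sum(official_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in synthetic_scores)
    if sxx == 0:
        raise ValueError("synthetic scores must not all be identical")
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(synthetic_scores, official_scores)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def synthetic_score_needed(slope, intercept, target=0.85):
    """Synthetic-set score that the fitted line maps to the official target."""
    if slope <= 0:
        raise ValueError("fit must be increasing to invert it")
    return (target - intercept) / slope


if __name__ == "__main__":
    # Hypothetical (made-up) paired scores from two official submissions.
    synthetic = [0.20, 0.40]
    official = [0.10, 0.30]
    slope, intercept = fit_line(synthetic, official)
    print(synthetic_score_needed(slope, intercept))
```

With only a couple of submissions the fit is exactly determined and says nothing about noise, so the resulting threshold is a loose guide at best.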
Now that I've bet a not-insignificant amount of mana on this, can someone explain why anyone would bet in favor of this? From what I understand, no one is even close: we wouldn't see an award even if the number of correct answers doubled, and the machines probably got the easiest questions right, so the remaining stretch is even harder.
Is this basically a bet on whether someone cheats? I filled a fuckton of limit orders at 20 percent odds and it feels like the odds would be optimistic at, like, 8?
I'm already invested, someone tell me what I'm missing.
So, writing out my reasoning for buying up to 15%:
- I think how well people performed in the past is not very reliable evidence as to the future, since the 1 million dollar grand prize and recent publicity make it way more likely to receive serious effort.
- Very few benchmarks exist where models underperform median humans, and I don't find the arguments for why this one would be very different compelling.
- Claude 3.5 Sonnet significantly outperforms GPT-4o without scaffolding, so I suspect that either scaling or post-training modifications are in fact pretty helpful for this benchmark.
- Llama 3 400B seems likely to be released this year and potentially helpful.
Good reasoning I think!
My (light) hypothesis is that coding challenges are a good microcosm of basic logic tests of LLMs: they ace old ones and fail miserably on new ones, which makes me think there's direct or indirect memorization at play. Basically, they're overfit. The test results so far make me think that the test is not solvable within the training data.
The million dollar thing being new is something I wasn't aware of, and it affects my willingness to dig myself deeper into a NO position. Thanks, this was a good comment!
https://livecodebench.github.io/ and https://livebench.ai/ suggest that the memorization problem exists but isn't extreme for most of the largest foundation models, many of which perform almost as well on benchmarks released before and after they were trained afaict.
I can tune back in later with something more to back up what I'm saying. Who knows, I might be behind, or it was just a fabrication I believed. But if my memory serves, there are coding challenges ranked easy to hard where old questions are answered with 100 percent accuracy and on new ones it drops to 0. This was GPT-4, if I'm not mistaken. My model of the universe at this moment is one where LLMs don't just memorize some things; they're nothing but memorization with some flexibility around semantics.
That said, I might be wrong
I don't think this will happen, but if we look at the leaked OpenAI Q* / Strawberry papers, they are working on an intelligence that can actually think and reason, so if they succeed they will be able to get this prize without cheating. But I don't think they will succeed in 2024, as it's a very fundamental, underlying problem of general intelligence: true problem-solving needs not just prediction based on previous examples but also the ability to figure out novel stuff on the fly. This would also lead to exponential scientific discoveries imo, but let's see if they do it... Ilya Sutskever also started his company to do exactly this, superintelligence. Oh well.
I made a version of this market which allows for closed source LLMs: https://manifold.markets/RyanGreenblatt/by-when-will-85-be-reached-on-the-p
Here’s someone claiming 100% accuracy on the eval set with a from-scratch transformer: https://x.com/spatialweeb/status/1803950481422848312?s=46&t=fdgdiEzkLwQ2qvItoWggvg
(Doubt this holds up under scrutiny, likely a bug somewhere.)
I think you can a priori assign very low probability to this kind of stuff. If GPT-4 and other models that took 100s of millions of $ in compute and a ton of very good engineers only got to the mid 30s on ARC, it's very unlikely that 1 person will just think of 1 trick that solves deep reasoning and gets to 100%.
This should be interesting: https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt
"No ARC human baseline exists! http://arcprize.org/arc: 'most humans can solve on average 85% of ARC-AGI tasks.' But this study used the train set. http://arcprize.org/guide: 'The public training set is significantly easier than the...public evaluation and private evaluation set'"