Will Grok 2 'exceed current [March 28 2024] AI on all metrics'?
Resolved NO on Dec 25

On March 29, Elon Musk tweeted this (https://twitter.com/elonmusk/status/1773655245769330757):

"Should be available on 𝕏 next week. Grok 2 should exceed current AI on all metrics. In training now."

Is that so? Let's find out.

Note that for this purpose it counts as 'Grok 2' even if it is renamed - the only way a newly announced xAI model does not count as that is if it is named Grok 1.X, or otherwise is clearly pre-2, but the thing in training now counts whatever they ultimately call it, if they release it, etc.


Resolves YES if Grok 2 is released and it exceeds or ties (to 1 decimal place) Claude 3 Opus and all metrics for models available to the public in some form on or before 3/28/24 on everything in this chart:

So MMLU, GPQA, GSM8K, MATH, MGSM, HumanEval, DROP, Big-Bench-Hard, ARC-Challenge and HellaSwag.

Resolves NO if Grok 2 is released and does NOT exceed or tie these numbers on one or more of these metrics, or if Grok 2 is not released by EOY 2025.

If xAI does not test on all of these metrics, but it succeeds on all metrics that it does test, and there is no way to test on the others, I will use best judgment - if it clearly would have exceeded I will still resolve YES, but by default (or if it would have been close) I will assume they chose which metrics to test on based on results, and be inclined to count that as NO. Will clarify further if this gets a lot of interest, as needed.
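A minimal sketch of how the literal rule above could be checked mechanically. It assumes the baseline for each metric is the best score among models public on or before 3/28/24, and it treats any untested metric as a judgment call rather than deciding it. The function name `resolve` and all numbers are hypothetical placeholders, not real benchmark scores.

```python
# The ten metrics named in the resolution criteria.
METRICS = ["MMLU", "GPQA", "GSM8K", "MATH", "MGSM", "HumanEval",
           "DROP", "Big-Bench-Hard", "ARC-Challenge", "HellaSwag"]


def resolve(candidate, baseline):
    """Return (verdict, metrics) under a literal reading of the criteria.

    candidate: scores reported for the new model (may omit some metrics).
    baseline:  assumed best score per metric among models public by 3/28/24.
    """
    # A metric fails if, after rounding both sides to 1 decimal place,
    # the candidate is strictly below the baseline (ties pass).
    failed = [m for m in METRICS
              if m in candidate and round(candidate[m], 1) < round(baseline[m], 1)]
    if failed:
        return "NO", failed
    # Untested metrics are left to the resolver's judgment.
    missing = [m for m in METRICS if m not in candidate]
    if missing:
        return "JUDGMENT", missing
    return "YES", []


# Placeholder numbers for illustration only; NOT real benchmark scores.
baseline = {m: 80.0 for m in METRICS}
candidate = {m: 85.0 for m in METRICS}
candidate["DROP"] = 79.96  # rounds to 80.0, so it counts as a tie
print(resolve(candidate, baseline))  # ('YES', [])
```

Note that this sketch ignores differences in evaluation setup (for example shot count or chain-of-thought prompting), which would still need a human decision.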


I think this is a No.

@ZviMowshowitz What are the official benchmark scores?

LiveBench is out. It's weaker than Claude 3 Opus: https://livebench.ai/

It remains only 7 Elo points above Opus on style-controlled hard questions on the lmsys arena (within the confidence interval).

Updates me toward this resolving no in spirit.
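For a sense of how small a 7-point gap is, here is a rough sketch that converts a rating gap into an expected head-to-head win rate. It assumes the arena ratings sit on the standard Elo scale (base 10, 400-point divisor), which is an assumption about the leaderboard and not something stated in this thread; `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_gap):
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# A 7-point gap corresponds to winning only slightly more than half the time.
print(f"{elo_expected_score(7):.3f}")  # 0.510
```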

Are we not making any progress on getting an actual result? Is this still actually unclear?

Seems like we should find a way to wrap this up soon, but the market reflects real uncertainty on the outcome. Can no one find a way to check?

I've moved the deadline forward to 12/24/24, as a 'this is when I just use my best judgment' date.

@ZviMowshowitz Both DROP and MGSM are considered saturated by OpenAI (https://github.com/openai/simple-evals), and there seem to be significant problems with DROP (https://huggingface.co/blog/open-llm-leaderboard-drop).

Taken literally, I'm at 35% on this -- sheer randomness suggests underperformance on one of these benchmarks.
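As a rough illustration of why randomness alone drags down an "all metrics" bet: if each of the ten benchmarks were passed independently with the same probability, the chance of passing all of them is that probability to the tenth power. Independence and equal per-metric odds are simplifying assumptions made here for the arithmetic, not the commenter's stated model, and the helper names are hypothetical.

```python
def overall_probability(per_metric, n_metrics):
    """Chance of passing every metric if each is passed independently."""
    return per_metric ** n_metrics


def per_metric_probability(overall, n_metrics):
    """Per-metric pass rate implied by a given overall probability."""
    return overall ** (1.0 / n_metrics)


# A 35% overall estimate over 10 metrics implies roughly 90% per metric.
print(f"{per_metric_probability(0.35, 10):.3f}")  # 0.900
# Conversely, 90% per metric gives only about 35% overall.
print(f"{overall_probability(0.90, 10):.3f}")     # 0.349
```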

In the spirit of the question, this is quite likely "yes" - even lmsys (style control) shows Grok at Llama 3.1 405B level, which seems to beat Opus on any metric I can find.

Beating both DROP and GSM8K may be difficult. The first benchmark is known not to progress steadily across the GPT-4 series; the second is so close to saturation that randomness affects results.

Note that resolving this question might require running tests on GPT-4-0125-preview.

bought Ṁ50 NO

Good thing I didn't put in more; I really didn't think you'd compare a newly released model to models from 5 months ago.

I will be holding off resolution for a bit...

What would make you resolve this?

I think we either need a source for tests we can trust, or enough other confidence? Clearly with this at 62% we have neither of these things...

Okay, they are planning to release an API, then we can run the evaluations ourselves

Particularly relevant are the three words after the title, "Of course not."

According to xAI's own benchmarks, Grok 2 does not beat Claude 3.5 Opus.

EDIT: SORRY I MEAN SONNET (counterintuitive since the smaller Anthropic model does better)

bought Ṁ150 YES

Was not available before 3/28

3.5 Sonnet was released in June, so yes, you're right.

What if Grok 2 is bad but gets much better within the year? https://x.com/rohanpaul_ai/status/1823814591157297567
Some think the current thing is "early"

I would suggest holding off resolution until it is independently verified that xAI's claims are true. Elon Musk has a long history of lying about everything.

Cope

I am not invested in the market

bought Ṁ1,490 YES

It beats Opus on every benchmark that is still tested (some of those in the image, like GSM8K, are outdated): https://x.ai/blog/grok-2

bought Ṁ100 YES from 86% to 89%
bought Ṁ100 YES

How do you adjudicate differences in evaluations across models? e.g. "0-shot CoT" vs "4-shot" on MATH in the table? Does Grok 2 have to report the same evaluation type as Claude 3 for each benchmark?

opened a Ṁ100 NO at 79% order

@ZviMowshowitz FYI these choices frequently flip the order of model performance.
