Will there be an LLM which scores above what a human can do in 2 hours on METR's eval suite before 2026?
95% chance
METR has found that current frontier models get a score on their autonomy benchmark roughly similar to a human who is given 30 minutes. Will at least one model score at the level of a human given 2 hours by 2026?
Clarifications:
I will try to resolve this market in accordance with the current task suite. If METR makes the suite harder or easier, I will try to account for this in the resolution of this market.
If I am not able to determine the performance of frontier models at the end of 2025, this market will resolve N/A.
This question is managed and resolved by Manifold.
Related questions
In 2024, will METR or Google announce the results of a METR eval on a Google LLM?
72% chance
Will an LLM be able to solve the Self-Referential Aptitude Test before 2025?
19% chance
Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2026
64% chance
Will an LLM be able to match the ground truth >85% of the time when performing PII detection by 2024 end?
84% chance
Will we see improvements in the TruthfulQA LLM benchmark in 2024?
74% chance
Will an LLM be able to solve the Self-Referential Aptitude Test before 2027?
66% chance
LLM Hallucination: Will an LLM score >90% on SimpleQA before 2026?
60% chance
Will LLMs be better than typical white-collar workers on all computer tasks before 2026?
27% chance
Will there be an LLM capable of performing full-time web application hacking by 2025?
19% chance
Will there be any text-based task that most humans can solve, but top LLMs won't? By the end of 2024
95% chance