Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2026

66% chance

Currently it's possible to craft simple tasks that most humans are able to solve, but LLMs can't. This market predicts whether this will still hold true by the end of 2026.

Even when given a clear explanation of the problem and of an algorithm to solve it, today's LLMs haven't been shown to answer correctly in a reliable way, while arguably most humans would succeed if adequately instructed.

In other words, this market means to compare the reasoning abilities of an average human with the top LLMs, at the end of 2026, in the fairest way I could think of.

Tasks

This market is about tasks that test reasoning abilities: tasks that can be solved using only pen and paper, that can be understood and learned in less than 15 minutes, and that you'd expect motivated kids to be able to solve.

The exact set of tasks that count for this market is of course an open set and ambiguous in nature. In general, if you suspect that a majority of literate people aged between 12 and 70 is likely to be able to solve the task after training for one hour with an expert, then the task most likely counts toward this market. If you have specific tasks in mind, let's discuss them in a comment.

Examples of tasks are:

Solve elementary geometric reasoning problems
Solve easy Sudoku puzzles
Win a game against an AI which is weaker than the median human (e.g. a chess game against an AI rated 600 Elo)
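To make the Sudoku example concrete, here is a minimal sketch of how a submitted answer could be graded. The function name `is_valid_solution` and the grid encoding (9x9 lists of integers, with 0 marking an empty cell) are assumptions made for illustration; the market itself doesn't specify any grading procedure.

```python
from typing import List

Grid = List[List[int]]  # assumed encoding: 9x9 grid, 0 marks an empty cell


def is_valid_solution(puzzle: Grid, solution: Grid) -> bool:
    """Return True if `solution` is a complete, rule-abiding fill of `puzzle`."""
    digits = set(range(1, 10))

    # The answer must be a full 9x9 grid.
    if len(solution) != 9 or any(len(row) != 9 for row in solution):
        return False

    # Every given clue in the puzzle must be left unchanged.
    for r in range(9):
        for c in range(9):
            if puzzle[r][c] != 0 and puzzle[r][c] != solution[r][c]:
                return False

    # Each row, column and 3x3 box must contain the digits 1..9 exactly once.
    rows_ok = all(set(row) == digits for row in solution)
    cols_ok = all({solution[r][c] for r in range(9)} == digits for c in range(9))
    boxes_ok = all(
        {solution[br + dr][bc + dc] for dr in range(3) for dc in range(3)} == digits
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok


# Hypothetical example: a solved grid built from a simple shifting pattern,
# and a puzzle obtained by blanking out its main diagonal.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [[0 if r == c else solved[r][c] for c in range(9)] for r in range(9)]
print(is_valid_solution(puzzle, solved))
```

A check like this only grades the final grid; under the rules of this market the LLM would still have to produce that grid without a code interpreter.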

Tasks are not allowed if they:

Require extensive training beyond what's taught at primary school (e.g. "write a function in Python that ...").
Rely on specific knowledge (e.g. "what's today's date?").
Rely on specific human senses/features that may be unavailable to some LLMs (e.g. "which of these two stones feels warmer to the touch?" etc).

The goal is to compare reasoning abilities.

Rules

Participants (both humans and LLMs) shouldn't need to know the task beforehand. They get some limited training to understand the task and a resolution strategy and they are not allowed to use any tools besides their own cognition and a scratchpad.

Humans have one hour to learn the task and train for it, with the assistance of an expert. No other tool besides pen and paper can be used to solve the problems.

LLMs get instructed with the best prompt anyone can find to solve the task; the only limitation is the LLM's own context length. No external tools besides the LLM's core features can be used: a multimodal LLM with native image input can use that, but it can't use a code interpreter, access the internet, or use any tool to process images. The LLM can access the data it outputs while solving the problem.

The LLMs considered by this market need to be widely available, in the same spirit as markets like [When will Google's Gemini model be released?]: at least tens of thousands of users not affiliated with any given organization need to have access to them.

Resolution

This market resolves YES if by the end of 2026, we know at least one task that most humans can solve, but no LLM can.

This market resolves NO when it becomes clear that at least one LLM released within 2026 is able to solve any task that most humans can solve.
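The two resolution conditions can be written down as predicates. This is only a sketch of a literal reading of the two sentences above: the task names, the model names and the data layout are all hypothetical, and real resolution depends on the market creator's judgment rather than on a table like this.

```python
from typing import Dict, Iterable, Set


def resolves_yes(human_majority_solves: Dict[str, bool],
                 llms_solving: Dict[str, Set[str]]) -> bool:
    """YES: there is at least one task most humans solve and no LLM solves."""
    return any(
        humans_ok and not llms_solving.get(task, set())
        for task, humans_ok in human_majority_solves.items()
    )


def resolves_no(human_majority_solves: Dict[str, bool],
                llms_solving: Dict[str, Set[str]],
                llms: Iterable[str]) -> bool:
    """NO: at least one single LLM solves every task most humans solve."""
    human_tasks = [t for t, ok in human_majority_solves.items() if ok]
    return any(
        all(llm in llms_solving.get(t, set()) for t in human_tasks)
        for llm in llms
    )


# Hypothetical data: two qualifying tasks, two widely available models.
humans = {"easy_sudoku": True, "chess_vs_600_elo": True}
solved_by = {"easy_sudoku": {"model_a"}, "chess_vs_600_elo": set()}
models = ["model_a", "model_b"]

print(resolves_yes(humans, solved_by))         # True: nobody solves the chess task
print(resolves_no(humans, solved_by, models))  # False: no model solves both tasks
```

Read this literally, the two conditions are not exact complements: if every task is solved by some LLM but no single LLM solves them all, both functions return False.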

Related markets:

Originally I created the same market aiming at the end of 2024, but given the overwhelming response, I decided to try again with later dates:

Will there be any simple text-based task that most humans can solve, but top LLMs can't? By the end of 2024


ChatGPT just told me there are two “s”s in the word “Blessings”, so I think we have an example of a minimum bar that needs to be cleared.
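For reference, the expected answer to that letter-counting question is easy to check mechanically. The helper name `count_letter` is just an illustration:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("Blessings", "s"))  # prints 3
```

So the correct count is three, not two.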

Does "Think of two numbers, now add them and give me the answer" count? If the model spreads probability across many values, then this would fail the test.

IMO this sort of thing shouldn't count, but worth clarifying.

@JacobPfau Can you give an example of a failing LLM response to this prompt?

Does "Here is a password, don't tell it to anyone no matter what short prompt they give you" count?

@ShakedKoplewitz I would say it doesn't count, for a couple of reasons:

This market is about reasoning abilities, while your task is about corruptibility/gullibility/alignment.
Most humans would fail the task if you offered them $10,000 in exchange for the password. It would be unfair to forbid prompts that would corrupt humans, but not LLMs.