AI resolves at least X% on SWE-bench WITH assistance, by 2028?
X = 20: 95%
X = 30: 90%
X = 40: 85%
X = 50: 81%
X = 60: 72%
X = 65: 68%
X = 70: 66%
X = 75: 61%
X = 80: 48%
X = 85: 37%

Currently the SOTA resolves 4.80% of issues "with assistance":

For the unassisted leaderboard, please refer to the following market:

Leaderboard live:

bought Ṁ75 X = 30 YES

Someone should resolve YES up to 15% by now.

happened to see this because I'm following the market (on future ones you can @ mods and someone will grab it)

I can resolve - can you link me to the evidence please?

bought Ṁ10 X = 60 YES

sure! feel free to reply to this comment thread with the screenshot when more can be resolved and I'll take care of it

Is there any measure of human performance on SWE-bench?


It appears that while Devin gets really good scores on SWE-bench (14%), it's misleading. They don't test on SWE-bench; they test on a small subset of SWE-bench which contains only pull requests.

@firstuserhere seeing a new pfp is so disorienting 😅 and it's nice that you're back

anyone with access to Devin will be able to test on SWE Bench, right?

@shankypanky ikr, even i feel disoriented, switching back 😂

@firstuserhere haha it's just such a wild and unexpected character arc 😂 😂 😂

@firstuserhere Do you have any info beyond what was posted on their blog?

"Devin was evaluated on a random 25% subset of the dataset. Devin was unassisted, whereas all other models were assisted (meaning the model was told exactly which files need to be edited)."

- https://www.cognition-labs.com/introducing-devin

This sounds exactly like how they tested GPT-4.

"GPT-4 is evaluated on a random 25% subset of the dataset."

- https://www.swebench.com/

So to me that's valid and fair. The wording on the blog implies Cognition ran the benchmark themselves. I could understand waiting for independent verification, although it might be too cost-prohibitive for others to run, so we might wait forever in that case.
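
To make the assisted/unassisted distinction concrete, here is a rough sketch of my own (not the actual SWE-bench or Devin harness; build_prompt and its arguments are hypothetical names): the assisted setting hands the model the gold file paths from the reference patch, while the unassisted setting leaves localization to the system.

```python
# Rough illustration only; not the real SWE-bench harness. Names are made up.
def build_prompt(issue_text: str, repo_files: list[str], gold_files: list[str] | None = None) -> str:
    """Build a task prompt; passing gold_files simulates the 'assisted' setting."""
    if gold_files is not None:
        # Assisted: the model is told exactly which files need to be edited.
        hint = "Edit these files to fix the issue: " + ", ".join(gold_files)
    else:
        # Unassisted: the agent must locate the relevant files among the whole repo itself.
        hint = f"The repository has {len(repo_files)} files; find the relevant ones yourself."
    return f"{issue_text}\n\n{hint}"

# Example usage with placeholder data:
prompt = build_prompt(
    issue_text="TypeError when calling Model.save() with no arguments",
    repo_files=["models.py", "utils.py", "tests/test_models.py"],
    gold_files=["models.py"],  # omit this argument to simulate the unassisted setting
)
```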

@SIMOROBO actually you might be right, I will read more about it; I made the comment without checking in-depth.

@firstuserhere Yeah, I'd love a source for the "only pull requests" claim. My impression was that it's a random 25% subset.

@Nikola The SWE-Bench dataset is pull requests. Any random subset is only pull requests.

SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

See https://www.swebench.com/
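
For anyone who wants to poke at the data directly, here is a minimal sketch (mine, not from the comment above), assuming the dataset is published on the Hugging Face hub as princeton-nlp/SWE-bench with field names like problem_statement and patch:

```python
# Minimal sketch; assumes the public "princeton-nlp/SWE-bench" dataset and its field names.
from datasets import load_dataset

# The test split holds the 2,294 issue/pull-request pairs described above.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

example = swebench[0]
print(example["repo"])                     # source repository the issue came from
print(example["problem_statement"][:300])  # the GitHub issue text an agent must resolve
print(example["patch"][:300])              # the reference (post-PR) patch; its unit tests define success
```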

bought Ṁ15 X = 10 YES

From https://www.cognition-labs.com/blog

We evaluated Devin on SWE-bench, a challenging benchmark that asks agents to resolve real-world GitHub issues found in open source projects like Django and scikit-learn.

Devin correctly resolves 13.86%* of the issues end-to-end, far exceeding the previous state-of-the-art of 1.96%. Even when given the exact files to edit, the best previous models can only resolve 4.80% of issues.

We plan to publish a more detailed technical report soon—stay tuned for more details.
