Will an AI system be able to fully refactor a 10k+ line codebase before 2026?
34% chance


Growing capabilities and longer context lengths in recent AI systems may enable ever more powerful applications for code and IT infrastructure in general.

A full refactoring is a long and intense process that requires a significant amount of skill and knowledge. Good refactoring usually increases the efficiency and readability of a codebase while making further improvements easier.

Refactoring & generation rules

  • To be considered valid, the AI refactoring must show, in one go: good readability, an efficiency gain (if possible), and harmonized syntax and code structure, with no loss of features or deviation from the specification.

  • The system would need to deduce everything related to the code, the configuration files and basically the whole GitHub repo.

  • Pre-generation user feedback is possible but should be 100% optional and should only concern architecture preferences, naming conventions and high-level considerations.

  • Re-running the same input until a valid result is obtained will not count as a success.

Reliability

  • It would need to have a very high average reliability (~95%+) across various common programming languages (Python, Java, C++, C#, etc.) and libraries.
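As a rough illustration of how the ~95% threshold could be checked against a benchmark, here is a minimal sketch. It assumes each benchmark run is recorded as a (language, passed) pair and that "average reliability" means the mean of the per-language pass rates; the question does not specify either, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def reliability_by_language(results):
    """Return the pass rate per language from (language, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)
    return {language: passes[language] / totals[language] for language in totals}


def meets_threshold(results, threshold=0.95):
    """Assumption: the average is taken over languages, not over individual runs."""
    rates = reliability_by_language(results)
    average = sum(rates.values()) / len(rates)
    return average >= threshold, average, rates


# Hypothetical outcomes: 20 refactoring runs per language.
sample = (
    [("Python", True)] * 19 + [("Python", False)]
    + [("Java", True)] * 18 + [("Java", False)] * 2
)
ok, average, rates = meets_threshold(sample)
print(ok, round(average, 3), rates)
```

Averaging per language rather than per run keeps a heavily sampled language from hiding poor results in a rarely sampled one, but either reading seems compatible with the wording above.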

Allowed human interactions

  • Interactions that need administrator privileges and are directly requested by the system, for example for package installation or similar (feedback is possible for this).

Additional

  • There is one attempt for the final code generation, but internally the system can run as many iterative test loops as needed and use as many external tools as needed.
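The distinction between internal iterations and the single final attempt can be sketched as follows. This is only a toy model under stated assumptions: `propose` and `run_tests` are hypothetical callables standing in for the AI system and its test tooling, and `max_iterations` is added purely so the loop terminates (the rule above sets no limit).

```python
def refactor_once(source, propose, run_tests, max_iterations=10):
    """Iterate internally, then hand back exactly one final candidate."""
    candidate, feedback = source, None
    passed, iteration = False, 0
    for iteration in range(1, max_iterations + 1):
        candidate = propose(candidate, feedback)   # internal revision step
        passed, feedback = run_tests(candidate)    # internal test-loop step
        if passed:
            break
    # Only this single result is shown to the user; it is never re-run.
    return candidate, passed, iteration


def toy_propose(code, feedback):
    """Hypothetical stand-in for the model: drop one trailing semicolon."""
    return code[:-1] if code.endswith(";") else code


def toy_run_tests(code):
    """Hypothetical stand-in for a test suite: fail while a semicolon remains."""
    if ";" in code:
        return False, "stray semicolon"
    return True, None


final_code, passed, iterations = refactor_once("x = 1;;", toy_propose, toy_run_tests)
print(final_code, passed, iterations)
```

The user sees one output regardless of how many internal passes were needed, which is what separates this from the disallowed case of the user re-running the same input.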

For resolution

I would prefer not to rely on a single source (including me) for the resolution, which is why I would prefer to use public benchmarks (which of course don't exist yet...). If none are available, I will go with online forum consensus.


@Guillaume Would it be valid if the codebase started out at over 10k lines of code but ended up significantly smaller, with all of the other stipulations met?

In my view, it hinges on how many more generations of AI systems we will get. Assuming there's a GPT-6 or equivalent by 2026, it should resolve as yes. That said, two more generations in 1.5 years would require a further acceleration in the pace of progress, which is what I'm actually betting on.

bought Ṁ40 YES

I bought YES as a hedge: at least if I'm unemployed, I'll have made some mana.

bought Ṁ100 NO

Buying up NO as the conditions specified by the author seem highly unrealistic at this point.

bought Ṁ5 YES

@nsokolsky What's an example of one such condition? To me, all of them seem likely before 2025, never mind 2026.

@12c498e “95% average reliability” is one of them. I use GPT-4 daily and it’s maybe 80% accurate on the average task, much less so for ambiguous and abstract queries. What OP describes is such an advanced system that it would effectively result in 90% of software engineering jobs getting eliminated overnight.

  1. How would this be tested? Will any example of a refactor be sufficient (in which case I’m sure an example can be contrived already for Gemini Ultra)? Or will you be picking random GitHub repos with 10k lines and asking for a refactor?

  2. Does the code have to compile and run without any human intervention? Or will human intervention be acceptable - and if so, how many lines can humans change for this to count as YES?

  3. Does “one go” mean there’s only 1 attempt in total with no feedback? Does this also mean re-runs of the same input until a valid result is obtained are not acceptable?

  4. If the AI system runs the code on its own and keeps on doing refactoring until it compiles (latest GPT-4 can do this for Python), does this count as “one shot”?

  1. Yes, of course a single lucky refactor would not suffice. It would need to have a very high average reliability (~95%+) across various common programming languages (Python, Java, C++, C#, etc.) and libraries. I would prefer not to rely on a single source (including me) for the resolution, which is why I would prefer to use public benchmarks (which of course don't exist yet...). If none are available, I will go with online forum consensus.

  2. There will be no tolerance for human modification of the output code: the system would need to deduce everything related to the code, the configuration files and basically the whole GitHub repo (you can see this as a full repo generation). The only human interventions allowed are interactions that need administrator privileges and are directly requested by the system, for example for package installation or similar (feedback is possible for this).

  3. Pre-generation user feedback is possible but should be 100% optional and should only concern architecture preferences, naming conventions and high-level considerations. Re-running the same input until a valid result is obtained will not count as a success.

  4. There is one attempt for the final code generation, but internally the system can run as many iterative test loops as needed and use as many external tools as needed.

@nsokolsky I professionally use Aider for much of what the question states as criteria. I'm not sure the current version really does all of it, because I haven't had such a use case, but Aider + Tree-sitter + Git does come close to it. I recommend you have a look: https://github.com/paul-gauthier/aider?tab=readme-ov-file#example-chat-transcripts It's a nice tool!

@Magnus_ It seems like a piecemeal tool. I'm pretty sure it would fail for any reasonably big project, given that OP requested a 95% success rate at one-shot performance.

bought Ṁ25 YES

Can't Gemini Ultra already do this?

@Pykess I don't know if you can connect tools like Aider to Gemini, but GPT-4 does quite a good job when it has access to Tree-sitter data for your repository.
