Will AI agents be able to regularly code small features for us in a year?
Jul 2
81% chance

I'm thinking of something like https://mentat.ai/, but that actually works.

I will provide a paragraph or so describing the change I want made. Then it should create a GitHub PR, which I will review and leave only a few comments before merging. The whole process should take less than 30 minutes. This should work fairly reliably.

I tried this yesterday and it failed haha:
https://github.com/manifoldmarkets/manifold/pull/2694

See more discussion in my post:

https://jamesgrugett.com/p/software-automation-will-make-us


Crazy how AI agents are building small features for me almost daily and this market is still at 80%.

I genuinely believe AI agents will be coding small features regularly within a year; some already are. I’ve used a few tools to generate basic frontend code and automate boring backend stuff, and they're only getting better. Especially in niche areas like fintech white-label software (check out https://kindgeek.com/white_label), where standardized components are key, AI can save tons of time. As humans, we tend to overcomplicate things, but AI thrives on patterns and repetition. I still double-check everything it writes, but it’s a massive time-saver.

I'd like to conduct some tests using codebuff/cursor. What are acceptable small features in your mind? I have a couple ideas:
- add a button to the comments bottom row that allows users to tip the commenter. Denormalize the tip amount onto the comment and display the total tipped amount on the button.
- Add a delete button for admins/mods that marks a comment as deleted (don't actually delete the comment, just set both the deleted and hidden flags) so that the comment is hidden completely from the market.
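
To make the scope of these two ideas concrete, here is a minimal, language-neutral sketch in Python. It is not Manifold's actual schema or code: the `Comment` record, its field names (`deleted`, `hidden`, `total_tipped`), and the three helper functions are all hypothetical, chosen only to illustrate "denormalize the tip total onto the comment" and "set flags instead of deleting".

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical comment record; field names are illustrative only."""
    id: str
    author_id: str
    text: str
    deleted: bool = False
    hidden: bool = False
    total_tipped: int = 0  # denormalized sum of all tips on this comment


def tip_comment(comment: Comment, amount: int) -> int:
    """Add a tip to the denormalized total and return the new total."""
    if amount <= 0:
        raise ValueError("tip amount must be positive")
    comment.total_tipped += amount
    return comment.total_tipped


def soft_delete_comment(comment: Comment, actor_is_mod: bool) -> None:
    """Set both flags on the comment; the record itself is kept."""
    if not actor_is_mod:
        raise PermissionError("only admins/mods may delete comments")
    comment.deleted = True
    comment.hidden = True


def visible_comments(comments: list[Comment]) -> list[Comment]:
    """Return only the comments that are neither deleted nor hidden."""
    return [c for c in comments if not (c.deleted or c.hidden)]
```

Both features touch a data field, a permission check, and a UI filter, which is roughly the size of change the market description seems to have in mind.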

@JamesGrugett said the delete comment button for spam fit the bill, I'll try using codebuff to do this soon

@ian a "view results" button on polls?

@cthor Also seems reasonable!

@ian I am aware that you work on Manifold, but since you are also the largest YES holder, can we maybe agree to let @JamesGrugett do these kinds of evaluations once the time comes?

@CalibratedNeutral That sounds reasonable, although he doesn't work at Manifold anymore, so I'm not sure if he'll want to put 30 mins in to do this. I was going to film my attempt from scratch.

@CalibratedNeutral I was not aware of that. Then maybe a third party (another developer working on Manifold)? The stakes are reasonably high for me, so I really would strongly prefer to have everything as unbiased as possible.

@CalibratedNeutral We might be able to get @SG or @SirSalty to do it

@CalibratedNeutral Alternatively, @JamesGrugett could test this question on his new startup, codebuff. He uses codebuff to help develop codebuff

Also:

I didn't conduct this test perfectly but Codebuff just built the new market drafts feature with something like 4-5 prompts and a few of my own edits in ~45 minutes.

bought Ṁ50 YES

I would say that Devin already works. It’s not perfect but should fit criteria

I'm surprised this market hasn't gone up at all in response to the o1 releases.

I bought yes because I've seen GitHub's Copilot Workspace already do promisingly well in my brief tests. By mid-2025, I can definitely see it being good enough to do real work on some codebases (especially if you have a good test suite).

opened a Ṁ20,000 YES at 60% order

if James doesn’t get accepted into AI grants, then there will be something better as an alternative, otherwise manicode will be coding features for us in a year

Will you @JamesGrugett provide additional repo-level, AI-specific documentation as you describe in https://manifold.markets/JamesGrugett/will-manicode-be-accepted-into-ai-g ?

From a reading of the question description text, I'd say that shouldn't be allowed: description mentions mentat.ai and "provide a paragraph or so"--both of which suggest no such AI-specific handholding.

Will Manicode be accepted into AI Grant batch 4?
24% chance.

https://aigrant.com/

Here is my application (selected questions only):

Provide a short summary of your product

Better code generation than Cursor

Describe your product in more detail

Run manicode in your terminal. Ask it to do any coding task. It will make changes to your files.

...and it will do a really good job. Why?

It has full access to read and write to your files, run terminal commands, and scrape the web

It can: grab files it needs for context, edit multiple files at once (no copy-pasting), run the type checker, run tests, install dependencies, and search for documentation. These abilities are key to doing a good job and will only become more powerful as LLMs continue to level up.

It uses so-called "knowledge" files

LLMs perform so much better with extra context! With Manicode, we've come up with this idea to check in knowledge.md files in any directory, and write down extra bits of context, like which 3 files you need to edit in order to create a new endpoint. Or which patterns are being deprecated and which should be used. Or which directories can import from other directories. Every codebase has lots of implicit knowledge like this that you have to impart to your engineers. Once written down, it makes Claude really fly! It's truly a night and day difference.

It's synchronous, and you can give feedback

You're chatting with it. It takes ~30 seconds to get back to you and then you can tell it what you want to do next or what it did wrong. This keeps Manicode on track and aligned.

It learns

The flow of using Manicode is:
- Ask it to do something
- If it fails, point out its error
- Manicode fixes the error and automatically writes down how it can improve for next time in a knowledge file
- You push the commit, and now Manicode has become even more capable when the next engineer runs it in the codebase.

This is the magic loop that will make Manicode productive for experienced engineers in giant codebases.

We're unafraid to spend for better results

We can use as many parallel API calls with as much context as we can to produce the best code, because we know that the alternative is human labor, which is much more expensive.

We're targeting the largest market for software engineers

- It's a tool for daily use by experts (not just junior engineers)
- It's for software maintainers (not just people starting new projects)

We're starting with a console application, because it's simple and has great distribution

Every developer knows how to install new packages with npm or pip. Most developers already have the terminal accessible: as a pane in your vscode window, for example.

The timing is right

Claude Sonnet 3.5 passed some bar of coding competence, and the form factor of a fully capable agent that can change any file works now, whereas before you could only reliably edit one function at a time.

There is a moat after all

Handling every tech stack well, knowing when to run their tests and type check, integrating with git, linear, slack, and email, supporting database migrations, etc, etc, etc. You can build hundreds or thousands of special-case bits of prompt magic to improve things so that it always just magically works the first time. A startup arriving at this 6 months late wouldn't catch up.

Try it out!

> npm install -g manicode
> manicode

Intro video: https://www.youtube.com/embed/ZzT4HIhnzio
Demo video: https://www.loom.com/share/2067e3ad5fdf4565905f6aeb8f13b215?sid=de0e9ad8-447a-485a-bcb3-71b8a5a43665

Addendum

I submitted this last night. The few things I forgot to include:
- The prototype is communicating with my server over websockets and so is significantly more complex than running a local script. It is already set up to work on any project immediately.
- I intend to charge $100 per month per user to get off the free plan (and some usage based fees after that if you use it a huge amount).
- Giving manicode full access to your files and terminal where it can run stuff without confirmation from the user sounds scary, but is actually not risky in reality, especially if you have version control. This quality of doing something that normal people think "goes too far" or seems unsafe is correlated with good startup ideas, because it means fewer people are likely to have thought of it. (E.g. For Airbnb: You let random strangers sleep in your house? Or Manifold: You let anyone ask and judge the resolution of their own question?)

They said they will let us know if we won by September 20th at the latest.

Hi, great question!

When I created this market, I didn't imagine I would be building my own AI agent for coding.

Regarding human-created context on the codebase, I do think that should be allowed! Adding a bit of documentation seems like fair game. If, however, the context were specifying in detail how to make the coding changes for the specific feature, that would seem unfair.

Also, I think a little bit of back-and-forth with the AI should be allowed, since I did specify you could leave some comments, and that it should take under 30 minutes.

I think manicode does not yet qualify, since I'm not sure it would work 90% of the time, without manual intervention or extended back-and-forth.

Thanks for clarifying.

To be frank, the fact that you are literally designing your own AI presumably optimized for Manifold Github functionality wildly changes the odds on this question. Obviously can't know what projects will spin up over the course of the year (so fair play), but the phrasing of this question came off to me as pointing at 3rd party, general AI agents rather than Manifold-bespoke AI agents.

I understand. I will try to raise the bar of expectations if it feels like manicode is especially good at the manifold codebase compared to others. I don't really think this will be the case though.

While it is not coding, AI code review could be helpful. Take for example https://coderabbit.ai. It does a pretty nice summary as well as code review. They are also free for open source so you could try them out.

Here is an example that shows how it could be useful: https://github.com/jsonresume/jsonresume.org/pull/131#issuecomment-2236198926

I have this at ~30%. Anyone want to explain their reasoning? 90% success is a very high bar. Compare it to SWE-bench, which includes test cases (presumably James doesn’t always prespecify these), and yet current SOTA there is only 20%.

Does "fairly reliably" roughly mean 75% success, 90%, 98%, ...?

90%!
