Right now, AutoGPT and BabyAGI don't really work. If you give them a complex task they go off the rails and get stuck in loops, so most people prefer to use the models directly rather than an agent scaffold.
Will there be advancements in scaffolding this year, or new models such as GPT-5, that solve these problems and allow agents to mostly work? I will try the agents myself, and factor in whether it is common for people to use LLM agent frameworks for productivity assistant tasks.
https://theaidigest.org/agent is a demo of an AI agent you can give tasks to, right inside a webpage. Useful for getting a sense of current capabilities!
Devin and SWE-agent are apparently doing better than RAG on coding tasks. I haven’t looked at how they work yet, but they made me update higher on this question.
The type of task I had in mind when writing this was something like “write me a summary of the current state of this industry or field of study.” The agent would figure out the best workflow, do web searches, read pages, do more web searches based on what it found, and ultimately complete the task with a better result than a single-pass web search and summarization.
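As a rough illustration of that workflow (not how AutoGPT, BabyAGI, or any real framework is implemented), a minimal Python sketch of the search, read, and search-again loop might look like this. Every function name here is hypothetical, and the search, page-reading, and LLM steps are stubbed with fake data so the control flow can run on its own. The round cap is an assumed guard against the "stuck in loops" failure mode.

```python
from typing import Callable, List


def research(
    task: str,
    search: Callable[[str], List[str]],                  # query -> list of URLs (hypothetical)
    read: Callable[[str], str],                          # URL -> page text (hypothetical)
    propose_queries: Callable[[str, List[str]], List[str]],  # LLM step (hypothetical)
    summarize: Callable[[str, List[str]], str],          # LLM step (hypothetical)
    max_rounds: int = 3,
) -> str:
    """Iteratively search, read, and follow up, then summarize the notes."""
    notes: List[str] = []
    seen = set()
    queries = [task]  # the first round just searches for the task itself
    for _ in range(max_rounds):  # hard cap so the loop always terminates
        new_urls = []
        for query in queries:
            for url in search(query):
                if url not in seen:  # never re-read a page
                    seen.add(url)
                    new_urls.append(url)
        if not new_urls:
            break  # nothing new was found, so stop early
        notes.extend(read(url) for url in new_urls)
        # Ask the model which follow-up searches the notes suggest.
        queries = propose_queries(task, notes)
        if not queries:
            break  # the model considers the notes sufficient
    return summarize(task, notes)


# Stubs standing in for a real search API and a real LLM (all made up).
FAKE_WEB = {
    "solar industry": ["a.example", "b.example"],
    "solar storage costs": ["b.example", "c.example"],
}
FAKE_PAGES = {
    "a.example": "Panel prices keep falling.",
    "b.example": "Storage is the main bottleneck.",
    "c.example": "Battery costs are dropping too.",
}


def fake_search(query: str) -> List[str]:
    return FAKE_WEB.get(query, [])


def fake_read(url: str) -> str:
    return FAKE_PAGES[url]


def fake_propose(task: str, notes: List[str]) -> List[str]:
    # Pretend the model asks one follow-up until it has three notes.
    return ["solar storage costs"] if len(notes) < 3 else []


def fake_summarize(task: str, notes: List[str]) -> str:
    return task + ": " + " | ".join(notes)


if __name__ == "__main__":
    print(research("solar industry", fake_search, fake_read,
                   fake_propose, fake_summarize))
```

The single-pass baseline it would be compared against is the same code with `max_rounds=1`: one search, one read, one summary.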
Another way of framing the criteria is “Will AI agents be worth using instead of using the LLM directly for a significant portion of tasks?”
@RemNi my whole, ongoing experience with agent frameworks is that they simply go nuts, and spending the same amount of time actually coding a system is far easier.
I assume so, but just to make sure: does this resolve YES if new, better agent frameworks get developed on top of LLMs (as opposed to AutoGPT and BabyAGI being improved)?
Also, now that I'm really thinking about this, how do you intend to resolve the question?
@inaimathi Yes, new agent frameworks that operate similarly will count.
I'm not sure I understand what you mean by how I intend to resolve it. It's fundamentally a subjective assessment of the usefulness and reliability of the agents, but let me know if you have specific questions about what would count.
@ahalekelly Yeah; that's what I was getting at (I wanted to know if you had specific metrics/test processes in mind or if it was going to be a subjective resolution).