Current AI agents (circa Jan 2024) are quite bad at clicking, reading screenshots, and interpreting the layout of webpages and GUIs. This is expected to change in the near future, with AI capable enough to navigate an arbitrary GUI about as well as a human.
Example of an early system of this type: https://github.com/OthersideAI/self-operating-computer/tree/main?tab=readme-ov-file#demo
Resolution criteria (provisional):
This question resolves YES if, the day after 2024 ends, I can direct an AI agent to resolve this market as YES using only voice commands while blindfolded. It resolves NO if this takes over 30 minutes.
Update:
There are no restrictions on whether the AI agent is free, open source, proprietary, local, remote, etc.
Update:
If someone else on Manifold can demonstrate an AI agent resolving a Manifold market as YES (while following the same restrictions that I would have followed), then I'll resolve this one as YES too. This is in case I'm not able to get access to the AI agent myself for testing.
Update:
The agent will need to be able to open a web browser and log in to Manifold on its own.
https://docs.anthropic.com/en/docs/build-with-claude/computer-use
Let's go!
I wonder if it works well enough
@singer After testing it for a bit, I think it's somewhat likely that this market will resolve YES. The key thing is that for this question I can direct the AI while it's working (vocally, while blindfolded).
"OpenAI Shifts AI Battleground to Software That Operates Devices, Automates Tasks" [spamwalled; don't bother clicking]
OpenAI is developing a form of agent software to automate complex tasks by effectively taking over a customer’s device. The customer could then ask the ChatGPT agent to transfer data from a document to a spreadsheet for analysis, for instance, or to automatically fill out expense reports and enter them in accounting software. Those kinds of requests would trigger the agent to perform the clicks, cursor movements, text typing and other actions humans take as they work with different apps, according to a person with knowledge of the effort.
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (jykoh.com)
GPT-4V has a success rate of only 16.37% on web tasks, whereas human-level performance is 88.70%. Not sure whether resolving this market is one of the easier tasks, but it seems we have a way to go before AI achieves human-level web browsing.
I think this can already be done by hooking up an LLM to the macOS accessibility API.
I've also seen set-of-mark prompting used for this: annotate the screenshot with numbered marks, parse them into a list of options, let the LLM choose an option, then click the coordinates of the chosen mark.
Might be doable with open-interpreter even: https://github.com/KillianLucas/open-interpreter/
Maybe I'll see if I can get it working then buy all the YES.
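A rough sketch of the set-of-mark loop described in this comment, under stated assumptions: the `Mark` structure, the prompt wording, and all function names are hypothetical, and `fake_model` is a stand-in for a real LLM call. The screenshot annotation step and the actual mouse click are left out; the sketch only shows how a chosen mark number maps back to click coordinates.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Mark:
    """One numbered overlay drawn on a screenshot (hypothetical structure)."""
    mark_id: int
    label: str
    box: Tuple[int, int, int, int]  # left, top, right, bottom, in pixels


def build_prompt(goal: str, marks: List[Mark]) -> str:
    """List the marked elements as numbered options for the model."""
    lines = [f"Goal: {goal}", "Choose one mark by number:"]
    lines += [f"[{m.mark_id}] {m.label}" for m in marks]
    return "\n".join(lines)


def parse_choice(reply: str, marks: List[Mark]) -> Mark:
    """Take the first integer in the reply and look up the matching mark."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no mark number in reply: {reply!r}")
    chosen = int(match.group())
    for mark in marks:
        if mark.mark_id == chosen:
            return mark
    raise ValueError(f"unknown mark number: {chosen}")


def click_point(mark: Mark) -> Tuple[int, int]:
    """Return the centre of the mark's bounding box as the click target."""
    left, top, right, bottom = mark.box
    return ((left + right) // 2, (top + bottom) // 2)


def choose_and_locate(
    goal: str, marks: List[Mark], ask_model: Callable[[str], str]
) -> Tuple[int, int]:
    """One step of the loop: prompt, parse the choice, compute coordinates."""
    reply = ask_model(build_prompt(goal, marks))
    return click_point(parse_choice(reply, marks))


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: picks the first option whose label contains YES."""
    for line in prompt.splitlines():
        if line.startswith("[") and "YES" in line:
            return line.split("]")[0].lstrip("[")
    return "none"


if __name__ == "__main__":
    marks = [
        Mark(1, "Resolve NO", (100, 200, 180, 240)),
        Mark(2, "Resolve YES", (200, 200, 280, 240)),
    ]
    print(choose_and_locate("Resolve this market as YES", marks, fake_model))
```

In a real agent, `ask_model` would send the annotated screenshot along with the prompt, and the returned coordinates would be passed to whatever GUI automation layer performs the click.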
@ErikBjareholt while I expect the tech to be available soon, I'm very skeptical that any system can achieve the criteria at this exact moment. I'd love for you to prove me wrong.
@singer You might want to take a look at:
- https://github.com/ddupont808/GPT-4V-Act
- https://github.com/reworkd/tarsier
I'm likely going to be implementing a similar system soon (first half of 2024), so unless someone beats me to it, I'll have a go at it then.
@singer Will you be buying a Rabbit R1? They claim it can do this, and if not, that you can easily teach it to.
If not, you might want to make the criteria more precise, for example by requiring that it be done with free software on a computer.
@SIMOROBO Good point. Devices/services like the Rabbit R1 and the AI Pin would be eligible, and so should all premium ChatGPT-like services. Even if I don't own it, as long as someone can demonstrate it having the capability in the criteria, I'll resolve this as YES.
(I'm not planning to get an R1 but if it can really do this I'll be considering it)