A duplicate of /ScottAlexander/in-2028-will-gary-marcus-still-be-a, with the ban on "bizarre hacking like tricks" removed and clearer resolution criteria.
This market resolves based on the behavior of all leading chatbots at the beginning of 2028. (Only ones that can actually be tested.)
Resolves YES if people can find three extremely obvious questions that an average human teenager could certainly answer, and that any leading chatbot still fails at least half the time when asked.
Only the LLM portion of the chatbot is being tested here. Image-recognition and generation capabilities are not.
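One way to read the resolution rule above, as a rough sketch rather than the official procedure: ask each candidate question several times, record pass/fail, and count the questions that some leading chatbot fails at least half the time. The function names, the data layout, and the choice to count a question when any single chatbot fails it are all assumptions made for illustration; the market text does not specify them.

```python
from typing import Dict, List


def fails_often(trials: List[bool], threshold: float = 0.5) -> bool:
    """Return True if the failure rate over repeated asks is >= threshold.

    `trials` holds one bool per ask: True = answered correctly.
    An empty list counts as "not tested", so it never qualifies.
    """
    if not trials:
        return False
    failures = sum(1 for passed in trials if not passed)
    return failures / len(trials) >= threshold


def resolves_yes(results: Dict[str, Dict[str, List[bool]]], needed: int = 3) -> bool:
    """results[chatbot][question] -> list of pass/fail outcomes.

    Assumption: a question counts if at least one leading chatbot
    fails it at least half the time; `needed` such questions => YES.
    """
    failing_questions = set()
    for per_question in results.values():
        for question, trials in per_question.items():
            if fails_often(trials):
                failing_questions.add(question)
    return len(failing_questions) >= needed


# Hypothetical data: two chatbots, three candidate questions.
example = {
    "bot_a": {"q1": [False, False, True, True], "q2": [False, False, False], "q3": [True, True, True]},
    "bot_b": {"q3": [False, True]},
}
print(resolves_yes(example))
```

Under a stricter reading, where one chatbot must fail all three questions, the set would be built per chatbot instead of across all of them.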
Will the LLM be allowed to output chain of thought? With "answer and nothing else" type responses it fails on very basic stuff, and likely will for some time.
However, if it's allowed to do chain of thought (provide step-by-step thinking), its reasoning skills 10x.
As I mentioned in the other market, the Magikarp token parsing bug is well understood and is orthogonal to LLM reasoning capabilities. I don't think whether or how the tokenizer is improved will make a huge impact, except for a niche class of prompts.