The GAIA benchmark (https://arxiv.org/abs/2311.12983) aims to test for the next level of capability for AI agents.
Quoting from the paper: "GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins."
This market will resolve based on when an AI system performs as well as or better than humans on all 3 of the different levels of the benchmark. I'll use the numbers from Table 4 in the paper: 93.9% on level 1, 91.8% on level 2, and 87.3% on level 3.
(I'm using the conjunction of all 3 levels rather than the average to be somewhat conservative about this level being achieved.)
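To illustrate why the conjunction is the more conservative criterion, here is a minimal Python sketch. The thresholds are the Table 4 numbers quoted above; the function names, the example scores, and the use of an unweighted mean for the "average" variant are assumptions made purely for illustration, not part of the resolution criteria.

```python
# Human baselines per GAIA level (percent), as quoted from Table 4 above.
HUMAN_BASELINE = {1: 93.9, 2: 91.8, 3: 87.3}


def meets_all_levels(scores):
    """Conjunction: every level must match or beat its human baseline."""
    return all(
        scores.get(level, 0.0) >= baseline
        for level, baseline in HUMAN_BASELINE.items()
    )


def meets_on_average(scores):
    """Hypothetical alternative: compare unweighted means across the 3 levels."""
    levels = list(HUMAN_BASELINE)
    mean_score = sum(scores.get(level, 0.0) for level in levels) / len(levels)
    mean_baseline = sum(HUMAN_BASELINE.values()) / len(levels)
    return mean_score >= mean_baseline


# Made-up submission: strong on levels 1 and 2, short of the baseline on level 3.
example = {1: 99.0, 2: 95.0, 3: 80.0}
print(meets_all_levels(example))   # conjunction fails because of level 3
print(meets_on_average(example))   # an average-based rule would pass
```

A system like the made-up one above would clear an average-based bar while still falling short on the hardest level, which is the case the conjunction is meant to rule out.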
If a given submission was likely trained on the test set (based on my judgement), I won't consider this valid.
This market resolves based on the date of publication/submission of a credible document or leaderboard entry which indicates that the corresponding performance on GAIA was reached. (Not the date at which the system was originally created.)
Each date will resolve YES if this publication/submission takes place before that date (UTC). Otherwise NO.
(I may add more date options later for finer-grained resolution.)
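As a rough sketch of the date rule, the snippet below maps each option date to YES or NO given a publication/submission timestamp. It assumes "before that date (UTC)" means strictly before 00:00 UTC on the option date; that reading, the function name, and the example dates are all hypothetical and only meant to illustrate the mechanics.

```python
from datetime import datetime, timezone


def resolve_options(published_at, option_dates):
    """Return {option_date: "YES" | "NO"} for one publication timestamp.

    Both arguments must be timezone-aware datetimes. An option resolves YES
    only if the publication happened strictly before it (assumed reading).
    """
    published_utc = published_at.astimezone(timezone.utc)
    return {
        option: "YES" if published_utc < option else "NO"
        for option in option_dates
    }


# Hypothetical option dates and a hypothetical publication timestamp.
options = [
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    datetime(2026, 1, 1, tzinfo=timezone.utc),
]
published = datetime(2025, 6, 15, 12, 0, tzinfo=timezone.utc)

for option, outcome in resolve_options(published, options).items():
    print(option.date(), outcome)
```

Note that the timestamp used is the publication/submission date of the document or leaderboard entry, not the date the system was created.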
@UnspecifiedPerson The questions look pretty easy to me when given internet access. (Maybe the main issue would be failing to find some piece of information you need via Google? But this is pretty unlikely.)
Separately, how do you think the human scores in the paper are so high if most people would fail?
(I think that literally "most people would do poorly" might be true, but I expect that most of the top 25% of college students in the US would do great.)
@RyanGreenblatt Some questions from the easiest level of GAIA:
"In the fictional language of Tizin, basic sentences are arranged with the Verb first, followed by the direct object, followed by the subject of the sentence. I want to express my love for apples to my Tizin friend. The word that indicates oneself is "Pa" is the nominative form, "Mato" is the accusative form, and "Sing" is the genitive form. The root verb that indicates an intense like for something is "Maktay". When it is used in the present, it is used in it's root form, when it is used in the preterit past, it is "Tay", and when it is used in the imperfect past, it is "Aktay". It is used differently than in English, and is better translated as "is pleasing to", meaning that the thing doing the liking is actually the object of the sentence rather than the subject. The word for apples is borrowed from English in Tizin, and so it is "Apple" is the nominative form, "Zapple" is the accusative form, and "Izapple" is the genitive form. Please translate "I like apples" to Tizin."
"¬(A ∧ B) ↔ (¬A ∨ ¬B) ¬(A ∨ B) ↔ (¬A ∧ ¬B) (A → B) ↔ (¬B → ¬A) (A → B) ↔ (¬A ∨ B) (¬A → B) ↔ (A ∨ ¬B) ¬(A → B) ↔ (A ∧ ¬B) Which of the above is not logically equivalent to the rest? Provide the full statement that doesn't fit."
"Given this table defining \* on the set S = {a, b, c, d, e}
|\*|a|b|c|d|e|
|---|---|---|---|---|---|
|a|a|b|c|b|d|
|b|b|c|a|e|c|
|c|c|a|b|b|a|
|d|b|e|b|e|d|
|e|d|b|a|d|c|
provide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order."
"What is the final numeric output from the attached Python code?"
"You are Van Helsing, a renowned vampire hunter. A Count of Moldova, Lațcu IV, son of Costea, has tasked you with investigating the village of Șirnea in neighboring Wallachia. The Count's advisors have reported that a vampire was spotted crossing the border near the village, and would like you to investigate it. You travel to the village of Șirnea, and you begin your investigation. One night, just before dawn, you catch a glimpse of a man in a long black cape with red lining leaping from roof-top to roof-top with superhuman agility. It's a vampire! You try to chase the creature back to its home, but the creature is too fast. However, because of the remoteness of the village, you know with absolute certainty that the vampire must be a resident of the village. You decide that your best course of action will be to visit all 100 residents of the town during the day. You know something about vampires and humans that will make your investigation possible; humans always tell the truth, but vampires always lie. In the afternoon, you go from house to house, speaking with all 100 residents of Șirnea. You ask everyone the same question: "How many vampires are living in Șirnea". Everyone in the village gives the same response, "At least one of us is a human." How many residents of Șirnea have been turned into vampires?"
"A standard Rubik’s cube has been broken into cubes making up its sides. The cubes are jumbled, and one is removed. There are 6 cubes with one colored face, 12 edge cubes with two colored faces, and 8 corner cubes with three colored faces. All blue cubes have been found. All cubes directly left, right, above, and below the orange center cube have been found, along with the center cube. The green corners have all been found, along with all green that borders yellow. For all orange cubes found, the opposite face’s cubes have been found. The removed cube has two colors on its faces. What are they? Answer using a comma separated list, with the colors ordered alphabetically."
Again, all of these are from the easiest level. They "should be breakable by very good LLMs", as opposed to the questions which would "indicate a strong jump in model capabilities".
I do not know how the human scores in the paper are so high, even taking into account that quite a lot of them have PhDs, and that (if I'm understanding the paper correctly, which I'm not certain of) they were allowed to work in teams to answer the questions.
@UnspecifiedPerson Huh, these questions seem much harder than the examples I looked at in the body of the paper. I wonder if the benchmark is somewhat gamed. It would be good to see what a held-out human sample looks like.