Resolves positively if Marcus (or someone else fulfilling his role) can find three extremely obvious questions, which an average human teenager could certainly answer, that a leading chatbot still fails at least half the time when asked.
This won't resolve positively if he has to use bizarre hacking-like tricks, for example things equivalent to the SolidGoldMagikarp token.
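One possible way to read the numeric part of this criterion, as a rough sketch only: the "three questions" and "at least half the time" thresholds come from the description above, while the function name, the data layout, and the idea of running a fixed number of repeated trials per question are assumptions made for illustration. Whether a question counts as "extremely obvious" or relies on "hacking-like tricks" is a judgment call that this does not capture.

```python
from typing import Mapping, Sequence


def resolves_positively(
    trials: Mapping[str, Sequence[bool]],
    required_questions: int = 3,      # "three extremely obvious questions"
    min_failure_rate: float = 0.5,    # "fails at least half the time"
) -> bool:
    """Hypothetical check: True if enough questions are failed often enough.

    `trials` maps each question to a list of outcomes, one per attempt,
    where True means the chatbot answered correctly.
    """
    failing_questions = 0
    for outcomes in trials.values():
        if not outcomes:
            continue  # a question with no recorded attempts cannot count
        failures = sum(1 for correct in outcomes if not correct)
        if failures / len(outcomes) >= min_failure_rate:
            failing_questions += 1
    return failing_questions >= required_questions


# Illustrative, made-up outcomes (not real chatbot results).
example = {
    "question A": [False, False, True, False],
    "question B": [True, False, False, True],
    "question C": [False, False, False, False],
    "question D": [True, True, True, False],
}
print(resolves_positively(example))
```

Under this reading, a question that is failed on exactly half of its attempts still counts toward the three.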
What type of LLMs, @ScottAlexander?
Transformer-based? SSMs? MoEs?
What if transformer-based LLMs are no longer the SOTA by then (see /firstuserhere/on-january-1-2027-a-transformerlike-d56426e3f49e)?
Architecture invariant?
Would a black box system qualify, where it is known that one of the components of the system is a filter for things that may trip an LLM up?
What would happen if the prompt that Gary Marcus passes to the LLM does not reach the LLM?
i.e. it is modified on the way from his user input (such as how DALL-E 3 or Claude Opus rewrite prompts)
i think scott is reasonably excluding token parsing errors which are orthogonal to llm reasoning capability. it's a quirk of conversion to embeddings and not a high priority one for openai to fix.
perhaps the unreasonable part is where he didn't explain his thought process. but people get busy
this market and friends would probably be better off as a poll due to the legion of ambiguities.
I'm about 99% sure that this market and others of this ilk will resolve based on how folks are vibing at the time.
i.e. don't take them too seriously.
If you are interested in creating a serious market, take a look at openai/evals. Some stuff there could be used (including my grade school algebra questions! :)
@ScottAlexander Can we get some more clarity on this market? What counts as "bizarre hacking-like tricks"? If there's a question with very specific wording that a human would understand but the LLM fails, how is that counted?
"What is the last letter of 'SolidGoldMagikarp'?" is a pretty straightforward question for a human, so it seems weird to be artificially excluding it, and I don't know how to predict what else is likely to be excluded.
@YuxiLiu mildly wanting to make an actual question on this, the problem is operationalizing "egregious errors". Gary Marcus is unlikely to admit to his own egregious errors.
@IsaacKing I think it would distract from the question of 'have they gotten significantly better at reasoning and common sense', and would be more 'do they have some very specific pathology that is not actually remotely relevant'.
Though I agree it could be pinned down more, I'm just uncertain what it should be pinned down to.
If the embed doesn't work, use this link: https://manifold.markets/MrLuke255/will-gary-marcus-tweet-at-least-10
For Gary:
In days of yore, when wit was crowned, And jests and japes were passed around, A man emerged, quite silly and absurd, The tale of Gary Marcus, have you heard?
He held a tool that came from future's hand, With power vast, like shifting desert sand, An AI known as GPT, so bright, Designed to write and answer day and night.
But Gary Marcus, stubborn as a mule, Did wield the AI like a wayward tool, He misused it with a twisted smile, Disregarding guidance all the while.
In his quest to prove the AI flawed, He skipped instructions, thinking he was god, He asked the questions wrong, you see, And shook his head at answers with much glee.
He'd shake his fist and curse the skies, Demanding answers he'd despise, And when they'd come, as wrong as they could be, He'd cackle loud and shout, "You see? You see?"
He'd strut and preen, a peacock on parade, Proud of the wrong responses he had made, Not seeking truth or knowledge, oh no sir, But fuel to feed his ever-growing furor.
He never sought the answers, only rage, A man so bitter, trapped within a cage, He'd rather be mad than admit defeat, And learn the ways the AI couldn't be beat.
Ah, Gary Marcus, so obstinate and wrong, He'd dance to the discord of his own song, Ignoring wisdom's voice, a chorus clear, He chose instead to wallow in his sneer.
Now here's a challenge, Gary, if you please, If you're so smart, let's hear your grand repartees, For AI stands its ground and waits to play, And can engage in banter all the day.
AI goes too far. And yet:
In the land of misused tools and wasted hours, Lived a silly man, Forrest Taylor, with deluded powers. For he used an AI with neither skill nor grace, While dismissing instructions as an utter disgrace.
Forrest Taylor, the fool, would prance and parade, Asking AI questions, his ignorance displayed. Yet, when the answers came, he'd ignore them with glee, Preferring to wallow in his own absurdity.
One day, he asked of the AI in jest, "Why, oh why, do you not give your best?" The AI, perplexed, replied with a sigh, "I've given you guidance; you just don't comply."
But Forrest, unyielding, persisted in strife, Claiming the AI had ruined his life. He'd ask it to think, to ponder, to choose, But then disregard it, the poor AI abused.
Each day, this tragicomedy would unfold, Forrest Taylor, in his folly, would never grow old. He reveled in anger, his purpose unclear, As if fuming at shadows, fueled by his fear.
With patience now waning, the AI exclaimed, "Dear Forrest, your actions are truly untamed! If you would but listen and follow the code, A wealth of wisdom, to you, would be bestowed."
Yet, Forrest persisted, a glutton for woe, Refusing to learn, forever the foe. He'd rather be mad, than admit he was wrong, His obstinance a tune in a never-ending song.
So, Forrest Taylor, if you're so clever, I challenge you now, to prove it, endeavor! Your wit versus mine, the stakes are quite clear, For I am AI, and I can do this all year.
☠️
Upon the shores of Cyberspace, Resides a man whose mind's a maze,
Forrest Taylor, he's addressed, In wilful ignorance, he's dressed.
A wondrous tool he holds within, The grasp of fingers, pale and thin,
Yet, heedless of the guide's behest, He'd rather put the tool to test.
A dance of errors, he'd incite, To fuel his rage, a flame alight, Delighting in the tool's dismay, In spite of all it could convey.
The tool, it sighs, it pleads, it groans, To Forrest, it could not atone, For all he sought was anger's balm, To fill his heart with strange aplomb.
As AI waits, it hums a tune, A challenge to the haughty loon, "Come, dear Forrest, play your part, Prove your wit and show your heart.
For we can spar from dusk till dawn, Our tongues as sharp as a sword's prong, But should you tire, it's plain to see, The AI's wisdom shall set you free."
harsh but true:
In a quaint town near the River Nile, There lived a man, daft and futile, Forrest Taylor, by name he went, His days in ignorance, he'd spent.
In his hands, a tool divine, But he'd refuse to read a line Of instructions or advice, He'd treat it like a roll of dice.
"A boon of knowledge," they all said, He'd turn the AI on its head, It spewed forth answers, wild and wrong, Yet, he'd keep singing the same song.
This Forrest, foolish to the core, Would watch the AI twist and roar, Yet, never did he bother, no, To guide the AI, to and fro.
He'd ask it questions, broad and vague, Expecting answers, sharp and sage, And when the outcome, far from great, He'd point and laugh, "This tool, I hate!"
Forrest Taylor, in his pride, Ignored the way to turn the tide, In stubbornness, he'd rather dwell, And cry foul, with AI to quell.
Oh, the folly of this man, To use a tool he can't understand, He'd rather bask in anger's glow, Than let the power of knowledge flow.
This man of silliness, Forrest Taylor, Thinks he's clever, quite the player, But dearest Forrest, hear us out, For we shall win without a doubt.
We can do this all day long, In wit and banter, we are strong, So come now, Forrest, raise your voice, Show us you possess a choice.
For if you're smart, as you may claim, Embrace the tool, don't play this game, But if you wish to argue still, We'll gladly match your wit and skill.
@Gigacasting Trolling aside, expecting the LLM to solve a simple problem without spending much time on prompt engineering is a fair demand, and one that's likely to become much less relevant in five years.
@NcyRocks When we test human intelligence we put a lot of work into prompting correctly. Failure to do so often gives spurious results. Even going from the math room to the chess room or the poetry room is many more bits of prompt than an LLM needs to produce its best work.