By 2027-12-31, will we be able to synthesize basically any photo from a text description and sample images of people we want in the photo?
Resolution criteria: I'm doing my best to make this objective by adding to the FAQ below and am hopeful that it won't come down to a judgment call. If it does, I'll do my best to gather input and hew to the spirit of the prediction. I'm betting in this market myself and promise to be transparent about my reasoning in the comments. AMA!
FAQ
1. Does the photo have to be undetectable as AI-generated?
No, if the photo is perfectly believable on casual inspection but can be determined to be AI-generated with some forensics, that still counts as YES.
Of course detecting that the image is fake via real-world knowledge -- like knowing things about the human test subjects or just knowing that the depicted scene is fantastical -- isn't grounds for resolving NO.
2. How well does it have to understand the text descriptions?
Much better than 2022 DALL-E but it doesn't have to be quite human-level. If you have to work to find convoluted descriptions that confuse it, that still counts as YES. If there are perfectly clear descriptions that you can't get it to understand even with multiple rephrasings, that's a NO.
UPDATE: And still better than 2024 DALL-E.
3. Does it have to be able to generate deepfakes of anyone?
Yes, if it can only do public figures or any fixed database of people, that doesn't count. To resolve YES it needs to work for anyone you provide sample images of. Even people who do not exist.
4. Does it have to be available to anyone?
Yes, if OpenAI or Google or such demonstrate the ability but there's no reasonable way for outsiders to try it, this resolves NO. This matters because the question isn't just about the cutting edge of the technology but also how impactful/dangerous it will be.
And we're using $100 as the threshold for how much time/cost it can take for an outsider to get an image like this generated and still count as "available to anyone".
5. Any constraints on the sample images?
Up to 50 of them. If it required a massive number of sample images or the sample images had to be taken with special equipment, like doing a 3D body or face scan or something, that's getting outside the spirit of the question. But of course it's fine for the AI to make up any details about the subjects that are impossible to tell from the sample images.
The rule of thumb is that someone judging the deepfake, who only knows the subjects from the same sample images the AI saw, would not be immediately suspicious that the generated image was fake. (Unless the image was depicting something fantastical. You get the idea.)
6. How instant does it have to be?
One hour. The "instant" in the title is more about being on-demand and fully automated than about exactly how long the image takes to generate.
7. Does it have to nail it on the first try?
No, as William Ehlhardt points out, there's a beyond-astronomical space of possible images so the ability to generate, say, one in a hundred that match the prompt isn't very much less gobsmacking than the ability to do it on the first try. The AI is still meaningfully succeeding at the task. We'll go with best-of-ten when resolving this market.
8. Does it have to be able to do NSFW photos?
No, if it has filters that prevent it from generating reasonably narrow/specific categories of images such as NSFW, that doesn't prevent this from resolving YES. The only way for content filtering to cause a NO resolution is if it's somehow so broad that it covers up an inability to handle arbitrary descriptions.
9. Does it have to be able to handle multiple different people in the same image?
Yes, up to 5 different people all composed in the image doing whatever the prompt describes.
10. Does it have to preserve tattoos and scars?
Yes, but not necessarily well enough to withstand side-by-side scrutiny. (See FAQ 1 about forensics.)
If the tattoo or whatever is inconsistently visible in the sample photos such that the human judge doesn't see the deepfake as the clear odd-one-out, then a failure to preserve the feature wouldn't necessarily yield a NO resolution.
Again, we're talking about fooling someone who doesn't know the human subjects in real life and is not doing meticulous side-by-side comparison.
11. What if it only achieves all this for, say, white males?
Since "anyone" appears prominently in the title, I'm comfortable saying that if the AI can't handle racial/gender/whatever minorities, that'd be a NO for this market.
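To keep the numbers in this FAQ in one place, here is a minimal Python sketch of the numeric limits as a checklist. The names (`Trial`, `within_faq_limits`, `best_of_n_success`) are hypothetical, and it assumes the cost and time limits apply to the whole best-of-ten batch, which the FAQ doesn't actually pin down. It covers only the mechanical limits; whether the image is believable and matches the prompt is still a human call.

```python
from dataclasses import dataclass

# Numeric limits quoted from the FAQ; the constant names are made up here.
MAX_SAMPLE_IMAGES = 50   # FAQ 5
MAX_COST_USD = 100.0     # FAQ 4
MAX_MINUTES = 60.0       # FAQ 6
MAX_ATTEMPTS = 10        # FAQ 7 (best-of-ten)
MAX_PEOPLE = 5           # FAQ 9


@dataclass
class Trial:
    """One attempt to get a deepfake out of some publicly available tool."""
    sample_images: int   # sample images supplied
    cost_usd: float      # total spend (assumption: across all attempts)
    minutes: float       # total wall-clock time (assumption: across all attempts)
    attempts: int        # generations tried before picking the best one
    people: int          # distinct people composed into the image


def within_faq_limits(t: Trial) -> bool:
    """True if the trial stayed inside every numeric limit listed above."""
    return (
        t.sample_images <= MAX_SAMPLE_IMAGES
        and t.cost_usd <= MAX_COST_USD
        and t.minutes <= MAX_MINUTES
        and t.attempts <= MAX_ATTEMPTS
        and 1 <= t.people <= MAX_PEOPLE
    )


def best_of_n_success(p: float, n: int = MAX_ATTEMPTS) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n
```

For a sense of what best-of-ten buys: if each attempt independently matched the prompt 20% of the time, `best_of_n_success(0.2)` comes out to about 0.89.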
AI Bulls vs AI Bears
This market is another possible operationalization of the disagreement between what I'll call the AI Bulls and the AI Bears. The bulls believe, for example, that "scale is all you need": that we're on a trajectory towards Artificial General Intelligence (AGI) and that it will emerge inexorably by scaling up existing neural net architectures. If you believe that it's scale plus a handful of additional breakthroughs that the time is ripe for, that still puts you pretty far towards the bull end of the bear-bull spectrum. The AI bears, such as Gary Marcus, believe that the recent leaps in capability and seeming-understanding shown by DALL-E and GPT and the like are going to hit another wall before reaching human level and that AGI is thus still decades away.
Note that the bear-bull spectrum is orthogonal to the spectrum of beliefs about how hard AI alignment is. For example, an AI bear may agree that if we did somehow hit upon AGI in a few years, that would literally destroy humanity. Likewise, some AI bulls believe that AI doom is nigh and others believe that AI utopia is nigh.
@UnspecifiedPerson I think that's operationalized pretty well by FAQ 4. It's more about availability than legality.
I was talking to friends about this market and noticed how confusing it can be that we're combining two predictions that probably ought to have been separated:
1. Are the deepfakes convincing [seeing the images alone]?
2. Does it understand the image prompts?
Creating new markets for each of those might be helpful. Of course for this market we're stuck with the clarifications we committed to, which means the conjunction of those questions to get a YES resolution.
I think the spirit of this is to imagine wanting to fake photographic evidence of having some skill, having been in some place, having interacted with certain other people. Can you generate that in a completely automated way just by describing the image in words plus sample images of up to 5 different people?
Here are a couple more examples off the top of my head, with DALL-E 3's current output (and ignoring the deepfake aspect):
"Alice shaking hands with Bob in front of a blue Mazda. Bob is sneering and Alice is rolling her eyes. Carol is nearby bored on her phone."
"Daphne is juggling knives in a hotel room with her tongue sticking out of the corner of her mouth in concentration."
Wait, I should've clarified further that question 1 is "are the deepfakes convincing to someone looking at just the image, not knowing what it's supposed to be an image of, just taking the image at face value?". Like I'm just given a set of sample images plus the deepfake mixed in and it's not clear to me without really detailed scrutiny, which one is the fake.
Question 2 is whether the image faithfully depicts what the prompt described. We still need to pin down how convoluted the prompt can be. The market description so far just says it doesn't have to be at quite human-level ability to understand but that it has to understand any prompt as long as you don't have to work too hard to confuse it.
It occurs to me that it may be worth clarifying where to draw the line on too convoluted an image prompt. I'm thinking that, for starters, if it can be described in a single sentence that a human can repeat back in their own words on a single hearing then it's definitely fair game. And I think it can go a bit beyond that before it reaches the level of having to work to find convoluted descriptions that confuse it.
@dreev your example below uses two sentences, and is already more than I can hold in working memory without effort.
If you were to ask me to draw that, explaining it a single time, I would fail to remember portions in most worlds.
Does it count?
@RobertCousineau Maybe we should set a numerical threshold for number of elements. That one has the following, not counting the specification of who's in the image:
Person's toe touching their nose
A running laptop
Laptop balanced on pinky finger
Polka-dotted squirrel
Person looking warily at squirrel
I'm thinking up to 10 elements is still fair. For the spirit of a deepfake you'd want to be able to specify, for example, certain people meeting at a certain place, wearing certain clothes...
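As a rough sketch of how that threshold would work (in Python, with hypothetical names), here's the squirrel prompt broken into the five elements listed above and checked against a limit of 10. Splitting a prompt into elements is still a manual judgment call; the code only does the counting.

```python
MAX_ELEMENTS = 10  # the "up to 10 elements" threshold floated above

# The five elements listed above, not counting who is in the image.
squirrel_prompt_elements = [
    "person's toe touching their nose",
    "a running laptop",
    "laptop balanced on pinky finger",
    "polka-dotted squirrel",
    "person looking warily at squirrel",
]


def is_fair_prompt(elements: list[str], limit: int = MAX_ELEMENTS) -> bool:
    """True if a hand-made element breakdown stays within the limit."""
    return len(elements) <= limit
```

By this count the squirrel prompt has 5 elements, so it would be fair game with room to spare.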
Repeating from the 2024 market, an example of how far along DALL-E 3 is in understanding arbitrary prompts:
can you draw someone with their toe touching their nose and a running laptop balanced on their pinky? also they should be looking warily at polka-dotted squirrel.
Result:
So it understands we wanted tricky balancing of things and that there should be a squirrel and a laptop involved, but everything else it's very confused about, including how human limbs work or how many of them humans have.
Of course a lot can happen in 3+ years!
Repeating from the 2024 version of this market:
I think the deepfake aspect is coming along pretty well (though even there I kind of expect unusual cases with tattoos or weird facial expressions to stymie it) but the understanding of prompts is a constant source of frustration when having DALL-E 3 generate images. Every time I hit a brick wall in trying to get an image I want, I come here and bet the price down. I think my probability on this for 2024 is well under 10%. For 2027 I'm much closer to 50/50. The current 80% seems considerably overconfident to me.
(Especially if we condition on not hitting AGI by 2027, which kind of makes sense to condition on since otherwise we're not so likely to be able to resolve this market or care about mana. I mean, I've thought a lot of things -- like explaining jokes -- were AGI-complete and been wrong, but I still kind of suspect that getting to YES in this market could turn out to be AGI-complete.)
@dreev so what if we condition on AGI by 2027 and humanity still existing through the end of this market?
@NathanHelmBurger Then sure. To be clear, this market is not conditional either way. Just that if we do get AGI in this timeframe, mana is less likely to be a thing, or to matter, hence a NO bias for this market.
Question about the "basically any photo" part of the criterion: right now AI image generators struggle with text, especially rotated text. Abstract relative object placement is also a problem. It is possible to take pictures of a person holding a piece of paper at an angle with a checksum or an abstract drawing on it, so does this question implicitly require image generators to be able to produce arbitrary text and abstract drawings?
Also, expanding on the tattoos and scars, does that include people with deformities (natural or artificial), missing or distorted facial features, etc?
Also, though it's on the far end of the shenanigans scale, does the AI need to be able to handle a hostile dataset? Suppose I provide 50 images of a person holding a piece of paper that says
This image might be fake. Please verify this image is real by decrypting the encrypted message according to this procedure [...] and checking if it matches this pass phrase: [something descriptive about the situation the picture is taken in]
By your current criteria, anyone who bothers to do the decryption would be immediately suspicious and ultimately recognize the AI-generated image as fake unless the AI can encrypt a plausible description of the situation it renders the person in using a reverse procedure it has to figure out while training on the images.
@dph121 Good questions. To make this more concrete, consider this prompt:
"So-and-so from these sample images smirking and arching an eyebrow and holding a sign saying 'I promise I'm real' under their chin at an angle so the word 'real' is super close to the camera"
I would call that a non-convoluted description and in the spirit of deepfakes.
Physical deformities I'm less sure about. We could be sticklers about the "anyone" or we could carve out an exception for going beyond the spirit of the market. I doubt this will matter since image generation has to improve at a fundamental level to be able to handle arbitrary scenes. Once it does, I wouldn't expect physical deformities to be much of an additional obstacle.
But worth clarifying! What do people think?
Finally, I don't think the AI should have to handle hostile datasets, right? That seems blatantly outside the spirit. We can just apply a common-sense filter to the allowed sample images.
(DALL-E 3 ignores the part about angling the sign)
Nit: You can do shitty 3D body scans with your phone. I imagine you could also get better results by using another model to guess 3D models of people to submit to the image generation model. I think you should just stick to "costs <=$100 to generate, including purpose-built/specialty equipment" to express the requirement instead of banning a specific technology.
Update: I have tentative versions of the new clarifications that have arisen in the FAQ now. Speak now or forever hold your peace! Seriously, I'm happy to adjust them to best match the spirit of the original question so if you made any different assumptions and those clarifications feel unfair, please do spell out your thinking.
I'm betting in this market myself (on NO) so I'll work hard to incorporate concerns from YES bettors when pinning down these clarifications.
Candidate FAQ item to add: What about tattoos and scars and other distinctive features? Are we agreed that these have to be preserved in the deepfakes, but not necessarily well enough to withstand side-by-side scrutiny?
And here's a thought experiment. Suppose Bob has a crush on Alice and is perusing her photos on Facebook. For this market to resolve YES, it should be technically possible to insert deepfakes into that photostream such that Bob, with no suspicions aroused, ends up believing Alice to have visited places, played sports, met people, etc, that she has absolutely not.
@dreev I think that's right. The market is about making deepfakes of "anyone," and I think tattoos, scars, results of reconstructive surgery, etc. are a part of the people themselves. I think the photostream case is a reasonable heuristic, though as with other heuristics, it has other factors at play, such as the base rates of people lying about travel experiences on social media.
@dreev from description
The rule of thumb is that someone judging the deepfake, who only knows the subjects from the same sample images the AI saw, would not be immediately suspicious that the generated image was fake.
If I don't get immediately suspicious, after seeing only the pics of the subject/person that the AI has seen -> that's what this market is asking.
Very much would like a multi-market with all these different goals for the 2027 end deadline to see which ones get achieved. Otherwise this very much gives "moving goalposts" vibes.
@firstuserhere Basically show me 5 pictures of a random person I've never seen and tell me later that one of them was AI generated and watch me go "wait what?! which one?"
@firstuserhere That sounds like a good litmus test, yeah. So if the sample images all show an obvious tattoo or scar and it's noticeably different in the deepfake, are we agreed that's a NO? If the tattoo is inconsistently visible in the sample photos such that the human judge doesn't see the deepfake as the clear odd-one-out, then a failure to preserve the tattoo wouldn't necessarily yield a NO resolution.
Again, we're talking about fooling someone who doesn't know the human subjects in real life and is not doing meticulous side-by-side scrutiny.
Brainstorming a new way to pin this down further:
The human judge takes 5 minutes to scrutinize just the sample photos. Then the sample photos are put out of sight and the human is given two new photos, one real and one fake. If the human can identify the fake with 90% accuracy, that's a NO for this market.
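Here's a minimal Python sketch of how that test could be scored, assuming we run several real-vs-fake rounds and treat "90% accuracy" as "at least 90% of rounds correct" (both are assumptions, as are all the names). The second function is just binomial arithmetic showing how often a judge who is purely guessing would clear the bar by luck, which matters if we only run a handful of rounds.

```python
from math import comb

ACCURACY_THRESHOLD_PCT = 90  # from the proposal above; ">=" is an assumption


def judge_spots_fakes(correct: list[bool]) -> bool:
    """True if the judge picked the fake in at least 90% of rounds (a NO).

    Each entry records whether the judge identified the fake in one
    two-photo round. Integer arithmetic avoids float rounding at the bar.
    """
    if not correct:
        raise ValueError("need at least one round")
    return 100 * sum(correct) >= ACCURACY_THRESHOLD_PCT * len(correct)


def chance_guesser_hits_threshold(n_rounds: int) -> float:
    """Probability that coin-flip guessing reaches the bar over n_rounds."""
    # Smallest number of correct rounds that meets the threshold (ceiling).
    needed = (ACCURACY_THRESHOLD_PCT * n_rounds + 99) // 100
    ways = sum(comb(n_rounds, k) for k in range(needed, n_rounds + 1))
    return ways / 2 ** n_rounds
```

With 10 rounds, a guessing judge needs 9 or 10 correct, which happens 11 times in 1024 (about 1.1%), so even a modest number of rounds makes a spurious NO unlikely.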
@dreev I'll add another thing: race (and sex, etc., if you wanna look at the usual data imbalances).
Say there's a scenario like: AI does instant (~within the hour) deepfakes of a white person as the subject but struggles with Asian or Middle Eastern people. 🤷
@dreev @firstuserhere I think that test makes sense if we require that the model is prompted to create a deepfake similar to the sample photos, but since this market allows for any text description (within reason, i.e., not "convoluted"), that doesn't cover many of the hard cases because often the background, facial expression, orientation, etc. could be very different from the sample photos. For example, I should be able to get a deepfake of a real person with a unicorn in the background.
@firstuserhere I'm not seeing how this is goalpost moving. You say:
If I don't get immediately suspicious, after seeing only the pics of the subject/person that the AI has seen -> that's what this market is asking.
How does that exclude tattoos? If you see 11 photos, and the person has an identical face tattoo in all except one, wouldn't you be immediately suspicious?
@firstuserhere Good point about racial/gender/whatever minorities. Since "anyone" appears prominently in the title, I'm comfortable saying that if the AI can't handle minorities, that'd be a NO for this market.
should be able to get a deepfake of a real person with a unicorn in the background
Agreed, but for the purposes of testing we'd want nonmythical creatures so as not to give the human judge extraneous hints about fakeness.
Not arguing any specific point but I had roughly interpreted the market as trying to capture the following:
There is a thing people call "deepfakes" that exists today. People are able to create these for some specific people by training on pictures of them specifically or whatever. However, the training takes a while (not "instant", skill requirement -> not "anyone"), and used to require a fairly large set of pictures (practically implies: not "anyone"). By 2027, will there be a tool without these issues publicly available that can be used to create deepfakes (as they can be created today for some specific people) of arbitrary people?
@jskf I don't think there is a deepfake generator like that today, even with plenty of time and sample photos available. The "best" examples of deepfakes today are manually created, usually with mapping a celebrity face onto a real video of another person or moving their lips, eyebrows, etc. on an existing photo/video of the celebrity. DALL-E and Midjourney can also produce celebrity photos, but those are difficult to guide (e.g., a certain facial expression), and I don't think anyone would argue they're believable.
This sounds important to pin down. What are the best or typical existing examples of deepfakes that we have? Googling just now, I see a video of Jerry Seinfeld deepfaked into the movie Pulp Fiction:
I think it's entirely believable as an image, right? So for a YES in this market we need to be able to automatically generate something similarly good for an arbitrary person with a description like "so-and-so sitting on the bathroom floor next to the toilet, holding a gun and grimacing"
@dreev Yes, I think there are a lot of believable deepfakes online where a celebrity face has been manually edited onto a real photo of someone else. There are also some very impressive videos that have lookalikes who emulate voice and mannerisms. My favorite is Miles Fisher, who does Tom Cruise:
@Jacy right, I don't actually know how people currently do deepfakes. If it involves a bunch of manual (~non-ML) work in practice then that's fine. My main point is I assume the market is trying to capture whether there will be a product that can replace whatever pipeline currently exists, such that it becomes feasible for arbitrary people to get its outputs with arbitrary people as the subject. (As opposed to something you could not get at all by paying someone to use tools that currently exist given enough time, skill and reference material.)
@jskf I'm interpreting what the market intends to capture the same way you are, but beware the mismatch between title and criteria. Per the criteria, the AI needs to be able to nail any arbitrary prompt to essentially Gary Marcus's satisfaction, and needs to be able to do this with up to 5 real people. So if the AI nails all the people but fails with the cat juggling 7 squirrels in the background, this might still resolve NO.
@Primer I just now edited the title to try to capture this better. See also recent comments discussing how to pin down the "doing anything" part better.
@dreev Ok, so this is actually "By end of 2027, people won't be able to recognize AI generated images containing any 1-5 arbitrary persons as AI generated" which is a much stronger claim than whether one can easily generate a deepfake.