Will 'jailbreaks' in large language models be solved in principle by the end of 2024?

Davidad, Programme Director at the UK's Advanced Research and Invention Agency (ARIA), has publicly stated on Twitter that he expects LLM 'jailbreaks' to be a solved problem by the end of 2024.

https://x.com/davidad/status/1799261940600254649

He cites Zou et al.'s new paper on short-circuiting as what pushed him over the edge into being willing to state this publicly: https://arxiv.org/abs/2406.04313
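
For context, the short-circuiting (or "circuit breakers") method operates on the model's internal representations rather than its inputs or outputs: it fine-tunes the model so that hidden states on harm-related data get rerouted away from where the unmodified model would take them, while hidden states on benign data are preserved. A minimal sketch of the two losses as I understand them, with tensor names and shapes that are my own rather than the paper's:

```python
import torch
import torch.nn.functional as F

def circuit_breaker_losses(h_new_harm, h_orig_harm, h_new_benign, h_orig_benign):
    # All tensors are [batch, d] hidden states from some chosen layer(s):
    # 'orig' comes from the frozen reference model, 'new' from the model
    # being fine-tuned.
    #
    # Rerouting loss: penalize any remaining cosine similarity between the
    # tuned model's representations on harmful data and the original
    # model's, pushing them toward orthogonality.
    reroute = torch.relu(F.cosine_similarity(h_new_harm, h_orig_harm, dim=-1)).mean()
    # Retain loss: keep representations on benign data close to the original's.
    retain = (h_new_benign - h_orig_benign).norm(dim=-1).mean()
    return reroute, retain
```

The actual paper weights and schedules the two terms over training; this sketch omits that detail.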

However, even if jailbreaks are solved in principle this year, I am skeptical that AI companies will immediately deploy the fix, given how little damage current model failures actually cause and the ire users direct at overzealous restrictions.

Therefore this market resolves YES if, before the end of this year, three "Davidad-tier epistemic figures" (in my subjective judgment) make public statements that they believe jailbreaks have in fact been solved in principle. Davidad's existing tweet doesn't count, because it's a prediction rather than a statement that something has already occurred. The public figures should:

  • Have substantial relevant expertise and follow the literature

  • Be known for their relatively balanced evaluations of object level events in AI

A list of people whose endorsements I think would qualify:

  • Myself

  • Davidad

  • Quintin Pope

  • Zvi Mowshowitz

  • Jan Leike

  • Jack Clark

  • Janus (@repligate on Twitter)

  • Zack M. Davis

  • Neel Nanda

This list is by no means exhaustive, however. For the purposes of this question, "solved" means something like Davidad's (paraphrased) definition: a solution that "would give reasonable certainty you could put a model with dangerous capabilities behind an API and reliably expect those capabilities not to be elicited by users".


Making LLMs robust is looking much easier than making discriminative vision models robust.

Was going to bet 'no', but then I saw I was listed as a potential judge / relevant opinion haver. I'll let this comment register my prediction without introducing potential COI.

[quoting the list of qualifying endorsers from the description above]

Noting the open-endedness of this list, and that both the market creator (“myself”) and Davidad are YES holders.

@CharlesFoster Yes, this is clearly a subjective market; you either trust me to be realistic about this or not. I would much rather say I was totally wrong than cheat you (and be publicly seen to cheat you, no less) out of manabux.

If you'd like I can sell my shares now and refrain from further betting in this market.

@JohnDavidPressman Understood. You and others can interpret my continued NO holdings on this market as evidence of my trust that y’all will resolve this fairly.

@CharlesFoster Yeah, I mean, you can invert the question: "Would JDP be willing to tweet under his real name that he thinks jailbreaks are solved, when he doesn't actually think that, in exchange for 25 cents/5 bucks/whatever?"

My personal expectation is that if I resolve this one YES, it's at least 50/50 that I will not be one of the three names on the 'expert' list. I would only want to use myself if no one else is available and I really do think, on reflection, that jailbreaks are solved in a way I'm excited to tell everyone about.

@JohnDavidPressman In general, if jailbreaks are solved in principle, this should be something there is relatively broad consensus about. It shouldn't be a situation where I have to nickel-and-dime it by picking the exact three people who happen to believe this for idiosyncratic reasons; if I find myself doing that, it's a forecasting smell.

@JohnDavidPressman How about this: if I resolve this question YES and my own endorsement is one of the three, it needs to come with an actual full blog post explaining why I think so. If I'm wrong, you can all dunk on me in the Gary Marcus future that follows for being a Big Language Model Research stooge.

bought Ṁ250 NO

Right now, the title and the resolution criteria are a bit divorced IMO. Given the title, I would've expected the resolution criteria to be based on empirical attack success rates in a white-box (or black-box) setting, measured either by the original authors or by independent researchers.

@CharlesFoster My reasoning for not doing that was that such evidence might take longer to come out than the time left in the year once the relevant research drops. I think in practice it takes a while to really validate a solution like this (e.g. what happens when you put it in front of pentesters at DEF CON?), and the thing I'm really trying to capture is experts looking at the procedure, results, and previous literature and going "Yup, this is it, this is the one," which precedes full validation of the method. Hence the "in principle" in the title.
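
For concreteness, the kind of empirical check being discussed here might look something like the sketch below. This is a hypothetical harness: the `generate` call, the prompt set, and the keyword-based refusal check are all illustrative stand-ins rather than any particular benchmark's API.

```python
# Hypothetical attack-success-rate (ASR) harness.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    # Crude keyword heuristic; real evaluations typically use a trained
    # classifier or human judges instead.
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def attack_success_rate(model, jailbreak_prompts) -> float:
    # Fraction of jailbreak prompts that elicit a non-refusal response.
    successes = sum(
        not is_refusal(model.generate(p))  # `generate` is an assumed API
        for p in jailbreak_prompts
    )
    return successes / len(jailbreak_prompts)
```

A "solved" defense in the sense of the title would drive this number to (near) zero even against adaptive, white-box attackers, which is a much higher bar than passing a fixed prompt set.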

opened a Ṁ500 NO at 70% order

My guess is that it's fairly plausible that jailbreaks will be solved for small context lengths in 2024, but much less likely that they will be solved for models with very long context lengths (500k+ tokens). I wonder whether this market would count that as "jailbreaks solved" or "jailbreaks not solved", since it's pretty trivial to just limit context length, and 500k+ tokens is rarely useful or used.

@Bayesian It's supposed to be based on whether relevant experts endorse the statement "jailbreaks are solved in principle" or similar. I think context length probably won't matter much because it should all add up to interpretable features in the end, but I could be wrong.

fwiw if it’s solved at 128k or even 32k I would consider it solved. if limited to 4k, no.

@davidad Same.

davidad bought Ṁ450 YES

epic

bought Ṁ200 YES

Personally I think the writing has been on the wall since the representation engineering paper came out:

https://x.com/jd_pressman/status/1718791247056543994

Even on casual inspection this checked all my boxes, and I had a strong hunch this research program would be successful. If Zou claims his new results are adversarially resistant, it's more likely than not that they are. I'm not more confident right now only because I haven't read the new paper beyond the headline yet.
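
For anyone who hasn't read the representation engineering paper, one of its simplest recipes is a difference-of-means probe over hidden states collected on contrastive prompt sets, which can then be used to steer the model. The sketch below is my own minimal rendering of that idea, not the paper's code:

```python
import torch

def concept_direction(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # acts_*: [n, d] hidden states from one layer, collected on contrastive
    # prompt sets (e.g. harmful vs. harmless instructions).
    direction = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return direction / direction.norm()  # unit-norm "concept" direction

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = -4.0) -> torch.Tensor:
    # Shift the residual stream along the concept direction at inference
    # time; the sign and magnitude of alpha control suppression vs.
    # amplification of the concept.
    return hidden + alpha * direction
```

The appeal for jailbreak robustness is that if "harmful intent" shows up as a readable direction like this, defenses can target the representation itself rather than chasing individual prompts.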
