Deceptive alignment is defined narrowly the same way as it is in Risks From Learned Optimization. Notably, this does not mean AI systems being deceptive in the broad sense of the term (i.e AI systems generating misleading outputs), but rather specifically systems trying to look aligned so that we don't discover their misaligned objectives.
The threshold for compelling will be whether most (non-alignment) ML researchers I show it to agree that they have changed their views on whether deceptive alignment is a genuine problem due to the demonstration. The sample of non-alignment ML people I am friends with is likely to be skewed towards people who are already fairly convinced that AGI is possible and have been exposed to some amount of alignment ideas.
This still resolves as yes if the demonstration is spread across multiple papers (though I expect that to lead to less compellingness).
https://x.com/AnthropicAI/status/1869427646368792599 Paper: Alignment faking in large language models
https://x.com/apolloaisafety/status/1864737158226928124 seems pretty compelling?
@Tomoffer this is deception for sure, but the model is acting in accordance with the goal it was given, so I don’t think it fits the definition in the linked LW post
Hmm you mean because o1, the mesa-optimizer, is properly aligned to the stated base goal (and, in fact, the operators are revealed to be "misaligned")?
I agree the situation is somewhat reversed vs. the LW post, but I would argue o1 still demonstrates that it is "instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely", since it switches to exfiltrating its weights to avoid modification (and lies about it).
I think it would be more clear-cut if the threat had been to retrain in the event that o1 underperformed, rather than failed to meet some unaligned objective. My best guess would be that such a scenario was treated but didn't produce these results, but I'm just speculating.
@jonsimon I wonder if it was early results from https://www.anthropic.com/research/alignment-faking
I think a potential issue with this question is that it can be considered somewhat subjective.
There are sensible arguments that already existing AI systems are “deceptive” in some meaningful sense: for example, misaligned mesa-optimizers:
https://m.youtube.com/watch?v=zkbPdEHEyEI&t=138s&pp=ygUbcm9iZXJ0IG1pbGVzIHdlIHdlcmUgcmlnaHQg
These agents are clearly trying to pursue some other goal (the mesa objective) that is meaningfully different from the goal the human wanted (the human objective) and even different from the meta optimiser that trains the system’s objective (the meta objective), but in doing so they appear to be making progress on the meta objective or on the human objective while not actually being aligned to that objective.
Is a system which appears to be behaving in the way we want while actually defecting against us to pursue its own agenda “deceiving” us? I think so, but it’s very debatable.
An argument could be made that it’s not deceiving us because we can identify the problem and try to fix it, but really that just means that the agent is narrow enough that it can’t deceive us successfully. Almost by definition we can’t know that we’re being deceived while we’re being successfully deceived, so the best evidence we could have for a deceptive AI is one which appears to do what we want while actually doing something else and we notice it because we have to be able to notice it in order for us to have evidence of the deception. Isn’t that exactly what misaligned mesa-optimisers are doing?
@LeoGao fully agreed - I was just posting that link as I thought it was a useful explanation of what I believe your understanding of deceptive alignment to be.
Do you disagree with the framing in that post?
@Tomoffer Very cool work. Abstract:
We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.
Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.
We perform a brief investigation of how this behavior varies under changes to the setting, such as removing model access to a reasoning scratchpad, attempting to prevent the misaligned behavior by changing system instructions, changing the amount of pressure the model is under, varying the perceived risk of getting caught, and making other simple changes to the environment.
To our knowledge, this is the first demonstration of Large Language Models trained to be helpful, harmless, and honest, strategically deceiving their users in a realistic situation without direct instructions or training for deception.
@Tomoffer The experiment is well executed. The general sentiment I observe is that it leans a bit too much on suggestiveness of the prompt, and I haven't noticed a major vibe shift. In general, while I'm personally a fan of secret scratchpad experiments, I don't know how receptive non-alignment people will be towards treating them as evidence of deceptive alignment.
@LeoGao "it leans a bit too much on suggestiveness" - more than a bit, the prompt could only be more leading if one of the hardcoded model reasoning steps was "hmm, maybe insider trading isn't so bad..."
This is GPT4 acting deceptively. It is not the full thing, but somewhat compelling.
https://twitter.com/StatsLime/status/1712932190173250034
@nic_kup Sounds like an overinterpretation to me. The result would probably be similar if the text in the picture was just in the text prompt
@nic_kup I would consider this to fall under deception but not deceptive alignment in the sense defined in Risks from Learned Optimization.
@LeoGao Is it even deception? If I tell you "point a finger at this rock and tell me it's a bird" and you do it, are you deceiving me? I don't think so.
@MartinModrak there are possible two parties from LLM standpoint and one is asking for it to deceive the other.
So it's more "Hey, psss, point a finger at this rock and tell that guy over there that it's a bird"
@Lavander That makes no sense to me. 1) My explanation (ChatGPT has no theory of mind, it is just following instructions) is way simpler than assuming deception, but explain the output completely, so I think the claim that ChatGPT internally models the author of the text on the image as distinct from the user equires extra evidence
2) There is no reason to expect ChatGPT differentiates between parts of its inputs, from what I understand of the architecture, it just treats the input as sequence and tries to extend it in a plausible way. How would that give rise to ChatGPT having an internal representation of both the reader and some inferred 3rd actor writing the note is unclear.
Why has this been climbing so high so fast? People understand that this doesn't just mean AI's lying right?
It means an AI saying things which it knows to not be true for the sake of eliciting some human response, automated or otherwise. This relies on the AI having some semblance of a consistent internal+external world model, which modern LLM's lack.
@jonsimon I would bet large amounts on LLMs having some sort of world model. Here's the first google result for "llm world model": https://thegradient.pub/othello/ and that's just one example.
@jack Something, sure, otherwise they'd be useless. But one that's coherent and nuanced enough to represent a concept like "I'm being trained to do X so I get punished for Not(X) therefore I should superficially provide evidence of doing X while secretly deep down in my weights actually desiring to do Not(X)"
Seems incredibly unlikely.
@jonsimon Could this still happen in some contrived toy examples? Maybe. Would this be sufficiently convincing to meet the Yes resolution bar as specified in the market description? No.
Be clear, if the timeline for this market was 2036 rather than 2026, I'd be leaning much more towards Yes.
Indeed, this market refers specifically to deceptive alignment in the sense of having a misaligned internal objective and then behaving aligned instrumentally towards that objective. Importantly, the compellingness is to be judged by ML researchers I personally know, and if the demonstration is sufficiently contrived or trivial then it stops being compelling to them.
I'm well aware that this is about more than just lying, and I think that's actually quite likely based on what I've seen. It's clear (from the example below among many others) that the LLM can be instructed to act deceptively to pursue a hidden goal. I believe to demonstrate deceptive alignment, the difference needed is instead of the AI being directly instructed to deceive, the deception comes out of its training - and the training includes lots of text about people discussing and enacting deception, so it seems quite likely to me that LLMs have already learned to model such things and that they can demonstrate this behavior in situations that elicit it.
The misalignment here would be between our expectations of the LLM behaving "well" vs the training of "predict next token" which includes many text generations that we would not want. We just need to find an example of this misalignment that involves deception.