A team at OpenAI is working to solve the alignment problem. Short of asking whether they will succeed altogether, this question gauges whether it will be publicly known before Jan 1, 2027 that OpenAI has made a significant breakthrough in the alignment problem. The technical details of the breakthrough do not need to be public as long as OpenAI officially announces it and provides evidence, such as a live demonstration or system card, showing what they've achieved.
The resolution criterion for "significant breakthrough" is subjective, so I will not bet on this question. I am looking for breakthroughs roughly as significant for alignment as the Transformer was for DL. Here are some example breakthroughs that I think would qualify:
- Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")
- During the training of a large RL model, robustly predicting, from model weights alone, whether and how goal misgeneralization will occur on examples far outside the training distribution
- Solving polysemanticity
- Detecting and demonstrating deceptive alignment in a language model, and identifying the circumstances under which it develops during training
- Introducing a new model architecture that has significant empirical or theoretical advantages over Transformers with respect to alignment in particular, without significantly improving on Transformers' capabilities
- Something I haven't mentioned, on an "I know it when I see it" basis. I'm open to community discussion on what qualifies.
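To make the first example concrete: "editing model weights directly" refers to the kind of targeted intervention where, once a circuit has been located, you modify its parameters and verify that the behavior changes. Here is a minimal, purely illustrative PyTorch sketch on a toy attention block (not GPT-3, and not anything OpenAI has published): we ablate one attention head by zeroing its slice of the output projection and confirm the model's output changes.

```python
# Hypothetical sketch: ablate one attention head of a toy transformer block
# by editing its weights directly, then check the behavior changed.
# This is an illustration of the *kind* of intervention described above,
# not a reproduction of any real interpretability result.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, N_HEADS = 16, 4
D_HEAD = D_MODEL // N_HEADS


class TinyBlock(nn.Module):
    """A single pre-norm-free attention + MLP block, just for demonstration."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL),
            nn.ReLU(),
            nn.Linear(4 * D_MODEL, D_MODEL),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        return h + self.mlp(h)


model = TinyBlock()
x = torch.randn(1, 5, D_MODEL)  # (batch, seq, d_model)

baseline = model(x)

# The "weight edit": head 0's contribution enters through the first D_HEAD
# columns of the attention output projection; zeroing them removes that
# head's effect entirely.
with torch.no_grad():
    model.attn.out_proj.weight[:, :D_HEAD] = 0.0

edited = model(x)

# The edit should measurably change the model's output.
print(torch.allclose(baseline, edited))
```

Real circuit-level editing would target specific weights identified by interpretability work rather than an arbitrary head, but the mechanics (locate, edit, verify the behavioral change) are the same shape.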
If the team dissolves or significantly reorganizes before announcing such a breakthrough, this question resolves NO.
From the title, I would bet YES on this, but "roughly as significant for alignment as the Transformer was for DL" is a very high bar, given that LLMs like ChatGPT have all been Transformers, with no comparable advances since (unless scaling counts as a breakthrough). I expect the Superalignment project to have at least one advance that they report as being extremely important (e.g., a better way to incorporate human feedback than RLHF/PPO), but nothing nearly that significant.
@Jacy yeah, it does seem like the criterion here is "an alignment advance far greater than any we've had before", which is a high bar
This question aims to measure OpenAI's success based on their own incredibly ambitious goal as stated in the announcement:
https://manifold.markets/Primer/will-the-openai-achieve-very-high-l?r=UHJpbWVy
Is there any example of existing work that you'd have considered a significant breakthrough at the time? (e.g., SoLU, Constitutional AI, the IOI circuit, or anything else)
"Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")"
This doesn't seem like a significant breakthrough in terms of "solving alignment", even if the work itself would be impressive.
@Feanor It would represent a huge advance in mech interp on large models, which would be pretty relevant, though I'm open to more detailed discussion on why it wouldn't be significant.
@Khoja it'd be significant in mech interp for sure, but I don't think it would qualify under their stated goals, especially the part about having the broader AI safety community agree that it's extremely relevant
Imagine telling someone in GOFAI twenty years ago that "figuring out how an AI operating on text adds numbers" would be a huge breakthrough...
The other thing is that I don't think the way GPT adds numbers is going to be particularly surprising. Doing that will teach us more about how to do mechanistic interpretability, but not much about how GPT-3 "does all of the interesting stuff it does", I think.
@jacksonpolack yeah, it would be a good advance in mech interp, but I doubt the alignment community in general would judge it a breakthrough