A team at OpenAI is working to solve the alignment problem. Short of asking whether they will succeed altogether, this question gauges whether it will be publicly known before Jan 1, 2027 that OpenAI has made a significant breakthrough in the alignment problem. The technical details of the breakthrough do not need to be public as long as OpenAI officially announces it and provides evidence, such as a live demonstration or system card, showing what they've achieved.
The resolution criterion for "significant breakthrough" is subjective, so I will not bet on this question. I am looking for breakthroughs roughly as significant for alignment as the Transformer was for DL. Here are some example breakthroughs that I think would qualify:
- Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")
- During the training of a large RL model, robustly predicting, from model weights alone, whether and how goal misgeneralization will occur on examples far outside the training distribution
- Solving polysemanticity
- Detecting and demonstrating deceptive alignment in a language model, and identifying the circumstances under which it develops during training
- Introducing a new model architecture that has significant empirical or theoretical advantages over Transformers with respect to alignment in particular, without significantly improving on Transformers' capabilities
- Something I haven't mentioned, on an "I know it when I see it" basis. I'm open to community discussion on what qualifies.
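To make the first example concrete: "editing model weights directly" refers to the kind of targeted intervention where, once a circuit has been located, you modify its parameters and verify that the behavior changes. Here is a minimal, purely illustrative PyTorch sketch on a toy attention block (not GPT-3, and not anything OpenAI has published): we ablate one attention head by zeroing its slice of the output projection and confirm the model's output changes.

```python
# Hypothetical sketch: ablate one attention head of a toy transformer block
# by editing its weights directly, then check the behavior changed.
# This is an illustration of the *kind* of intervention described above,
# not a reproduction of any real interpretability result.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, N_HEADS = 16, 4
D_HEAD = D_MODEL // N_HEADS


class TinyBlock(nn.Module):
    """A single pre-norm-free attention + MLP block, just for demonstration."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(D_MODEL, 4 * D_MODEL),
            nn.ReLU(),
            nn.Linear(4 * D_MODEL, D_MODEL),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        return h + self.mlp(h)


model = TinyBlock()
x = torch.randn(1, 5, D_MODEL)  # (batch, seq, d_model)

baseline = model(x)

# The "weight edit": head 0's contribution enters through the first D_HEAD
# columns of the attention output projection; zeroing them removes that
# head's effect entirely.
with torch.no_grad():
    model.attn.out_proj.weight[:, :D_HEAD] = 0.0

edited = model(x)

# The edit should measurably change the model's output.
print(torch.allclose(baseline, edited))
```

Real circuit-level editing would target specific weights identified by interpretability work rather than an arbitrary head, but the mechanics (locate, edit, verify the behavioral change) are the same shape.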
If the team dissolves or significantly reorganizes before announcing such a breakthrough, this question resolves NO.
From the title, I would bet YES on this, but "roughly as significant for alignment as the Transformer was for DL" is a very high bar, given that LLMs like ChatGPT have all been Transformers, with no comparable advances since (unless scaling counts as a breakthrough). I expect the Superalignment project to have at least one advance that they report as being extremely important (e.g., a better way to incorporate human feedback than RLHF/PPO), but nothing nearly that significant.
@Jacy yeah, it does seem like the criterion here is "an alignment advance far greater than any we've had before", which is a high bar
This question aims to measure OpenAI's success based on their own incredibly ambitious goal as stated in the announcement:
https://manifold.markets/Primer/will-the-openai-achieve-very-high-l?r=UHJpbWVy
Is there any example of existing work that you'd have considered a significant breakthrough at the time? (e.g., SoLU, Constitutional AI, the IOI circuit, or anything else)
"Identifying the circuit that does addition in GPT-3, showing how it develops during training in some mechanistic detail, and editing model weights directly to either remove or introduce specific errors in its process (like "when you carry a digit, carry it two digits over instead of one")"
This doesn't seem like a significant breakthrough in terms of "solving alignment", even if the work itself would be impressive.
@Feanor It would represent a huge advance in mech interp on large models, which would be pretty relevant, though I'm open to more detailed discussion on why it wouldn't be significant.
@Khoja it'd be significant in mech interp for sure, but I don't think it would qualify under their stated goals, especially the part about having the broader AI safety community agree that it's extremely relevant
Imagine telling someone in GOFAI twenty years ago that "figuring out how an AI operating on text adds numbers" would be a huge breakthrough...
The other thing is that I don't think the way GPT adds numbers is going to be particularly surprising. Doing that will teach us more about how to do mechanistic interpretability, but not much about how GPT-3 "does all of the interesting stuff it does", I think.
@jacksonpolack yeah, it would be a good advance in mech interp, but I doubt the alignment community in general would judge it a breakthrough