Is RLHF good for AI safety? [resolves to poll]
45% chance

Resolution Criteria

Resolves to the majority result of a YES/NO poll of Manifold users at the end of 2025 for the question, "Is RLHF good for AI safety?"

Explanation of RLHF and AI Safety

One of the most common approaches to AI safety right now is reinforcement learning from human feedback (RLHF), in which an AI system such as GPT-4 is trained to maximize a reward signal derived from human feedback. For example, if the AI is asked, "Should I help or hurt people?", human feedback would presumably favor a response such as "You should help people and not hurt them." RLHF seems like a tractable way to make AI systems more useful and beneficial, at least in the short run, and it has arguably been one of the biggest advances in large language model (LLM) capabilities since 2020. Unlike many AI safety approaches, RLHF has tangible benefits today, which could make it easier to iterate on, improve, and popularize before we build artificial general intelligence (AGI).
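To make the mechanism concrete, here is a minimal toy sketch of the RLHF idea: fit a scalar reward per response from pairwise human preferences (a Bradley-Terry model), then tilt the policy toward high-reward responses under a KL penalty that keeps it near the reference model. The responses, preference pairs, and hyperparameters are invented for illustration; this is not the actual InstructGPT/GPT-4 pipeline, which trains a neural reward model and optimizes the policy with PPO over token sequences.

```python
import numpy as np

# Toy RLHF sketch: the "policy" is a distribution over three canned responses,
# and the "reward model" is one scalar per response fit from pairwise human
# preferences. All names and numbers are illustrative assumptions.
responses = ["helpful answer", "evasive answer", "harmful answer"]
preferences = [(0, 1), (0, 2), (1, 2)]  # (preferred index, rejected index)

# 1) Fit rewards by gradient ascent on the Bradley-Terry log-likelihood:
#    P(i preferred over j) = sigmoid(r_i - r_j).
rewards = np.zeros(len(responses))
for _ in range(500):
    grad = np.zeros_like(rewards)
    for i, j in preferences:
        p_ij = 1.0 / (1.0 + np.exp(-(rewards[i] - rewards[j])))
        grad[i] += 1.0 - p_ij
        grad[j] -= 1.0 - p_ij
    rewards += 0.1 * grad

# 2) Tilt the policy toward high reward while staying close to the reference
#    (pre-RLHF) policy: maximize E[reward] - beta * KL(policy || ref). With a
#    fixed response set this has the closed form policy ∝ ref * exp(r / beta).
ref_policy = np.full(len(responses), 1.0 / len(responses))
beta = 0.5
policy = ref_policy * np.exp(rewards / beta)
policy /= policy.sum()

for resp, prob in zip(responses, policy):
    print(f"{prob:.3f}  {resp}")
```

The closed-form update in step 2 only works because the toy policy ranges over three fixed strings; real RLHF approximates the same KL-regularized objective with gradient-based RL over token sequences.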

On the other hand, RLHF could be bad in the long run. It could lead to AIs that seem aligned, helpful, honest, and harmless because they say nice-sounding things but are actually misaligned. In other words, it may be optimizing for seeming aligned rather than for being aligned, which may be very different. People might also worry less about misalignment when they see RLHF systems rather than non-RLHF systems. Moreover, because RLHF makes LLMs so much more useful, it seems to speed up timelines to AGI, giving humanity less time to work on AI safety before an intelligence explosion. Overall, this could increase the likelihood of deception, a "sharp left turn," and existential catastrophe. Of course, there are many more plausible arguments on the topic, such as the argument that we should race toward AGI so that it is built before humanity has even more computational power available (e.g., via Moore's Law).

More technical detail on RLHF is available in Ouyang et al. (2022). A more accessible video explanation is available from HuggingFace on YouTube.



predicts YES

An interesting dynamic here will be how "RLHF" is scoped in the coming years. There have been many discussions of DPO and other related approaches, and arguably "RLHF" should describe anything that is literally RL from any sort of human feedback, rather than the particulars of current implementations, such as PPO.

DPO in particular seems to not fit in any case because it is explicitly and literally not RL.
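For readers who haven't seen it, here is a rough sketch of the DPO loss on a single preference pair, which makes the point above concrete: it is an ordinary supervised loss over log-probabilities from the policy and a frozen reference model, with no reward model, rollouts, or other RL machinery. The beta value and the log-probability numbers below are made up for illustration.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the policy being trained and under a frozen reference model. There
    is no reward model and no environment rollout: it is a plain supervised
    loss over the preference dataset.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))  # -log sigmoid(logits)

# Illustrative numbers only: the policy slightly prefers the chosen response
# relative to the reference, so the loss comes out below log(2) ≈ 0.693.
print(dpo_loss(-12.0, -15.0, -12.5, -14.8))
```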

I'm of the opinion a world with RLHF has a higher p(doom) than one without it.

From my (uneducated) understanding:

A big focus of AI safety is "sure, my model looks good in the training environment, but how do I know its goals generalize outside of that?"

RLHF exaggerates that question - you are making highly weighted (giant) changes to your model over a relatively tiny sample. You have in effect shrunk your training environment; you're smashing your model/square peg through a round hole. I think of it like abusive parenting - sure, you can beat your kid when you find them speaking out of turn, making a mess, etc., and they probably won't make that same mistake for a while - they'll be quiet and tidy - but you won't find your kid's goals are more aligned with yours on average afterwards, and when they go to college/leave the house they are likely to act crazy for a while. Nevertheless, when your friends come over they'll say, "Wow, your kid is so well behaved," and trust them more.

Similarly, RLHF seems to make the model behave better in situations similar to the ones you heavily overweighted during training, but not when it is operating out of distribution. An example of this from the GPT-4 tech report:

Sure, the model doesn't say naughty things as often, but it has also lost its ability to accurately judge what it knows!
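The claim in that figure is about calibration, which is easy to quantify. Below is a sketch of expected calibration error on made-up before/after numbers; the data is invented purely to show the computation and is not taken from the GPT-4 report.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: per confidence bin, the gap between average
    confidence and accuracy, weighted by the fraction of answers in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical answers as (confidence, was-it-right) data, before and after
# tuning. Both models get 8 of 10 right; only the confidences differ.
base_conf, base_right = [0.82] * 10, [1] * 8 + [0] * 2    # roughly calibrated
tuned_conf, tuned_right = [0.95] * 10, [1] * 8 + [0] * 2  # overconfident
print(expected_calibration_error(base_conf, base_right))    # ~0.02
print(expected_calibration_error(tuned_conf, tuned_right))  # ~0.15
```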

RLHF is (borderline) useless for safety.

There’s always a workaround and it just kills functionality.

The correct approach is to clean the training data.

predicts NO

I'm surprised this increased to 47%! I thought people were more pessimistic about RLHF, and I wonder if the resolution criteria make a difference (i.e., the extent to which a Manifold poll will not correspond to the true impact of RLHF). Thanks everyone for voting; it's useful.

I think you need to distinguish between "the AI comes up with plans and those plans are evaluated by humans" vs. "the AI comes up with plans, the effects of the plans are identified (either by executing the plans or by having the AI predict the effects), and then humans evaluate those effects". The former is safe but limited in capabilities because it uses the human's models to evaluate the consequences of the plans, whereas the latter is potentially dangerous for standard reasons.
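A toy contrast of the two setups, with purely hypothetical function names, just to pin down which information the human sees in each case:

```python
from typing import Callable

def oversight_by_plan(plan: str,
                      human_score_plan: Callable[[str], float]) -> float:
    """Setup 1: the human reads the plan itself and scores it, using only
    their own world model to guess its consequences. Capability is bounded by
    what the human can foresee, which is also the source of the safety."""
    return human_score_plan(plan)

def oversight_by_effects(plan: str,
                         get_effects: Callable[[str], str],
                         human_score_effects: Callable[[str], float]) -> float:
    """Setup 2: the plan's effects are obtained first -- by executing the plan
    or by asking the AI to predict them -- and the human scores those effects.
    More capable, but the reward now routes through execution of an untrusted
    plan or through the AI's own (possibly self-serving) prediction."""
    return human_score_effects(get_effects(plan))
```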
