Will relaxed adversarial training be used in practice for LLM alignment or auditing before 2028?
79% chance · closes 2027

A number of figures in the alignment community have expressed interest in relaxed adversarial training (RAT) as a technique for model alignment or auditing that may see wider use in the future, much as RLHF eventually became a standard component of large-model alignment.

Short of asking whether relaxed adversarial training will become as ubiquitous as RLHF, this question instead asks whether there will exist a technique that uses RAT to measurably improve the safety of the largest models. The question resolves YES if, before 2028, it is publicly known that there exists a technique that:

  • Involves relaxed adversarial training, i.e. targeted perturbations to a model's latent activations, whether used for auditing model behavior, as a component of a training objective, or as part of a process to improve the model's out-of-distribution (OOD) robustness (a minimal code sketch follows this list)

  • Can scale to models larger than GPT-3

  • Usefully improves some axis of safety, such that at least one AI lab with a market cap over $1 billion is publicly known to implement it for their models. It need not be the best technique of its class, but it has to work and be used
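For readers unfamiliar with the mechanism, the sketch below illustrates what "targeted perturbations to model latents" can look like in practice, in the spirit of the latent adversarial training papers linked in the comments. It is a minimal illustration only, assuming a PyTorch classifier whose forward pass returns logits; the function name `latent_adversarial_loss`, the choice of layer, the L2 projection, and all hyperparameters are illustrative assumptions, not taken from the question or the cited papers.

```python
import torch
import torch.nn.functional as F


def latent_adversarial_loss(model, layer, inputs, labels,
                            eps=0.1, pgd_steps=4, pgd_lr=0.05):
    """Compute a training loss under a worst-case latent perturbation.

    Sketch of latent adversarial training: rather than perturbing the
    input tokens, search (by projected gradient ascent) for a small
    additive perturbation to one layer's activations that maximizes
    the task loss, then return the loss under that perturbation so the
    caller can backpropagate into the model's weights.
    Assumes `model(inputs)` returns classification logits.
    """
    delta = None  # the latent perturbation; None disables the hook

    def add_delta(module, args, output):
        # A forward hook that returns a value replaces the layer output.
        return output if delta is None else output + delta

    handle = layer.register_forward_hook(add_delta)
    try:
        # Probe pass to discover the activation shape at `layer`.
        captured = {}

        def probe_hook(module, args, output):
            captured["shape"] = output.shape  # returns None: output unchanged

        probe = layer.register_forward_hook(probe_hook)
        with torch.no_grad():
            model(inputs)
        probe.remove()

        delta = torch.zeros(captured["shape"], device=inputs.device,
                            requires_grad=True)

        # Inner loop: gradient ASCENT on delta, projected onto an L2
        # ball of radius eps (the "relaxed" adversary acts on latents,
        # not on discrete inputs).
        for _ in range(pgd_steps):
            loss = F.cross_entropy(model(inputs), labels)
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += pgd_lr * grad
                norm = delta.norm()
                if norm > eps:
                    delta *= eps / norm
        delta = delta.detach()  # freeze the adversary

        # Outer objective: be correct despite the adversarial latents.
        return F.cross_entropy(model(inputs), labels)
    finally:
        handle.remove()
```

A training loop would call this in place of the ordinary loss, e.g. `latent_adversarial_loss(model, model.layers[6], x, y).backward()` (with a hypothetical layer handle), then step the optimizer as usual.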

Comments:

Looking very likely, imo, given this paper:
https://arxiv.org/abs/2407.15549


Reposting in the context of https://arxiv.org/abs/2403.05030