Will relaxed adversarial training be used in practice for LLM alignment or auditing before 2028?
79% chance · closes 2027

A number of figures in the alignment community have expressed interest in relaxed adversarial training (RAT) as a technique for model alignment or auditing that may see wider use in the future, much as RLHF eventually became a standard component of large-model alignment.

Short of asking whether relaxed adversarial training will become as ubiquitous as RLHF, this question instead asks whether there will exist a technique that uses RAT to measurably improve the safety of the largest models. The question resolves YES if, before 2028, it is publicly known that there exists a technique that:

  • Involves relaxed adversarial training, i.e. targeted perturbations to a model's latent activations, whether used for auditing model behavior, as a component of a training objective, or as part of a process to improve the model's out-of-distribution (OOD) robustness (a minimal code sketch follows this list)

  • Can scale to models larger than GPT-3

  • Usefully improves some axis of safety, such that at least one AI lab with a market cap over $1 billion is publicly known to implement it for their models. It need not be the best technique of its class, but it has to work and be used
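For readers unfamiliar with the mechanism, the sketch below illustrates what "targeted perturbations to model latents" can look like in practice, in the spirit of the latent adversarial training papers linked in the comments. It is a minimal illustration only, assuming a PyTorch classifier whose forward pass returns logits; the function name `latent_adversarial_loss`, the choice of layer, the L2 projection, and all hyperparameters are illustrative assumptions, not taken from the question or the cited papers.

```python
import torch
import torch.nn.functional as F


def latent_adversarial_loss(model, layer, inputs, labels,
                            eps=0.1, pgd_steps=4, pgd_lr=0.05):
    """Compute a training loss under a worst-case latent perturbation.

    Sketch of latent adversarial training: rather than perturbing the
    input tokens, search (by projected gradient ascent) for a small
    additive perturbation to one layer's activations that maximizes
    the task loss, then return the loss under that perturbation so the
    caller can backpropagate into the model's weights.
    Assumes `model(inputs)` returns classification logits.
    """
    delta = None  # the latent perturbation; None disables the hook

    def add_delta(module, args, output):
        # A forward hook that returns a value replaces the layer output.
        return output if delta is None else output + delta

    handle = layer.register_forward_hook(add_delta)
    try:
        # Probe pass to discover the activation shape at `layer`.
        captured = {}

        def probe_hook(module, args, output):
            captured["shape"] = output.shape  # returns None: output unchanged

        probe = layer.register_forward_hook(probe_hook)
        with torch.no_grad():
            model(inputs)
        probe.remove()

        delta = torch.zeros(captured["shape"], device=inputs.device,
                            requires_grad=True)

        # Inner loop: gradient ASCENT on delta, projected onto an L2
        # ball of radius eps (the "relaxed" adversary acts on latents,
        # not on discrete inputs).
        for _ in range(pgd_steps):
            loss = F.cross_entropy(model(inputs), labels)
            (grad,) = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += pgd_lr * grad
                norm = delta.norm()
                if norm > eps:
                    delta *= eps / norm
        delta = delta.detach()  # freeze the adversary

        # Outer objective: be correct despite the adversarial latents.
        return F.cross_entropy(model(inputs), labels)
    finally:
        handle.remove()
```

A training loop would call this in place of the ordinary loss, e.g. `latent_adversarial_loss(model, model.layers[6], x, y).backward()` (with a hypothetical layer handle), then step the optimizer as usual.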

Comments:

Looking very likely, imo, given this paper:
https://arxiv.org/abs/2407.15549


Reposting in the context of https://arxiv.org/abs/2403.05030