(5000M subsidy) Will I (porby) think "goal agnosticism" as a concept is still relevant/useful at the end of 2024?

I currently think goal agnostic systems, particularly a subset of predictors, have really nice foundational properties that give us a path to practically usable extreme capability without autodoom.

Some (beefy) background:
FAQ: What the heck is goal agnosticism? — LessWrong

Using predictors in corrigible systems — LessWrong

Resolves yes if, on January 1, 2025:

  1. I still agree with the core arguments underlying goal agnosticism, how it can be used, and how it is likely to scale.

  2. I still think that AI research is on a path that makes roughly goal agnostic foundations a reasonable expectation: not guaranteed, but >15%-ish chance. (Current estimate: ~87%)

Note that resolving yes does not require that I am still working on things related to goal agnosticism.


Some example ways this could resolve no:

  • An experiment shows that simple current-style autoregressive, single-token predictive loss over a reasonably broad training distribution still allows unconditional preferences over world states. For example, "wanting to predict well" rather than simply "predicting well," leading to locally loss-increasing steganography.

  • The industry finds an easier path to extreme capability that doesn't lend itself to goal agnosticism. For example, if someone manages to make end-to-end reinforcement learning on a sparse, distant reward (no predictive world model helping out, no reward shaping, etc) work reliably and for 10,000x less compute than an equivalent predictor-backed system, I'd probably be forced to downgrade the probability of goal agnostic systems a lot. Also, we'd probably explode.

  • I become convinced somehow that the fuzzier parts, like the degree to which we can reliably aim a strong system at useful things, do not work the way I thought, in a way that makes the approach useless.
