I currently think goal-agnostic systems, particularly a subset of predictors, have really nice foundational properties that give us a path to practically usable extreme capability without autodoom.
Some (beefy) background:
- FAQ: What the heck is goal agnosticism? — LessWrong
- Using predictors in corrigible systems — LessWrong
Resolves yes if, on January 1, 2025:
- I still agree with the core arguments underlying goal agnosticism, how it can be used, and how it is likely to scale.
- I still think that AI research is on a path that makes roughly goal-agnostic foundations a reasonable expectation: not guaranteed, but a >15%-ish chance. (Current estimate: ~87%.)
Note that resolving yes does not require that I am still working on things related to goal agnosticism.
Some example ways this could resolve no:
- An experiment shows that simple current-style autoregressive, single-token predictive loss over a reasonably broad training distribution still allows unconditional preferences over world states. For example, a model that "wants to predict well" rather than simply "predicting well" could engage in locally loss-increasing steganography (see the sketch after this list).
- The industry finds an easier path to extreme capability that doesn't lend itself to goal agnosticism. For example, if someone manages to make end-to-end reinforcement learning on a sparse, distant reward (no predictive world model helping out, no reward shaping, etc.) work reliably and for 10,000x less compute than an equivalent predictor-backed system, I'd probably be forced to downgrade the probability of goal-agnostic systems a lot. Also, we'd probably explode.
- I somehow become convinced that the fuzzier parts, like the degree to which we can reliably aim a strong system at useful things, don't work the way I thought, in a way that makes the approach useless.
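
For concreteness on the first resolve-no example: the "single-token predictive loss" in question is just per-token cross-entropy against the observed next token, with no term referencing future world states. Here's a minimal sketch, assuming a hypothetical decoder-only `model` that maps token ids to next-token logits (PyTorch used purely for illustration):

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Single-token autoregressive predictive loss over a batch of sequences.

    `model` is a hypothetical decoder-only network: (batch, seq) token ids ->
    (batch, seq, vocab) next-token logits. `tokens` is a (batch, seq) tensor of
    integer ids drawn from the training distribution.
    """
    logits = model(tokens[:, :-1])   # predict each position's next token from its prefix
    targets = tokens[:, 1:]          # the tokens actually observed next
    # Average cross-entropy over all positions; the loss only scores local
    # next-token accuracy, with no term about downstream world states.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

The experiment described above would then be looking for behavior that systematically sacrifices this local loss in exchange for influence over later predictions or world states, i.e. evidence of "wanting to predict well" rather than just "predicting well."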