Will Sparse Autoencoders be successfully used on a downstream task in the next year and beat baselines?

105

10kṀ48k

Sep 1

40%

chance

ALL

Resolution Criteria: This question will be resolved according to my personal judgement. I think about this as follows:

Downstream means the task must be one that I think is interesting/non-spurious (ideally one that non-interpretability people cared about beforehand), and where success is objectively measurable (but can be qualitative rather than quantitative). Being real-world useful is a significant plus, but not necessary. Ultimately this will be my judgement call.
- You’re allowed to include rules constraining how the task can be solved if there’s a good argument for how this simulates a realistic situation. Eg with unlearning, you could forbid “just don’t train on the data with the bad concept”, to simulate the real world where we have imperfect data labels (but then, eg, excluding 50% of the bad data should be permissible)
- Given a constraint like “within a certain compute budget”, I’ll count 10% of the training compute of the SAE as needing to come from that budget (to simulate it being amortised over many downstream use cases)
- A task can be chosen post-hoc by the authors or by me. If I believe a paper provides enough evidence of victory on a post-hoc chosen downstream task, but doesn’t explicitly argue this in the paper, I’ll reach out to the authors and use my best judgement.
SAEs must have been compared to appropriate baselines in a fair fight (probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG) etc)
SAEs must beat the baselines, not just be competitive. I'll qualitatively judge "do I think the sample size was large enough that the effect was real rather than noise"
I must believe that SAEs were an important part of the solution (i.e. that the same solution wouldn't work without the SAE)
I'll allow other dictionary learning techniques that are not SAEs but try to find a sparse, interpretable decomposition of model activations
The work must be public (e.g. I won't resolve this on private results from my team or gossip from other labs). I'll allow missing details to be clarified in private communication, so long as the key result is public.
I'll qualitatively evaluate how cherry-picked/brittle the results seem. For example, if an SAE is great for steering if the desired concept has a corresponding latent, but only 10% of concepts thought of have corresponding latents, I'm not sure if I'd count that. But if one of those 10% of concepts was a big deal (eg refusal steering) I might count that

This is very subjective, so I will not trade in this market myself.

For context, here are three of the best contenders so far for doing something useful on a downstream task with model internals, and why I don't think they count:

SHIFT in Sam Marks' Sparse Feature Circuits - An SAE circuit was found feeding into a probe that had picked up on a spurious correlation with gender, and ablating key gender tracking features reduced the spurious correlation. IMO this is the closest attempt so far, and they compare reasonably to a range of baselines, but I think falls a bit short, as the task is a bit too contrived/spurious for my tastes. I think a similar thing in a more complex/useful setting could be sufficient though.
Golden Gate Claude - while very fun, they didn't compare to the baselines of a system prompt or steering vector (and I am unconvinced GGC would win). I would have considered a user study on eg "how fun is the system to use" where GGC won over those baselines to be successful resolution. It’s fine if “how fun is the system to use” was chosen after creating GGC, so long as it was chosen before the user study (no p-hacking!) - I want an existence proof of SAEs being useful.
Refusal is mediated by a single direction - a paper where we found a refusal steering vector and ablated it to jailbreak the model. In addition to not being on SAEs (and so instantly disqualified), we only found that it was competitive with fine-tuning, not better. It might have won on jailbreaking at a given compute budget (or at least, be outside the Pareto frontier of jailbreaking successes against compute budgets)

Technology

LLMs

Mechanistic interpretability

Machine Learning

Get

1,000

to start trading!

People are also trading

Will OpenAI models achieve ≥90% on SimpleBench by the end of 2025?

41% chance

By EOY 2025, will the model with the lowest perplexity on Common Crawl will not be based on transformers?

10% chance

Will there be a more sample-efficient pretraining algorithm than next token prediction for NLP before 2027?

43% chance

At the end of 2025 will a DINO-based algorithm still be SOTA for self-supervised learning in vision?

57% chance

Will OpenAI's next major LLM (after GPT-4) achieve over 50% resolution rate on the SWE-bench benchmark?

79% chance

Will it be possible to disentangle most of the features learned by a model comparable to GPT-2 this decade?

84% chance

Will future large video model (understanding) use pixel loss or embedding loss?

Will it be possible to disentangle most of the features learned by a model comparable to GPT-4 this decade?

37% chance

By EOY 2026, will it seem as if deep learning hit a wall by EOY 2025?

17% chance

Will Academic Capture Become a Topic of 5X Interest Than Baseline by Year End 2033?

Sort by:

bought Ṁ2,400 NO

Relevant update from the team the market creator works on: https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/

@CharlesFoster

banger

bought Ṁ200 NO

Since Sieve hasn't resolved this to yes, I am buying no, since I don't think anything more compelling than that will be found.

sold Ṁ340 YES

We had that recent paper that argued that sparse auto encoders are not particularly good when compared to steering vectors / representation engineering - still not sure what we are supposed to take from this though - I'm skeptical on the uselessness of sparse auto encoders - but maybe the reconstruction loss is just too high / not high enough fidelity - such that representation engineering is just strictly better - but don't we still want to consider SAE's for quantization?

https://arxiv.org/pdf/2502.16681

My gut is that the death of SAEs is overstated but the market seems not to have updated on the paper - so I've sold.

@CampbellHutcheson Even if SAEs manage to beat RepE (pretty skeptical about this already), they are never beating prompting + fine-tuning. See this:

"SAEs must have been compared to appropriate baselines in a fair fight (probing, steering vectors, prompting, finetuning, adversarial example generation (a la GCG) etc)"

Does this resolve to YES? https://www.tilderesearch.com/blog/sieve

bought Ṁ50 YES

I bet yes because:

A year is a long time for things to happen.
The market resolves on Neel's judgement

https://openreview.net/forum?id=FVItLat5ii&noteId=sCBMB7O3qw

apparently SAEs aren't necessarily improving upon simple K-means

@Jono3h Wild, I hadn't seen that paper, it's from Sept 2023!

I don't really update from it though - it was done super early, before we knew much about training good SAEs, it's on CNNs, I don't understand or trust their metrics (and neither do the reviewers as far as I can tell), and they don't really discuss how they train the SAEs and it's a bit of a side thing in the paper, so I'm not at all confident they did it competently enough to learn much from the results.

@NeelNanda Ah I assumed that SAEs on such small models are easy enough to train that that wouldn't be a worry.

I did like the idea of quantifying the quality of the features by applying similarity metrics to the top N maxiximally exciting images (MEIs) for those features.
Intuitively I agree with the authors that you would expect the MEIs for good features to be similar to each other and LPIPS, which the authors use, seems like a well-tested and human-friendly image similarity metric.

I saw that Mekelov, Lange and you have a suggestion for principled evaluations that afaiu involves comparing manually-made dictionaries with SAEs. But I didn't grasp the methodology within 30min.