In "Moving the Eiffel Tower to ROME", a paper claimed to have identified where the fact "The Eiffel Tower is in France" had been stored within GPT-J-6B, in the sense that you could poke the GPT there and make it believe the Eiffel Tower was in Rome.
"The Eiffel Tower is in France" seems (in my personal judgment) like the sort of fact that early AI pioneers could and did represent within GOFAI systems. GPT-J probably does more with that fact - it can for example answer how to get to the Eiffel Tower from Berlin, believing that the Eiffel Tower is in Rome. But the paper didn't offer neural transparency into how GPT-J gives directions, we don't know the stored patterns for answering that part - just a neural representation of the brute idea that GOFAI pioneers might've represented with in(Eiffel-Tower, Rome).
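To make the GOFAI comparison concrete, here is a minimal sketch of how a symbolic system might store and edit such a fact. It is an illustration only, not anything taken from the ROME paper; the names `holds` and `edit_fact` and the tuple layout are hypothetical.

```python
# Hypothetical GOFAI-style fact store: each fact is a
# (predicate, subject, object) tuple held in a plain set.
facts = {("in", "Eiffel-Tower", "France")}


def holds(store, predicate, subject, obj):
    """Return True if exactly this fact is in the store."""
    return (predicate, subject, obj) in store


def edit_fact(store, predicate, subject, new_obj):
    """Return a new store where (predicate, subject) maps only to new_obj."""
    kept = {f for f in store if (f[0], f[1]) != (predicate, subject)}
    kept.add((predicate, subject, new_obj))
    return kept


# The symbolic analogue of "poking" the model: overwrite the stored object.
edited = edit_fact(facts, "in", "Eiffel-Tower", "Rome")
```

The edit changes one brute fact and nothing else; the store has no way to give directions from Berlin, which is the part the paper did not make transparent.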
This market reflects the probability that, in the personal judgment of Eliezer Yudkowsky, anyone will have uncovered any sort of data, pattern, cognitive representation, within a text transformer / large language model (LLM), whose semantic pattern and nature wasn't familiar to AI and cognitive science in 2006 (to pick an arbitrary threshold for "before the rise of deep learning").
Also in 2006, somebody might've represented "the Eiffel Tower is in France" by assigning spatial coordinates to the Eiffel Tower and a regional boundary to France. Idioms like that appear in eg video games long predating 2006. Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model, so far as I know; but even if someone did so before the end of 2026 - as mighty a triumph as that would be - it would not (in the personal judgment of Eliezer Yudkowsky) be an instance of somebody finding a cognitive pattern represented inside a text transformer, which pattern was unknown to cognitive science in 2006.
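As a rough sketch of the coordinate-plus-boundary idiom described above, the following checks whether a point falls inside a region. The bounding box for France is a crude approximation assumed for illustration (it also covers parts of neighbouring countries), and the names `Region` and `contains` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned latitude/longitude box, a crude stand-in for a real border."""
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat, lon):
        """Return True if the point lies inside the box (edges included)."""
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)


# Assumed, approximate values; a real system would use a polygon boundary.
FRANCE_BOX = Region(lat_min=41.3, lat_max=51.1, lon_min=-5.2, lon_max=9.6)
EIFFEL_TOWER = (48.858, 2.294)   # approximate (lat, lon)
ROME = (41.9, 12.5)              # approximate (lat, lon)

tower_in_france = FRANCE_BOX.contains(*EIFFEL_TOWER)
rome_in_france = FRANCE_BOX.contains(*ROME)
```

Here "is in France" is derived from geometry rather than stored as a symbol, which is the sense in which the idiom differs from in(Eiffel-Tower, France) while still predating 2006.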
2006 similarly knew about linear regression, k-nearest-neighbor, principal component analysis, etcetera, even though these patterns were considered "statistical learning" rather than "Good-Old-Fashioned AI". Identifying an emergent kNN algorithm inside an LLM would again not constitute "understanding via transparency, within an LLM, some pattern and representation of cognition not known in 2006 or earlier". Likewise for TD-learning and other biologically inspired algorithms, including those considered the domain of neuroscience (from 2006 or earlier).
GOFAI and kNN and similar technologies did not suffice to, say, invent new funny jokes, or carry on a realistic conversation, or do any sort of intellectual labor. The intent of this proposition, if relevant, is to assert that by end of 2026 we will not be able to grasp any inkling of the cognition inside of LLMs by which they do much more than AIs could do in 2006; we will not have decoded any cognitive representations inside of LLMs supporting any cognitive capabilities original to the era of deep learning. We will only be able to hunt down internal cognition of the sort that lets LLMs do more trivial and old-AI-ish cognitive steps, like localizing the Eiffel Tower to France (or Rome); on their way to completing larger and more impressive tasks, incorporating other cognitive steps; whose representations inside the LLM, even if we have some idea of which weights are involved, have not yet been decoded in a way semantically meaningful to a human.
@AlexMizrahi This market is not about identifying semantics inside an LLM. It is not about identifying semantics inside an LLM which we did not previously know to be inside LLMs. This market is about identifying semantics inside an LLM such that those semantics, once uncovered, teach us something about semantic representations in general which we did not know in 2006.
@MartinRandall What matters isn't the opacity of the discovering program, but whether the discovered result is semantically transparent to us.
Just like how everything is matrix multiplication plus a nonlinearity, what if at a higher level everything is made of like a dozen things all of which are known? Then the question would resolve only based on how those things are organized. But then I could imagine those organizing systems also being from the same dozen components.
I'm on the fence about how likely this is but this is roughly my thought process
New Anthropic paper dropped
https://www.anthropic.com/news/mapping-mind-language-model
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
I see this question as a sort of dual to the increase in capabilities that shocked the world with each new iteration of GPTs. With the progress of interpretability, it would be surprising to me if we weren't able to elucidate some of the deep reasons why they perform so well and, en passant, discover something new and strange about linguistics and forty other things.
Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
How would you resolve this market if researchers distil causal models from an LLM that are much better than causal models constructed by older means? Would it make a difference if they had, for example, a modestly different notion of “intervention” to existing causal models?
My guesses are “no” and “maybe” in that order
@DavidJohnston I think if we learn a different notion or more compact representation for interventions off studying LLMs, that definitely counts. In the former case I think I want to know more about "better"; if we just distilled knowledge in a known format that LLMs learned by inscrutable means, we have not found and understood a new algorithm.
Semantics are hard. And I think as we poke and prod LxMs, we will learn much more about how to update our thinking about how the human brain functions in re: semantics and linguistics. But I think the time horizon is farther out than 2026, because I don't think there is enough interpretability between shape rotators and wordcels yet.
By the late 2020s, we might be able to decode something meaningful from current models (with the help of later models) - but the later models may still be out of reach.
@EliezerYudkowsky Selling all shares in this market to avoid any appearance of conflict of interest in judging it.
Re: "Nobody has yet identified emergent environmental-spatial-coordinate representations inside a text transformer model"
Another step: How do Language Models Bind Entities in Context?
Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.
@stuhlmueller afaict from the introduction, "binding" is sth GOFAI already did, e.g. via expressions like "lives(Alice,Paris)". So this is more a step towards finding all of GOFAI again (which might also help find novel insights!) than a direct step towards discovering novel cognitive algorithms.
So after working on interpretability for some time, I've updated in favour of my impression that interpretability progress is going to be very fast in the next few years.
I think Eliezer will be predictably surprised one or two more times until he updates and starts expecting faster progress.
The thing is, it turns out mechanistic interpretability is not extremely hard (it is hard in an absolute sense, but just the normal kind of hard for a typical field of science), and the lack of progress until now is mostly because, except for a few people like Chris Olah and his team, it was basically not tried for years until recently; a lot of the simplest ideas, like using autoencoders to detect features, just hadn't been tried and tend to just work.
But fortunately this is changing and seems likely to change further: ML academics are likely to do more work in the field now that Anthropic, DeepMind, and things like the ROME paper are making it trendy, and Neel Nanda has single-handedly gotten lots of people interested in the field.
It also seems like some unknowns about things like how models represent data might just resolve in the most favorable way, where it's mostly linear representations and features as directions.
(at least for transformers)
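A toy sketch of what "features as directions" would mean, assuming the linear-representation picture holds: reading a feature is a dot product with a fixed direction, and nudging the activation along that direction shifts the readout by exactly the nudge size. The vectors and the names `read_feature` and `steer` are made up for illustration and do not come from any real model.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def read_feature(activation, direction):
    """Feature strength = projection of the activation onto a unit direction."""
    return dot(activation, direction)


def steer(activation, direction, alpha):
    """Move the activation by alpha along the feature direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Hypothetical 3-dimensional "residual stream" and a unit-norm feature direction.
activation = [1.0, 2.0, 3.0]
direction = [0.6, 0.8, 0.0]

before = read_feature(activation, direction)
after = read_feature(steer(activation, direction, 1.5), direction)
```

Because the direction has unit norm, `after - before` equals the steering coefficient; that linearity is what makes this the favorable case for interpretability.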
Now unfortunately I think this doesn't necessarily mean we will have good enough interpretability to know what we are doing in time, since that's a high bar, but understanding any specific subtasks that couldn't be done with GOFAI seems an easier subproblem that seems likely to be solved anyway.
I do get why it's a harder problem than understanding some already known algorithm, but new complicated stuff is made of smaller subcomponents we can understand, and it seems likely we'll have useful advances in automated interpretability during 2024-2025.
@VictorLevoso sure, but would they find something interesting and insightful to make this market resolve Yes?
@ICRainbow I think that it depends partly on what counts for the market and on unknowns about how LLMs actually work.
But models seem to be able to do stuff like explaining jokes, which couldn't easily be done with GOFAI and which is a relatively narrow task one could focus on: spend some time identifying the parts of the model involved, understand it piece by piece while getting a lot of feedback, and that leads to people figuring out how the whole thing works.
Like, dunno, it just seems like the kind of thing I expect to see in e.g. an Anthropic paper by 2026.
Not with more than the current probability of the market, but I'm mostly explaining why I've kept buying yes all this time and been the top yes bettor here, since my reasons have changed over time compared to my old comments, and some other yes comments seem like they are updating for the wrong reasons.
@VictorLevoso Also, if we got to the point where we completely understand LLMs and how they do all the things GOFAI couldn't do, and the market still wouldn't resolve yes, then I feel like something was wrong with the market resolution criterion with respect to the spirit of the question, and Eliezer's notion of patterns of cognition not known in 2006 is probably just not coherent.
I feel like to the extent the question makes sense at all there is something in there that counts as novel enough.
Interesting alleged 'unlearning' capability, could this potentially lead down some promising paths towards internal patterns?