Texts generated by our civilization inherently possess a certain level of entropy, representing the amount of information and unpredictability within the text. The loss function of large language models measures the difference between predicted outputs and actual human-generated text. As models improve, this loss decreases, but it cannot be reduced below the natural entropy of human text. The original scaling laws paper (https://arxiv.org/abs/2001.08361) speculated about when and how this entropy level might be reached, and the idea is taken seriously within the AI research community.
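To make the "cannot be reduced below the natural entropy" claim precise, here is the standard information-theoretic decomposition (a general fact about cross-entropy, not something specific to this question): for text drawn from the true distribution P and a model Q,

```latex
\mathbb{E}_{x \sim P}\!\left[-\log Q(x)\right]
  \;=\; H(P) \;+\; D_{\mathrm{KL}}\!\left(P \,\|\, Q\right)
  \;\ge\; H(P)
```

Since the KL divergence is non-negative, the expected loss is bounded below by H(P), the entropy of the text distribution, with equality only when the model matches that distribution exactly.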
Resolution Criteria:
1. Evidence of Loss Function Plateau:
   - There must be significant evidence that further progress in frontier large language models does not lead to a decrease in the loss function. This includes:
     - Research papers and technical reports showing that improvements in model architecture, training data, and compute resources no longer yield significant reductions in loss.
     - Analyses of loss function trends over time indicating a plateau.
2. Consensus in Research Community:
   - There must be at least moderate consensus in the AI research community that the plateau in the loss function is due to reaching the entropy level of existing human-generated texts. This consensus can be demonstrated by:
     - Publications in peer-reviewed journals or conferences in which multiple researchers or groups independently arrive at this conclusion.
     - Statements or endorsements from (some) leading AI researchers or organizations acknowledging that the loss function has approached the theoretical minimum entropy of human text.
I'm pretty sure it's a limiting process, and the plateau stands at infinity in terms of size/scaling. However, with the exponentially increasing size of models, people might perceive the loss function as plateauing, or, more likely, as hitting diminishing returns severe enough that further scaling isn't worthwhile. In this case, am I right in my understanding that this would resolve the market NO? It's not an actual plateau, it's an apparent plateau. I expect experts to largely agree at the time.
@Bayesian Reaching a plateau and reaching the entropy level are not the same thing, although one way to reach a plateau is to reach the entropy level. If the entropy level is reached, it will likely look like a more or less sudden stop in the decrease of the loss function. Given the current scaling laws, the entropy level would be reached long before infinity. Indeed, the original paper on scaling laws speculated that it could be well within the computational capabilities of large companies in 2024, although the authors warned that these calculations are uncertain. Answering the question more concretely: if there is just a plateau, it does not look like the entropy level has been reached, and there is therefore not even moderate consensus that the entropy level has been reached, then this will not be a reason to resolve the market YES; if that state of affairs continues until the resolution date, the market will resolve NO.
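As a toy sketch of the distinction drawn here (diminishing returns versus an actual entropy floor), the snippet below evaluates a power-law loss curve with and without an irreducible term. The constants are made up for illustration and are not the fitted values from the scaling laws paper.

```python
# Toy scaling curves: loss(N) = floor + A / N**alpha
# A, alpha, and the floor E are made-up illustrative constants,
# NOT the fitted values from the scaling laws paper.
A, alpha, E = 100.0, 0.2, 1.7

def loss(n_params: float, floor: float) -> float:
    """Per-token loss (nats) under a toy power law with an optional irreducible floor."""
    return floor + A / n_params**alpha

for exp in range(8, 16):                    # model sizes from 1e8 to 1e15 parameters
    n = 10.0 ** exp
    pure = loss(n, floor=0.0)               # keeps creeping down forever (plateau "at infinity")
    floored = loss(n, floor=E)              # visibly flattens out near E (the entropy level)
    print(f"N = 1e{exp:02d}   pure power law: {pure:6.3f}   with entropy floor: {floored:6.3f}")
```

On a log-x plot the first curve keeps declining, ever more slowly, while the second bends toward a horizontal asymptote near E; the question is essentially whether observed frontier losses will look like the second curve, and whether researchers will agree that the flattening is the entropy level rather than something else.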
@IhorKendiukhov I see. I must misunderstand what the entropy level means. I thought it was the theoretical minimum loss on new text from the same distribution, i.e. perfectly learning the distribution. I may read the paper later. Mb!
@IhorKendiukhov I mean, given that the generator of that text is a worldwide network of minds and computers and so on, there are limits to how inherently compressible it can be; some things will appear like noise to us, but a lot of what appears as noise will just be structure the model could pick up with 10 orders of magnitude more scale and compute, I would think? If a part of the internet has [hash]: [key], the key part has very low entropy (given the hash), but to reach that low level of entropy you need to invert the hash within your model's weights, which, though theoretically possible, is practically completely infeasible. That's a silly example of how the entropy level, being the theoretical lower bound, is very, very low for something as high-dimensional as internet text. Which would mean we will definitely not reach it, merely perhaps go very close, or closer than our measurement can differentiate from actually reaching it. Which would mean the level isn't reached and the market resolves NO. Curious where this reasoning is going wrong.
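A toy version of the [hash]: [key] example (SHA-256 stands in for whatever mapping actually appears in the data; the keyspace sizes are chosen only to make the point):

```python
import hashlib
import itertools
import string

def sha256_hex(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

# Given the hash, the key is (for all practical purposes) uniquely determined,
# so its conditional entropy is ~0 bits -- but realizing that bound means
# inverting the hash.

# Tiny keyspace: brute-force inversion is trivial, so a predictor could nail the key.
target = sha256_hex("cat")
for chars in itertools.product(string.ascii_lowercase, repeat=3):
    key = "".join(chars)
    if sha256_hex(key) == target:
        print("recovered key:", key, "after at most", 26**3, "guesses")
        break

# Realistic keyspace: a random 16-byte key has 2**128 possibilities. Its entropy
# given the hash is still ~0, but no feasible amount of compute or model scale
# realizes that bound.
print("worst-case guesses for a random 16-byte key:", 2**128)
```

So the theoretical entropy floor of text containing such pairs sits far below anything a practical model will achieve, which is exactly the worry raised above.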
@Bayesian It is indeed the theoretical limit, but I think there are various opinions on when we are "close enough" to it and when it has been reached "statistically significantly". The loss function is an average, and predicting hash-key pairs, for example, may contribute only negligibly to its dynamics.

Or, to give a more extreme example: imagine we receive a super smart message from a super advanced alien civilization, and this message is used in the training data. Formally, it is now part of "human text". Several years later, we receive its continuation. We ask an LLM to predict it, and it can't. And yet this LLM may very well have an observed loss that corresponds to the entropy of human text: making the LLM capable of predicting the alien text may change the loss on randomly sampled pieces of human text only infinitesimally. That is at least conceivable and does not seem to contradict anything. The opposite scenario is of course also conceivable.

Generally, I think it is important to keep in mind that entropy is a statistical macro property when thinking about such things. The loss function may effectively stay the same even if there are meaningful changes in capabilities. Consider another hyperbolic thought experiment: imagine there is a "God password", a sequence short enough relative to the test data that being able to predict it does not noticeably change the loss function. But outputting this sequence opens a Universe God Console where you can fulfill any of your wishes. That is a dramatic change in capabilities, but not a change in the loss function. You may argue that, per the formula, this changes the loss function and hence lowers the entropy bound, which is formally true, but we are not going to observe this change in the loss function, in the same sense that we are not going to observe a temperature change because a few atoms started moving faster.

And some may argue that, although new capabilities will or may keep emerging afterwards, statistically the entropy will have been reached. At least that is how I understand their position. I think other types of arguments, and of course counterarguments, are possible here.
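A quick back-of-the-envelope for the "negligible contribution" point (the corpus size, segment length, and vocabulary below are made-up round numbers):

```python
import math

eval_tokens = 10**10       # size of a hypothetical evaluation corpus, in tokens
segment_tokens = 100       # length of the special sequence (alien message, "God password", ...)
vocab_size = 50_000        # hypothetical vocabulary size

worst = math.log(vocab_size)   # ~10.8 nats/token if the model is completely clueless on it
best = 0.0                     # 0 nats/token if it predicts the sequence perfectly

delta_avg = segment_tokens * (worst - best) / eval_tokens
print(f"maximum shift in average loss: {delta_avg:.1e} nats/token")   # ~1.1e-07
```

A shift of roughly 1e-7 nats per token is far below anything a realistic evaluation could detect, which is the sense in which capabilities can change dramatically while the measured loss effectively does not.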
@IhorKendiukhov Alright, makes sense, I think. "statistically close enough that we can't noticeably beat it later, in terms of averages"? That would mean there would be no point at all (past that point) in a larger, more pre-trained model, correct? After you reach the approximate asymptote, any new data is just noise, because all the structure that generates the data is modelled accurately. This seems so far from what we currently have that I am very confused.
@Bayesian Also, regarding this part: "Which would mean we will definitely not reach it, merely perhaps go very close, or closer than our measurement can differentiate from actually reaching it." Since we can't directly measure the difference between the entropy and the loss function (we don't know the entropy), we can only try to detect when the loss stops falling, and this I think is quite possible: the loss can't keep following the scaling laws for many more years; the decrease must either decelerate or stop. Fundamentally, of course, this does not allow us to conclude that the entropy has been reached, and there is huge space for all kinds of ideas.
@Bayesian Regarding training larger models - I don't know. It seems it should be as you said, if that is the case. But, as I said, it is at least conceivable that capabilities may grow even if the loss function stays effectively the same.
@IhorKendiukhov If the loss function is as low as it can be relative to some given distribution of text, the only way to make the model more capable is to make the loss function worse, right?
Unless the evaluation is underpowered for measuring that change, in which case yeah, "stays effectively the same" can be true. Generally I really don't expect models to stop learning new structures that lower their loss, even after 3-5 more orders of magnitude. It will obviously be harder and harder to perceive the lowering loss, but at that scale it will be easier to measure really precisely whether the loss actually has lowered.
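On the "easier to measure really precisely at that scale" point: the statistical precision of a measured average loss improves roughly as 1/sqrt(n) in the number of held-out tokens (treating tokens as independent, which overstates the precision somewhat; sigma below is a made-up ballpark):

```python
import math

sigma = 2.0   # assumed per-token standard deviation of the loss, in nats (illustrative)

for n in (10**8, 10**10, 10**12):
    standard_error = sigma / math.sqrt(n)
    print(f"held-out tokens: 1e{int(math.log10(n)):02d}   standard error of mean loss: {standard_error:.1e} nats")
```

So with a large enough held-out set, differences on the order of 1e-5 nats between models are in principle detectable, although tiny contributions like the one computed a few comments above (~1e-7 nats) can still be lost in the noise.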
@Bayesian You can of course have an LLM with a higher loss that is more capable in terms of the capabilities we measure and care about, but yes, for a given LLM with a given loss that is being trained by gradient descent, we should expect it to gain capabilities only while decreasing loss, because decreasing loss is what it is optimizing for anyway.