We are going to start running out of data to train large language models in [YEAR]


2030

2023

13%

2024

26%

2025

18%

2026

2027

2028

2029

14%

2030

Resolves to N/A if the condition fails, i.e., if no such data shortage ends up taking place.



Models will never run out of data to train on. The key distinction here is organic vs. synthetic data; we can always produce more of the latter but are quite limited (from a long-term perspective) by the former.

Please note that this question is specifically about large language models.

Can you elaborate on what you mean by “run out” and “shortage”?

@BTE Intuitively:

If you imagine training as a function of model parameters (P), architecture efficiency (E), number of unique data points (N_U), number of augmented data points (N_A), computational capacity (C), storage (S), etc. (to name the major ones), then

training = f(P, E, N, C, S), where N = N_U + N_A

As P increases, we should increase the N datapoints proportionally to ensure that we don't end up with P >> N, which might lead to overfitting (C and S have a similar relationship to N).

Now, usually when people say "run out of data", the implication is that as the number of parameters P increases, there might not be a proportional increase in datapoints N for optimal training, which leads to diminished training efficiency. That is what I'm going with for now.
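
To make the "proportional increase" reading concrete, here is a rough Python sketch. It is only an illustration of the definition above, not a forecasting method: the function names, the `points_per_param` target ratio, and the growth rates are all hypothetical assumptions, and nothing here uses real figures.

```python
def data_shortfall(params, unique_points, augmented_points, points_per_param=1.0):
    """Return how many data points are missing for a model of size `params`.

    points_per_param is a hypothetical target ratio (an assumption,
    not a value taken from the discussion).
    """
    available = unique_points + augmented_points  # N = N_U + N_A
    required = params * points_per_param
    return max(0.0, required - available)


def first_shortage_year(start_year, params, data, param_growth, data_growth,
                        points_per_param=1.0, horizon=20):
    """Return the first year in which required data exceeds available data.

    params and data are the starting values of P and N; param_growth and
    data_growth are assumed yearly multipliers. Returns None if no shortage
    appears within `horizon` years.
    """
    for offset in range(horizon + 1):
        p = params * param_growth ** offset
        n = data * data_growth ** offset
        if p * points_per_param > n:
            return start_year + offset
    return None


# Toy numbers only: parameters double yearly while data stays flat.
print(first_shortage_year(2023, params=1.0, data=4.0,
                          param_growth=2.0, data_growth=1.0))
```

If parameters and data grow at the same rate in this sketch, `first_shortage_year` returns `None`, which matches the idea that a shortage only shows up when N stops keeping pace with P.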

@firstuserhere Do we know the rate at which new data is generated?

@firstuserhere Also is this about all modalities or just text?

@BTE Just text-generating models. However, it is possible that models with multimodal input and single-modality (text) output will become the norm.

@firstuserhere And how does synthetic or augmented data factor? Like how many times can you augment a dataset before it loses its value as grounded in truth?

@BTE Yes, that's fine. People speculate that augmenting data has diminishing returns and doesn't help beyond a few iterations, and the same argument is made for synthetic data.

I don't personally buy it for synthetic data, especially with models of the form (multimodal -> text) generating synthetic text data. So, I expect that as long as synthetic data isn't bad for model training, this is going to be juiced as much as possible. If not, we might start running out of data :)

@firstuserhere I think this point is usually overlooked. You could have a model of the form

text + audio + image -> text

and this text is possibly "richer" in quality than just synthetic data taken from a model of the form

text -> text

@firstuserhere Maybe. It all gets converted to tokens so I guess it depends whether there is added value in having pixel tokens and text tokens (both numbers once tokenized) used to generate new text. It’s an interesting idea.

@BTE Agreed, which is why I think a bigger tokenizer is almost certainly better for performance. Although the following questions are not very useful because they ask for OpenAI's actions (i.e. whether they will release it instead of using it), they're still worthwhile

@firstuserhere So if the rate of increase in the number of parameters falls to keep the number of parameters proportional to the amount of data, then you wouldn't say we're running out of data?