Will a pretraining algorithm for language models that meaningfully improves on the sample efficiency of next token prediction be widely known before 2027?
Some details:
The technique must involve self-supervised learning on unlabeled data
The technique must have documented scaling properties showing that it meaningfully outperforms next token prediction in test perplexity as a function of training data, for whichever model architectures are popular by 2027
It's fine if there are tradeoffs with compute efficiency
It's fine if next token prediction outperforms the new technique early in training, or for small training runs, as long as scaling trends predict that the new technique would be better on runs using at least 10^26 FLOP and 15T tokens (roughly the budget of Llama 3 400B)
It must be accepted within the ML community that the technique is broadly superior to next token prediction (even if there are some tradeoffs) and has the potential to scale to outperform the best prior models trained using next token prediction
To validate the scaling potential of the method, it must be used to train a model which qualitatively matches or exceeds GPT-4 (if the above conditions hold before 2027, I will wait until July 2027 for such a model and will resolve YES if one is produced)
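To make the crossover criterion concrete, here is a minimal sketch of the kind of extrapolation the question requires: fit Chinchilla-style data-scaling curves of the form L(D) = E + A / D^alpha to each method and check where the fitted curves predict the new method pulls ahead. All coefficients below are hypothetical placeholders, not measured values; the point is only that the fitted trends must favor the new method at or before 15T tokens.

```python
import math

def loss(D, E, A, alpha):
    # Chinchilla-style data-scaling form: L(D) = E + A / D^alpha
    return E + A / D**alpha

# Hypothetical fitted coefficients for next-token prediction (ntp)
# and a candidate objective (cand). The candidate has a worse constant
# but a steeper exponent, so it overtakes ntp at large data budgets.
ntp = dict(E=1.69, A=410.0, alpha=0.28)
cand = dict(E=1.69, A=2150.0, alpha=0.34)

def crossover(lo=1e6, hi=1e15):
    # Bisect in log-space for the token count where the fitted curves meet.
    f = lambda D: loss(D, **ntp) - loss(D, **cand)
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if f(mid) * f(lo) <= 0:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

D_star = crossover()
threshold = 15e12  # 15T tokens, per the resolution criteria

print(f"predicted crossover: {D_star:.3e} tokens")
print("candidate wins at 15T tokens:",
      loss(threshold, **cand) < loss(threshold, **ntp))
```

Under these made-up coefficients the crossover lands around 1T tokens, so the candidate would satisfy the criterion; with real fits, resolution depends on where the measured curves actually intersect.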
@JohnCarpenter Tokens are discretizations of data, generally text but possibly other modalities. In autoregressive language modeling, the goal is to produce probability distributions over token sequences. Currently, most tokenizers produce about 0.75 words per token (roughly 1.3 tokens per word), but this might change. I’m not referring to latent space tokens. When I mention “15T tokens” in the resolution criteria, I’ll treat that as referring to an equivalent amount of unlabeled training data, even if tokenization methods change.
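As a rough illustration of what "an equivalent amount of unlabeled training data" could mean if tokenization changes, one can normalize token budgets by an assumed tokens-per-word ratio. The ratio below is a common rule of thumb for English BPE tokenizers, not a fixed constant:

```python
# Rough conversion between token budgets under different tokenizers.
# Ratios are illustrative rules of thumb, not measured values.
TOKENS_PER_WORD_BPE = 1.33   # typical English BPE: ~0.75 words per token
budget_tokens = 15e12        # 15T tokens, the question's threshold

# Equivalent word count of the same text under the assumed ratio
budget_words = budget_tokens / TOKENS_PER_WORD_BPE

# The same corpus under a hypothetical coarser tokenizer (1 token per word)
equivalent_tokens_coarse = budget_words * 1.0

print(f"{budget_words:.2e} words")   # ~1.13e13 words
print(f"{equivalent_tokens_coarse:.2e} tokens under a word-level tokenizer")
```

The underlying corpus size (in words or bytes) is what stays fixed; the token count attached to it shifts with the tokenizer.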
@RemNi My understanding of the method is that an autoencoder is pretrained and diffusion is used to model the latent space. I’m not familiar with how the technique can be used for autoregressive language modeling, but if there’s a way that works and its scaling is borne out in the manner required by the question, then it would count.