End of pre-training era for language models: Will an LM fine-tune for more FLOPs than it is pre-trained for, before 2026
22% chance

The LM must have 'frontier' performance (PILE or similar perplexity at or better than the SotA from one year prior). The LM must have been trained after 2022.

If it's unclear whether this has happened, I will give this a year to resolve. If it remains plausibly unclear, the market will resolve N/A.

Fine-tuning includes all RL training. Training on synthetic data, or additional supervised learning performed deliberately after training on a PILE-like generic dataset, counts as fine-tuning. If the nature of pre-training changes such that all SotA models do RL/instruction training/etc. during the initial imitation-learning phase, I will probably resolve this question as ambiguous. Multi-modal training on text+image will by default count as pre-training.
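As a rough illustration of the accounting implied above, here is a minimal sketch; the phase names, the classification, and every FLOP figure are hypothetical placeholders, not claims about any real model.

```python
# Hypothetical sketch of the resolution check: tally FLOPs by training phase
# under the market's definitions (RL, synthetic-data training, and SFT done
# after generic pre-training count as fine-tuning; text+image training counts
# as pre-training). Phase names and numbers are illustrative placeholders.

PRETRAIN_PHASES = {"generic_text_pretraining", "multimodal_text_image"}
FINETUNE_PHASES = {"rl_training", "synthetic_data_training", "post_pretrain_sft"}


def resolves_yes(flops_by_phase: dict) -> bool:
    """True if fine-tuning FLOPs exceed pre-training FLOPs."""
    pretrain = sum(v for k, v in flops_by_phase.items() if k in PRETRAIN_PHASES)
    finetune = sum(v for k, v in flops_by_phase.items() if k in FINETUNE_PHASES)
    return finetune > pretrain


# Made-up example numbers only:
example = {
    "generic_text_pretraining": 2e25,
    "multimodal_text_image": 3e24,
    "rl_training": 1e25,
    "post_pretrain_sft": 5e23,
}
print(resolves_yes(example))  # False for these placeholder figures
```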


Couldn't find a particular FLOP or token count summarizing this (hence the back-of-the-envelope calculation), but I might have missed something.
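A minimal sketch of that kind of back-of-the-envelope estimate, using the common ≈6·N·D approximation for dense-transformer training FLOPs (N parameters, D tokens seen); the parameter and token counts below are hypothetical placeholders, not figures from this market.

```python
# Back-of-the-envelope estimate with the standard ~6 * N * D approximation
# for dense-transformer training compute (N = parameters, D = tokens seen).
# All concrete numbers are hypothetical placeholders.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Hypothetical model: 1e12 parameters pre-trained on 1e13 tokens.
pretrain_flops = training_flops(params=1e12, tokens=1e13)  # ~6e25 FLOPs

# To out-spend pre-training, fine-tuning (RL rollouts, synthetic data, SFT)
# would need the model to process roughly as many tokens again.
finetune_tokens_to_match = pretrain_flops / (6.0 * 1e12)

print(f"pre-training compute ~ {pretrain_flops:.1e} FLOPs")
print(f"fine-tuning tokens needed to match ~ {finetune_tokens_to_match:.1e}")
```

The 6·N·D rule is only a heuristic, and RL compute accounting (rollout generation vs. gradient updates) is messier, so this gives an order-of-magnitude picture at best.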

predicts YES

For reference, the current estimate of the most expensive training run before 2026 is about 10x the price of GPT-4. https://www.metaculus.com/questions/17418/most-expensive-ai-training-run-by-year/

Meanwhile, information on the supply of text data is available here: https://epochai.org/trends#data-trends-section

Added some detail to clarify: "Fine-tuning includes all RL training. Training on synthetic data, or additional supervised learning which is deliberately trained on separately from a PILE-like generic dataset, counts as fine-tuning. If the nature of pre-training changes such that all SotA models do RL/instruction training/etc. during the initial imitation learning phase, I will probably resolve this question as ambiguous."

This does not really make sense to me, given that the purpose of pre-training is bulk knowledge learning, while fine-tuning instils certain behaviours. In fact, very little SFT data is required to get the desired behaviour; papers such as LIMA have shown this.

Yann LeCun has a famous cake analogy that describes this.

