This resolves YES if: There is a contiguous 12-month period during which no new language model is credibly known to exist that sets a new record for total compute used across the entire training process, start to finish.
The definition of LM is intended to include models that use other modalities or do other things in addition to language (images, RL, etc.).
This specifically does not take into account algorithmic innovations: a 10x effective-compute improvement from better algorithms or utilization does not count as a 10x increase in compute usage. This includes low-level optimizations and innovations that use lower precision (i.e., I consider 2 FP16 FLOPs equivalent to 1 FP32 FLOP).
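Under that convention, FLOPs at different precisions can be normalized to FP32-equivalents before comparing totals. A minimal sketch of the accounting (the function name and rate table are illustrative, not part of the market description):

```python
# Convert raw training FLOPs to FP32-equivalents, assuming the stated
# convention that 2 FP16 FLOPs count as 1 FP32 FLOP.
FP32_EQUIVALENT_RATE = {"fp32": 1.0, "fp16": 0.5}

def fp32_equivalent_flops(flops_by_precision):
    """flops_by_precision: dict mapping precision name -> raw FLOP count."""
    return sum(count * FP32_EQUIVALENT_RATE[precision]
               for precision, count in flops_by_precision.items())

# Example: 1e24 raw FP16 FLOPs count the same as 5e23 FP32 FLOPs.
print(fp32_equivalent_flops({"fp16": 1e24}))  # 5e+23
```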
This market is conditional on it being generally understood that SOTA LMs are still being publicized, and on their compute usage being at least roughly estimatable (except for, e.g., military models). Compute usage doesn't have to be exact or official as long as it is credibly estimatable from public information (e.g., power consumption, financial reports, satellite imagery of datacenters). This market resolves N/A if compute numbers stop being estimatable in such a way that it becomes controversial whether models are continually using more compute.
A fine-tune of an existing LM counts as the compute usage of the base LM plus the fine-tuning compute. However, to qualify for this market it must use at least 50% new compute relative to the last LM that qualified for this market; this is intended to exclude a SOTA LM that is continually fine-tuned on new data with trivial amounts of compute from technically setting new records on this market.
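Under one reading of this rule, a fine-tune's eligibility could be checked as follows (a sketch; the function names and the interpretation of "50% new compute" as 50% of the last qualifying model's total are assumptions, not part of the market text):

```python
def finetune_total_compute(base_flops, finetune_flops):
    """A fine-tune counts the base model's compute plus the fine-tuning compute."""
    return base_flops + finetune_flops

def qualifies_for_new_record(base_flops, finetune_flops, last_qualifying_flops):
    """Assumed reading: the model must (a) set a new total-compute record and
    (b) use new compute of at least 50% of the last qualifying model's total."""
    total = finetune_total_compute(base_flops, finetune_flops)
    sets_record = total > last_qualifying_flops
    enough_new_compute = finetune_flops >= 0.5 * last_qualifying_flops
    return sets_record and enough_new_compute

# A trivial continual fine-tune (tiny new compute) does not qualify:
print(qualifies_for_new_record(1e24, 1e21, last_qualifying_flops=1e24))  # False
# A fine-tune adding 60% of the last record's compute does:
print(qualifies_for_new_record(1e24, 6e23, last_qualifying_flops=1e24))  # True
```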
As a sanity check, the new LM should not be substantially worse than previous compute-SOTA models on most major benchmarks where the models are fairly comparable. This is intended to exclude models trained with much less efficient techniques or poorly chosen hyperparameters that waste large amounts of the compute.
Using a point estimate for Minerva, Metaculus gives ~80% probability that GPT-4 used more compute than Minerva: https://www.metaculus.com/questions/13787/petaflops-during-gpt-4-training/
@JacobPfau Is there an estimate with more specific details on how that number was arrived at? As presented, I think this is insufficient to meet the "credibly estimatable" bar. (I will probably keep the market open for a few months for estimates to trickle in; i.e., I will not resolve N/A immediately after 12 months have passed since PaLM.)
@LeoGao Even if there were credible estimates, without knowing the training data we don't know whether the models were trained on the benchmarks, making it impossible to validate the last bullet point.
@dmayhem93 The intent of this market is not to hinge on the particular technicality of whether exact/official numbers are released. In the event that the resolution hinges on the nonexistence of public estimates, this market is to resolve N/A, not YES.
This market resolves N/A if compute numbers stop being estimatable in such a way that it becomes controversial as to whether models are continually using more compute.
@LeoGao FYI, I interpreted this as an assertion about industry trends as a whole and not about a single model.
@L Also worth clarifying that this market is conditional on the world not having ended or transitioned into a completely unrecognizable galaxy-brain post-scarcity waterfowl-based society by then. Also, this doesn't specify exponential growth: even if models only get 10% more compute-hungry each year, that still counts as long as people keep releasing them frequently.
(Also, it might be worth clarifying what I mean by "conditional on the world not having ended": I mean looking only at the time period before the world ends. So if there is a new SOTA every 5 months until the world ends in 2027, this would resolve NO rather than N/A. However, it's not clear this matters, because nobody will be around to resolve it either way if that happens...)
@JacobPfau Figure 3 here suggests there was a ~28-month plateau following AlphaGo Zero.