"Cloud" is a boring answer. User base of interest is somewhere between hobbyists with a budget and companies with a couple of self-hosted racks.
News:
* people eyeballing Mac Studio Thunderbolt clusters on X
* yet more progress on clustering in llama.cpp
My bet is local inference on Apple CPU/GPU (by whatever name it ends up being called).
And since this will still be expensive, the rest will run in a datacenter on server-class GPUs/inference chips (not sure what those will look like yet).
* Apple will find a way to compress/store weights in firmware/flash such that you can work with, say, 64 GB of RAM.
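
Rough napkin math on why that bullet hinges on compression plus streaming from flash rather than quantization alone (my own numbers, weights only, ignoring KV cache and activations):

```python
# Napkin math: can 405B weights sit resident in 64 GB of unified memory?
# Assumptions (mine, not from the thread): weights only, no KV cache
# or activations, decimal GB.
PARAMS = 405e9
RAM_GB = 64

for bits in (16, 8, 4, 2):
    gb = PARAMS * bits / 8 / 1e9
    verdict = "fits" if gb <= RAM_GB else "needs streaming/offload"
    print(f"{bits:>2}-bit: ~{gb:.0f} GB -> {verdict}")
```

Even 2-bit lands around ~100 GB, so the bet only works if most of the weights live in flash and get paged in on demand.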
@MingweiSamuel Whatever looks Pareto-dominant based on vibes from Twitter, /g/, and r/LocalLLaMa. For example, the current 70B meta looks like multiple gaming GPUs or Apple unified memory, with the very occasional DIY-adaptered A100 frankenrack.
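
Same napkin math for the 70B meta (my numbers, weights only), which is roughly why the answer today is gaming GPUs or unified memory:

```python
# Weights-only footprint for a 70B model vs. common local boxes.
# Assumptions (mine): decimal GB, ignores KV cache and activations.
PARAMS = 70e9
BOXES = {
    "2x 24 GB gaming GPUs": 48,
    "64 GB Mac unified memory": 64,
    "128 GB Mac unified memory": 128,
    "1x 80 GB A100": 80,
}

for bits in (4, 8, 16):
    need = PARAMS * bits / 8 / 1e9
    fits = [name for name, cap in BOXES.items() if need <= cap]
    print(f"{bits}-bit: ~{need:.0f} GB -> fits: {', '.join(fits) or 'none of the above'}")
```

At 4-bit everything on the list works; at 8-bit you already need big unified memory or datacenter cards, which matches the observed meta.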
If the community doesn't settle on viable 405B solutions by EoY, everything gets a NO.