Will Llama 3-multimodal be natively mixed-multimodal? (VQ-VAE+next token prediction)
Ṁ200Jan 1
50% chance
Vision-language models currently follow two common paradigms.
The first is the LLaVA approach, which connects a CLIP-like vision encoder to an LLM through a projection layer.
The second is the Gemini/LVM approach, which uses a VQ-VAE to compress images into discrete tokens and then simply does autoregressive next-token prediction over them. GPT-4o is suspected to be trained this way as well, which would explain why it can generate images with excellent text rendering.
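To make the difference concrete, here is a minimal sketch of the two input paths. It is only an illustration under stated assumptions, not how LLaVA, Gemini, LVM, or any Llama model is actually implemented: the function names (`project`, `quantize`, `to_mixed_sequence`), the tiny hand-picked matrices, and the vocabulary size are all hypothetical, and a real VQ-VAE learns its encoder and codebook instead of using fixed values.

```python
# Toy illustration only: all names, shapes, and numbers are assumptions.

def project(patch_features, weight):
    """First paradigm (LLaVA-style): linearly map each vision feature
    into the LLM's embedding space. The result stays continuous."""
    return [
        [sum(f * w for f, w in zip(feat, col)) for col in zip(*weight)]
        for feat in patch_features
    ]

def quantize(patch_vectors, codebook):
    """Second paradigm, quantization step: replace each patch vector with
    the index of its nearest codebook entry (squared Euclidean distance)."""
    ids = []
    for vec in patch_vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids

def to_mixed_sequence(text_ids, image_ids, text_vocab_size):
    """Shift image code indices past the text vocabulary so text and image
    tokens share one discrete token space for next-token prediction."""
    return text_ids + [text_vocab_size + i for i in image_ids]

# Paradigm 1: two 2-d vision features projected to 3-d "LLM" embeddings.
projected = project([[1.0, 0.0], [0.0, 2.0]],
                    [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])

# Paradigm 2: three patches mapped to discrete codes, then merged with text ids.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
image_ids = quantize([[0.9, 0.1], [0.1, 1.8], [0.0, 0.1]], codebook)
sequence = to_mixed_sequence([5, 7], image_ids, text_vocab_size=10)
print(projected)
print(image_ids, sequence)
```

In the first path the LLM receives continuous vectors alongside its text embeddings, so it can read images but has no image tokens to emit. In the second path images become ordinary token ids, so the same next-token objective covers both reading and generating them.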
Note that Meta has just announced Chameleon: Mixed-Modal Early-Fusion Foundation Models.
Will Llama 3 multi-modal or Llama 3 vision be trained using the second approach?
This question is managed and resolved by Manifold.
Related questions
Will Llama 4 use mixture of experts?
64% chance
Top 3 Multimodal Vision2Language Model by EOY 2024? (by Organization/Company)
By 2030 will we have video-to-video where an LLM can continue any video prompt in any way you like?
76% chance
Will a Mamba 7b model trained on 2 trillion tokens outperform Llama2-13B
66% chance
Will OpenAI announce a multi-modal AI capable of any input-output modality combination by end of 2025? ($1000M subsidy)
85% chance
Will Llama 4 be the best LLM in the chatbot arena?
16% chance
Will Llama-3 (or next open Meta model) be obviously good in its first-order effects on the world?
88% chance
Will OpenAI's next major LLM release support video input?
55% chance
Will a SOTA open-sourced LLM forecasting system make major use of quasilinguistic neural reps (QNRs) before 2027?
19% chance