Meta/Mixture of Experts

Llama 4 Scout 17B-16E

chatcodingmultilingualvisionDistilled
109B
Parameters (17B active)
512K
Context length
16
Benchmarks
4
Quantizations
200K
HF downloads
Architecture
MoE
Released
2025-04-05
Layers
48
KV Heads
8
Head Dim
128
Family
llama

Model Information

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.

These Llama 4 models mark the beginning of a new era for the Llama ecosystem. We are launching two efficient models in the Llama 4 series, Llama 4 Scout, a 17 billion parameter model with 16 experts, and Llama 4 Maverick, a 17 billion parameter model with 128 experts.

Model developer: Meta

Model Architecture: The Llama 4 models are auto-regressive language models that use a mixture-of-experts (MoE) architecture and incorporate early fusion for native multimodality.

<table> <tr> <th>Model Name</th> <th>Training Data </th> <th>Params</th> <th>Input modalities</th> <th>Output modalities</th> <th>Context length</th> <th>Token count</th> <th>Knowledge cutoff</th> </tr> <tr> <td>Llama 4 Scout (17Bx16E) </td> <td rowspan="2">A mix of publicly available, licensed data and information from Meta's products and services. This includes publicly shared posts from Instagram and Facebook and people's interactions with Meta AI. Learn more in our <a href="https://www.facebook.com/privacy/guide/genai/">Privacy Center</a>. </td> <td>17B (Activated) 109B (Total) </td> <td>Multilingual text and image</td> <td>Multilingual text and code</td> <td>10M</td> <td>~40T</td> <td>August 2024</td> </tr> <tr> <td>Llama 4 Maverick (17Bx128E)</td> <td>17B (Activated) 400B (Total) </td> <td>Multilingual text and image</td> <td>Multilingual text and code</td> <td>1M</td> <td>~22T</td> <td>August 2024</td> </tr> </table>

Supported languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.

Model Release Date: April 5, 2025

Status: This is a static model trained on an offline dataset. Future versions of the tuned models may be released as we improve model behavior with community feedback.

License: A custom commercial license, the Llama 4 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE

Where to send questions or comments about the model: Instructions on how to provide feedback or comments on the model can be found in the Llama README. For more technical information about generation parameters and recipes for how to use Llama 4 in applications, please go here.

Intended Use

Intended Use Cases: Llama 4 is intended for commercial and research use in multiple languages. Instruction tuned models are intended for assistant-like chat and visual reasoning tasks, whereas pretrained models can be adapted for natural language generation. For vision, Llama 4 models are also optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The Llama 4 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 4 Community License allows for these use cases.

Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 4 Community License. Use in languages or capabilities beyond those explicitly referenced as supported in this model card**.

**Note:

1. Llama 4 has been trained on a broader collection of languages than the 12 supported languages (pre-training includes 200 total languages). Developers may fine-tune Llama 4 models for languages beyond the 12 supported languages provided they comply with the Llama 4 Community License and the Acceptable Use Policy. Developers are responsible for ensuring that their use of Llama 4 in additional languages is done in a safe and responsible manner.

2. Llama 4 has been tested for image understanding up to 5 input images. If leveraging additional image understanding capabilities beyond this, Developers are responsible for ensuring that their deployments are mitigated for risks and should perform additional testing and tuning tailored to their specific applications.

How to use with transformers

Please, make sure you have transformers v4.51.0 installed, or upgrade using pip install -U transformers.

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])

Hardware and Software

Training Factors: We used custom training libraries, Meta's custom built GPU clusters, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.

Training Energy Use: Model pre-training utilized a cumulative of 7.38M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 1,999 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with clean and renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

Model NameTraining Time (GPU hours)Training Power Consumption (W)Training Location-Based Greenhouse Gas Emissions (tons CO2eq)Training Market-Based Greenhouse Gas Emissions (tons CO2eq)
Llama 4 Scout5.0M7001,3540
Llama 4 Maverick2.38M7006450
Total7.38M-1,9990

The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Training Data

...

Quantizations & VRAM

Q4_K_M4.5 bpw
61.8 GB
VRAM required
94%
Quality
Q6_K6.5 bpw
89.0 GB
VRAM required
97%
Quality
Q8_08 bpw
109.5 GB
VRAM required
100%
Quality
FP1616 bpw
218.5 GB
VRAM required
100%
Quality

Benchmarks (16)

Arena Elo1491
MATH-50084.4
MMLU-PRO74.3
MMMU73.4
GPQA Diamond58.7
GPQA57.2
IFEval54.8
BBH51.4
LiveCodeBench29.9
MATH21.8
MUSR20.8
AIME14.0
AA Math14.0
AA Intelligence13.5
AA Coding6.7
HLE4.3

Run with Ollama

$ollama run llama4-scout:17b-16e

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

Apple M1 Ultra (64GB)
64 GB VRAM • 800 GB/s
APPLE
$2499
Apple M2 Ultra (64GB)
64 GB VRAM • 800 GB/s
APPLE
$2999
Apple M4 Max (64GB)
64 GB VRAM • 546 GB/s
APPLE
$2899
Apple M2 Max (64GB)
64 GB VRAM • 400 GB/s
APPLE
$2299
Apple M3 Max (64GB)
64 GB VRAM • 300 GB/s
APPLE
$2799
Apple M4 Pro (64GB)
64 GB VRAM • 273 GB/s
APPLE
$2599
AMD Radeon Instinct MI200
64 GB VRAM • 1640 GB/s
AMD
AMD Radeon Instinct MI210
64 GB VRAM • 1640 GB/s
AMD
NVIDIA H100 SXM5 64 GB
64 GB VRAM • 2020 GB/s
NVIDIA
NVIDIA Jetson AGX Orin 64 GB
64 GB VRAM • 205 GB/s
NVIDIA
NVIDIA Jetson T4000
64 GB VRAM • 273 GB/s
NVIDIA
NVIDIA RTX PRO 5000 72 GB Blackwell
72 GB VRAM • 1340 GB/s
NVIDIA
NVIDIA H100 SXM5 80GB
80 GB VRAM • 3350 GB/s
NVIDIA
$25000
NVIDIA H100 PCIe 80GB
80 GB VRAM • 2000 GB/s
NVIDIA
$25000
NVIDIA A100 SXM 80GB
80 GB VRAM • 2039 GB/s
NVIDIA
$10000
NVIDIA A100 PCIe 80GB
80 GB VRAM • 1935 GB/s
NVIDIA
$10000
NVIDIA A100 SXM4 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA A100 PCIe 80 GB
80 GB VRAM • 1940 GB/s
NVIDIA
NVIDIA A100X
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H100 PCIe 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H100 SXM5 80 GB
80 GB VRAM • 3360 GB/s
NVIDIA
NVIDIA H100 CNX
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA A800 PCIe 80 GB
80 GB VRAM • 1940 GB/s
NVIDIA
NVIDIA A800 SXM4 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H800 PCIe 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H800 SXM5
80 GB VRAM • 3360 GB/s
NVIDIA
NVIDIA RTX 6000D
84 GB VRAM • 1570 GB/s
NVIDIA
NVIDIA B200
90 GB VRAM • 4100 GB/s
NVIDIA
NVIDIA H100 NVL 94 GB
94 GB VRAM • 3940 GB/s
NVIDIA
NVIDIA H100 SXM5 94 GB
94 GB VRAM • 3360 GB/s
NVIDIA

Find the best GPU for Llama 4 Scout 17B-16E

Build Hardware for Llama 4 Scout 17B-16E