Meta/Dense

Llama-3.2-11B-Vision

chat, vision, Thinking, Distilled

Parameters: 10.7B
Context length: 128K

Architecture: Dense
Released: 2024-09-25
Layers: (not listed)
KV Heads: (not listed)
Head Dim: 128
Family: llama

Model Card

Model Information

The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Model Developer: Meta

Model Architecture: Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
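To make the adapter idea concrete, here is a toy, single-head cross-attention step in plain Python. It is only a sketch of the general mechanism, not Meta's implementation: the function names are hypothetical, there are no learned projections or gating, and the residual addition is a simplifying assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_states):
    """Let every text position attend over all image positions.

    text_states:  list of vectors (queries), one per text token
    image_states: list of vectors (keys and values), one per image patch
    Assumption: identity projections and a plain residual connection.
    """
    dim = len(text_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in text_states:
        # Scaled dot-product score between this text token and each patch.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in image_states]
        weights = softmax(scores)
        # Weighted average of the image vectors.
        mixed = [sum(w * value[j] for w, value in zip(weights, image_states))
                 for j in range(dim)]
        # Add the image information back onto the text representation.
        outputs.append([q + m for q, m in zip(query, mixed)])
    return outputs
```

The point of the design is that the language model's own weights can stay as they are, while the extra layers give each text position a way to read from the image encoder's output.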

	Training Data	Params	Input modalities	Output modalities	Context length	GQA	Data volume	Knowledge cutoff
Llama 3.2-Vision	(Image, text) pairs	11B (10.6)	Text + Image	Text	128k	Yes	6B (image, text) pairs	December 2023
Llama 3.2-Vision	(Image, text) pairs	90B (88.8)	Text + Image	Text	128k	Yes	6B (image, text) pairs	December 2023

Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note that for image+text applications, English is the only supported language.
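An application that routes requests by language could encode this rule as a small lookup. The sketch below is illustrative only; the constant and function names are hypothetical and not part of any Llama tooling.

```python
# Officially supported languages as stated in the model card.
TEXT_ONLY_LANGUAGES = {
    "English", "German", "French", "Italian",
    "Portuguese", "Hindi", "Spanish", "Thai",
}
IMAGE_TEXT_LANGUAGES = {"English"}

def is_officially_supported(language, has_image):
    """Return True if the language is officially supported for this request type."""
    allowed = IMAGE_TEXT_LANGUAGES if has_image else TEXT_ONLY_LANGUAGES
    return language in allowed
```

For example, `is_officially_supported("German", has_image=False)` is true, while the same language with `has_image=True` is not.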

Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
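The inference benefit of GQA shows up mainly in the size of the key/value cache, which scales with the number of KV heads rather than the number of query heads. The sketch below works through that arithmetic. The head dimension (128) and the 128K context come from this page; the layer count, the head counts, and the reading of 128K as 128 × 1024 tokens are assumptions chosen purely for illustration, since the page does not show values for Layers and KV Heads. Caches for the cross-attention layers are ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the self-attention KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

HEAD_DIM = 128            # from this page
CONTEXT = 128 * 1024      # assumption: "128K" read as 131072 tokens
ASSUMED_LAYERS = 32       # hypothetical, not stated on this page
ASSUMED_QUERY_HEADS = 32  # hypothetical
ASSUMED_KV_HEADS = 8      # hypothetical

gqa = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_KV_HEADS, HEAD_DIM, CONTEXT)
mha = kv_cache_bytes(ASSUMED_LAYERS, ASSUMED_QUERY_HEADS, HEAD_DIM, CONTEXT)
print(f"GQA cache: {gqa / 1024**3:.0f} GiB, full multi-head cache: {mha / 1024**3:.0f} GiB")
```

Under these assumed values the cache shrinks by the ratio of query heads to KV heads, here a factor of 4.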

Model Release Date: Sept 25, 2024

Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.

License: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).

Feedback: Instructions on how to provide feedback or comments on the model can be found in the model README. For more technical information about generation parameters and recipes for how to use Llama 3.2-Vision in applications, please go here.

Intended Use

Intended Use Cases: Llama 3.2-Vision is intended for commercial and research use. Instruction tuned models are intended for visual recognition, image reasoning, captioning, and assistant-like chat with images, whereas pretrained models can be adapted for a variety of image reasoning tasks. Additionally, because of Llama 3.2-Vision’s ability to take images and text as inputs, additional use cases could include:

Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it.
Document Visual Question Answering (DocVQA): Imagine a computer understanding both the text and layout of a document, like a map or contract, and then answering questions about it directly from the image.
Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story.
Image-Text Retrieval: Image-text retrieval is like a matchmaker for images and their descriptions. Similar to a search engine but one that understands both pictures and words.
Visual Grounding: Visual grounding is like connecting the dots between what we see and say. It’s about understanding how language references specific parts of an image, allowing AI models to pinpoint objects or regions based on natural language descriptions.

The Llama 3.2 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.2 Community License allows for these use cases.

Out of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.

How to use

This repository contains two versions of Llama-3.2-11B-Vision-Instruct, for use with transformers and with the original llama codebase.

Use with transformers

With transformers >= 4.45.0, you can run inference using conversational messages that may include an image you want to ask about.

Make sure to update your transformers installation via pip install --upgrade transformers.

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
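Note that `generate` returns the prompt tokens followed by the newly generated ones, so the printed text above includes the chat template and the question. A minimal sketch of trimming the echoed prompt is shown below; the helper name is hypothetical, and the usage lines in the comments assume the `inputs`, `output`, and `processor` variables from the example above.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt from one generated sequence.

    Works on anything that supports slicing, such as a list of
    token ids or a 1-D tensor.
    """
    return output_ids[prompt_length:]

# Usage with the example above (not run here, needs the loaded model):
#   prompt_length = inputs["input_ids"].shape[-1]
#   completion = new_tokens_only(output[0], prompt_length)
#   print(processor.decode(completion, skip_special_tokens=True))
```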

Use with `llama`

Please follow the instructions in the repository.

To download the original checkpoints, you can use huggingface-cli as follows:

huggingface-cli download meta-llama/Llama-3.2-11B-Vision-Instruct --include "original/*" --local-dir Llama-3.2-11B-Vision-Instruct

Hardware and Software

Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.

Training Energy Use: Training used a cumulative 2.02M GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model, and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 584 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.
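As a rough sanity check on these figures, the GPU hours and TDP can be combined into an energy estimate, and the location-based emissions divided by it to get an implied carbon intensity. This is a back-of-the-envelope sketch only: it assumes every GPU hour ran at the full 700 W and ignores the power usage efficiency adjustment mentioned above, so the derived numbers are illustrative and are not reported by Meta.

```python
GPU_HOURS = 2.02e6          # cumulative GPU hours, from the text
TDP_WATTS = 700             # H100-80GB TDP, from the text
LOCATION_BASED_TONS = 584   # tons CO2eq, from the text

def training_energy_mwh(gpu_hours, watts_per_gpu):
    """Energy in MWh, assuming each GPU draws its full TDP for every hour."""
    return gpu_hours * watts_per_gpu / 1e6

energy_mwh = training_energy_mwh(GPU_HOURS, TDP_WATTS)
implied_intensity = LOCATION_BASED_TONS / energy_mwh  # tons CO2eq per MWh

print(f"Energy: about {energy_mwh:.0f} MWh")
print(f"Implied intensity: about {implied_intensity:.2f} t CO2eq per MWh")
```

Under these assumptions the estimate comes to roughly 1,414 MWh and about 0.41 tons CO2eq per MWh.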

...

Quantizations & VRAM

Quantization	Bits per weight	VRAM required	Quality
Q4_K_M	4.5 bpw	6.5 GB	94%
Q6_K	6.5 bpw	9.2 GB	97%
Q8_0	8 bpw	11.2 GB	100%
FP16	16 bpw	21.9 GB	100%
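The VRAM figures listed above are consistent with a simple estimate: parameter count times bits per weight, divided by 8 to get bytes, plus a fixed allowance. The sketch below reproduces them. The 0.5 GB overhead is an assumption inferred from the listed numbers, not something the page states, and 1 GB is taken as 10^9 bytes.

```python
PARAMS_BILLIONS = 10.7        # parameter count shown on this page
ASSUMED_OVERHEAD_GB = 0.5     # assumption: inferred from the listed figures

def estimate_vram_gb(params_billions, bits_per_weight,
                     overhead_gb=ASSUMED_OVERHEAD_GB):
    """Weights-only footprint in GB plus a fixed overhead allowance."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8), ("FP16", 16)]:
    print(f"{name}: {estimate_vram_gb(PARAMS_BILLIONS, bpw):.1f} GB")
```

Real memory use also depends on context length, batch size, and the runtime, so treat the result as a lower bound for short prompts rather than a guarantee.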

Benchmarks (6)

Benchmark	Score
IFEval	75.5
MMLU-PRO	28.0
BBH	26.9
MATH	20.2
MUSR	6.6
GPQA	5.5

Run with Ollama

$ ollama run llama3.2:10.7b

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.
