Ministral 8B

Mistral AI, Dense
chat, Tool Use

Context length: 128K
HF downloads: 100K
Architecture: Dense
Released: 2024-10-16
Head Dim: 128
Family: mistral

Model Card

Model Card for Ministral-8B-Instruct-2410

We introduce two new state-of-the-art models for local intelligence, on-device computing, and at-the-edge use cases. We call them les Ministraux: Ministral 3B and Ministral 8B.

The Ministral-8B-Instruct-2410 Language Model is an instruct fine-tuned model significantly outperforming existing models of similar size, released under the Mistral Research License.

If you are interested in using Ministral-3B or Ministral-8B commercially (both outperform Mistral-7B), reach out to us.

For more details about les Ministraux please refer to our release blog post.

Ministral 8B Key features

Released under the Mistral Research License; reach out to us for a commercial license
Trained with a 128k context window with interleaved sliding-window attention
Trained on a large proportion of multilingual and code data
Supports function calling
Vocabulary size of 131k, using the V3-Tekken tokenizer

Basic Instruct Template (V3-Tekken)

<s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]

For more information about the tokenizer, please refer to mistral-common.
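The layout of the template above can be illustrated with a small string builder. This is a minimal sketch, assuming a conversation given as a list of role/content dictionaries that alternates user and assistant turns; `build_v3_tekken_prompt` is a hypothetical helper name and is not part of mistral-common. For real inference the tokenizer library should be used, because the control markers are encoded as special tokens rather than as plain text.

```python
def build_v3_tekken_prompt(messages):
    """Render alternating user/assistant turns in the V3-Tekken layout.

    Illustrative only: real tokenization should go through mistral-common.
    """
    parts = ["<s>"]  # beginning-of-sequence marker appears once, at the start
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            # No spaces between the control markers and the message text
            parts.append(f"[INST]{content}[/INST]")
        elif role == "assistant":
            # Each assistant turn is closed by the end-of-sequence marker
            parts.append(f"{content}</s>")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "user message"},
    {"role": "assistant", "content": "assistant response"},
    {"role": "user", "content": "new user message"},
]
print(build_v3_tekken_prompt(conversation))
```

For the three-turn conversation shown, the printed string is identical to the template line above.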

Ministral 8B Architecture

Feature	Value
Architecture	Dense Transformer
Parameters	8,019,808,256
Layers	36
Heads	32
Dim	4096
KV Heads (GQA)	8
Hidden Dim	12288
Head Dim	128
Vocab Size	131,072
Context Length	128k
Attention Pattern	Ragged (128k,32k,32k,32k)
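The parameter count in the table can be cross-checked against the other rows. The sketch below assumes details the table does not state: a gated MLP with three projection matrices, no bias terms, two norm weight vectors per layer plus one final norm, and an output head that is not tied to the input embedding. Under those assumptions the arithmetic reproduces the listed figure exactly.

```python
# Values copied from the architecture table
VOCAB_SIZE = 131_072
DIM = 4096
HIDDEN_DIM = 12_288
LAYERS = 36
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128


def count_parameters():
    """Count weights under the assumptions stated in the lead-in."""
    # Attention: Q and output projections use all heads, K and V only the KV heads
    q_proj = DIM * HEADS * HEAD_DIM
    kv_proj = 2 * DIM * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * DIM
    attention = q_proj + kv_proj + o_proj

    # Assumed gated MLP: gate, up and down projections
    mlp = 3 * DIM * HIDDEN_DIM

    # Assumed two norm weight vectors per layer
    norms = 2 * DIM

    per_layer = attention + mlp + norms

    # Assumed untied input embedding and output head, plus a final norm
    embeddings = 2 * VOCAB_SIZE * DIM
    final_norm = DIM

    return LAYERS * per_layer + embeddings + final_norm


print(f"{count_parameters():,}")  # 8,019,808,256
```

Grouped-query attention shows up in `kv_proj`: with 8 KV heads instead of 32, the key and value projections are a quarter of the size of the query projection.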

Benchmarks

Base Models

Knowledge & Commonsense

Model	MMLU	AGIEval	Winogrande	ARC-c	TriviaQA
Mistral 7B Base	62.5	42.5	74.2	67.9	62.5
Llama 3.1 8B Base	64.7	44.4	74.6	46.0	60.2
Ministral 8B Base	<u>65.0</u>	<u>48.3</u>	<u>75.3</u>	<u>71.9</u>	<u>65.5</u>

Gemma 2 2B Base	52.4	33.8	68.7	42.6	47.8
Llama 3.2 3B Base	56.2	37.4	59.6	43.1	50.7
Ministral 3B Base	<u>60.9</u>	<u>42.1</u>	<u>72.7</u>	<u>64.2</u>	<u>56.7</u>

Code & Math

Model	HumanEval pass@1	GSM8K maj@8
Mistral 7B Base	26.8	32.0
Llama 3.1 8B Base	<u>37.8</u>	42.2
Ministral 8B Base	34.8	<u>64.5</u>

Gemma 2 2B Base	20.1	35.5
Llama 3.2 3B Base	14.6	33.5
Ministral 3B Base	<u>34.2</u>	<u>50.9</u>

Multilingual

Model	French MMLU	German MMLU	Spanish MMLU
Mistral 7B Base	50.6	49.6	51.4
Llama 3.1 8B Base	50.8	52.8	54.6
Ministral 8B Base	<u>57.5</u>	<u>57.4</u>	<u>59.6</u>

Gemma 2 2B Base	41.0	40.1	41.7
Llama 3.2 3B Base	42.3	42.2	43.1
Ministral 3B Base	<u>49.1</u>	<u>48.3</u>	<u>49.5</u>

Instruct Models

Chat/Arena (gpt-4o judge)

Model	MTBench	Arena Hard	WildBench
Mistral 7B Instruct v0.3	6.7	44.3	33.1
Llama 3.1 8B Instruct	7.5	62.4	37.0
Gemma 2 9B Instruct	7.6	68.7	<u>43.8</u>
Ministral 8B Instruct	<u>8.3</u>	<u>70.9</u>	41.3

Gemma 2 2B Instruct	7.5	51.7	32.5
Llama 3.2 3B Instruct	7.2	46.0	27.2
Ministral 3B Instruct	<u>8.1</u>	<u>64.3</u>	<u>36.3</u>

Code & Math

Model	MBPP pass@1	HumanEval pass@1	MATH maj@1
Mistral 7B Instruct v0.3	50.2	38.4	13.2
Gemma 2 9B Instruct	68.5	67.7	47.4
Llama 3.1 8B Instruct	69.7	67.1	49.3
Ministral 8B Instruct	<u>70.0</u>	<u>76.8</u>	<u>54.5</u>

Gemma 2 2B Instruct	54.5	42.7	22.8
Llama 3.2 3B Instruct	64.6	61.0	38.4
Ministral 3B Instruct	<u>67.7</u>	<u>77.4</u>	<u>51.7</u>

Function calling

Model	Internal bench
Mistral 7B Instruct v0.3	6.9
Llama 3.1 8B Instruct	N/A
Gemma 2 9B Instruct	N/A
Ministral 8B Instruct	<u>31.6</u>

Gemma 2 2B Instruct	N/A
Llama 3.2 3B Instruct	N/A
Ministral 3B Instruct	<u>28.4</u>

Usage Examples

vLLM (recommended)

We recommend using this model with the vLLM library to implement production-ready inference pipelines.

Important: vLLM is currently capped at a 32k context size because interleaved attention kernels for paged attention are not yet implemented. These kernels are being worked on, and this model card will be updated as soon as they are fully supported in vLLM. To take advantage of the full 128k context size, we recommend Mistral Inference.
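To give a feel for what the context size means for memory, the KV cache can be estimated from the architecture table. This is a rough sketch under stated assumptions: 16-bit cache entries, a single sequence, and a reading of "Ragged (128k,32k,32k,32k)" as per-layer attention spans that repeat across the 36 layers, with windowed layers keeping at most their window of tokens. `kv_cache_bytes` is a hypothetical helper, and actual usage depends on the inference engine.

```python
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: 16-bit cache entries

# Assumed reading of "Ragged (128k,32k,32k,32k)": per-layer attention spans,
# repeated across the 36 layers
PATTERN = (128 * 1024, 32 * 1024, 32 * 1024, 32 * 1024)


def kv_cache_bytes(context_tokens):
    """Estimate KV cache size for one sequence of `context_tokens` tokens."""
    # Keys and values, for every KV head, for one token in one layer
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    total = 0
    for layer in range(LAYERS):
        span = PATTERN[layer % len(PATTERN)]
        # A layer never needs to keep more tokens than its attention span
        total += min(context_tokens, span) * per_token_per_layer
    return total


for tokens in (32 * 1024, 128 * 1024):
    print(f"{tokens:>7} tokens: {kv_cache_bytes(tokens) / 2**30:.3f} GiB")
```

Under these assumptions the cache takes about 4.5 GiB at 32k tokens and about 7.9 GiB at 128k tokens, on top of the model weights.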

Installation

Make sure you install vLLM >= v0.6.4:

pip install --upgrade vllm

Also make sure you have mistral_common >= 1.4.4 installed:

pip install --upgrade mistral_common

You can also use a ready-to-go Docker image.

Offline

from vllm import LLM
from vllm.sampling_params import SamplingParams

model_name = "mistralai/Ministral-8B-Instruct-2410"

sampling_params = SamplingParams(max_tokens=8192)

# Note that running Ministral 8B on a single GPU requires 24 GB of GPU RAM.
# To split the GPU requirement over multiple devices, add e.g. `tensor_parallel_size=2`.
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")

prompt = "Do we need to think for 10 seconds to find the answer of 1 + 1?"

messages = [
    {
        "role": "user",
        "content": prompt
    },
]

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
# You don't need to think for 10 seconds to find the answer to 1 + 1. The answer is 2,
# and you can easily add these two numbers in your mind very quickly without any delay.

Server

You can also use Ministral-8B in a server/client setting.

Spin up a server:

vllm serve mistralai/Ministral-8B-Instruct-2410 --tokenizer_mode mistral --config_format mistral --load_format mistral
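Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on `localhost` at vLLM's default port 8000; `build_chat_request` and `chat` are hypothetical helper names, and the `max_tokens` value is an arbitrary choice for illustration.

```python
import json
import urllib.request

# Assumption: the server runs locally on vLLM's default port
BASE_URL = "http://localhost:8000"
MODEL = "mistralai/Ministral-8B-Instruct-2410"


def build_chat_request(prompt, max_tokens=256):
    """Build (without sending) a request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("Do we need to think for 10 seconds to find the answer of 1 + 1?")` returns the reply as a string. Building the request separately from sending it makes the payload easy to inspect without a running server.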

...

Quantizations & VRAM

Quantization	Bits per weight	VRAM required	Quality
Q4_K_M	4.5 bpw	5.0 GB	94%
Q6_K	6.5 bpw	7.0 GB	97%
Q8_0	8 bpw	8.5 GB	100%
FP16	16 bpw	16.5 GB	100%
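The VRAM figures listed above are consistent with a simple estimate: parameter count times bits per weight, plus a small fixed overhead. The sketch below is a reconstruction under assumptions, not a documented formula: it takes the parameter count from the architecture table, uses GB as 10**9 bytes, and assumes a 0.5 GB overhead inferred from the listed numbers. It does not account for the KV cache, which grows with context length.

```python
PARAMETERS = 8_019_808_256  # from the architecture table

# Assumption: fixed overhead inferred from the listed figures, not a documented formula
OVERHEAD_GB = 0.5

BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q6_K": 6.5, "Q8_0": 8.0, "FP16": 16.0}


def estimate_vram_gb(bits_per_weight):
    """Weight storage in GB (10**9 bytes) plus the assumed fixed overhead."""
    weight_bytes = PARAMETERS * bits_per_weight / 8
    return weight_bytes / 1e9 + OVERHEAD_GB


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: {estimate_vram_gb(bpw):.1f} GB")
```

Rounded to one decimal place, the four estimates come out to 5.0, 7.0, 8.5 and 16.5 GB, matching the listed values.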

Benchmarks (8)

Benchmark	Score
IFEval	68.0
HumanEval	62.0
BBH	60.0
MMLU-PRO	42.0
BigCodeBench	19.5
GPQA	10.4
MATH	6.9
MUSR	5.6


GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.