Ministral 8B
Model Card for Ministral-8B-Instruct-2410
We introduce two new state-of-the-art models for local intelligence, on-device computing, and at-the-edge use cases. We call them les Ministraux: Ministral 3B and Ministral 8B.
The Ministral-8B-Instruct-2410 Language Model is an instruct fine-tuned model significantly outperforming existing models of similar size, released under the Mistral Research License.
If you are interested in using Ministral-3B or Ministral-8B commercially, both of which outperform Mistral-7B, reach out to us.
For more details about les Ministraux please refer to our release blog post.
Ministral 8B Key features
- Released under the Mistral Research License; reach out to us for a commercial license
- Trained with a 128k context window with interleaved sliding-window attention
- Trained on a large proportion of multilingual and code data
- Supports function calling
- Vocabulary size of 131k, using the V3-Tekken tokenizer
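To make the sliding-window attention feature above more concrete, here is a minimal sketch of a causal sliding-window mask. It is illustrative only: `attention_mask` and `layer_windows` are hypothetical helpers, the window sizes are toy values, and the idea that "interleaved" means layers cycle through a fixed schedule of window sizes is an assumption rather than something this card spells out.

```python
def attention_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask.

    mask[i][j] is True when query position i may attend to key position j:
    j must not be in the future (j <= i) and must lie within the last
    `window` positions (i - j < window).
    """
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_windows(num_layers: int, pattern: tuple[int, ...]) -> list[int]:
    """Assumed interleaving: layer i uses pattern[i % len(pattern)]."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]


if __name__ == "__main__":
    # Toy sizes so the mask is easy to read; real windows are far larger.
    for row in attention_mask(seq_len=5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
    print(layer_windows(6, pattern=(8, 2, 2, 2)))
```

With a window of 2, each position sees only itself and the position immediately before it, which is why layers with a short window need far less key/value history than layers with a full-length window.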
Basic Instruct Template (V3-Tekken)
<s>[INST]user message[/INST]assistant response</s>[INST]new user message[/INST]
For more information about the tokenizer, please refer to mistral-common.
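As a rough illustration of the template above, the following sketch renders a list of chat messages into that string layout. `render_v3_tekken` is a hypothetical helper written for this example, not part of mistral-common. In practice the control tokens are special tokens handled by the tokenizer, so treat this as a way to read the template rather than a replacement for the real tokenizer.

```python
BOS, EOS = "<s>", "</s>"
INST_OPEN, INST_CLOSE = "[INST]", "[/INST]"


def render_v3_tekken(messages: list[dict]) -> str:
    """Render user/assistant turns following the basic instruct template.

    Hypothetical helper for illustration; only 'user' and 'assistant'
    roles are handled because the template above shows only those.
    """
    parts = [BOS]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append(f"{INST_OPEN}{content}{INST_CLOSE}")
        elif role == "assistant":
            parts.append(f"{content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    return "".join(parts)


if __name__ == "__main__":
    conversation = [
        {"role": "user", "content": "user message"},
        {"role": "assistant", "content": "assistant response"},
        {"role": "user", "content": "new user message"},
    ]
    print(render_v3_tekken(conversation))
```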
Ministral 8B Architecture
| Feature | Value |
|---|---|
| Architecture | Dense Transformer |
| Parameters | 8,019,808,256 |
| Layers | 36 |
| Heads | 32 |
| Dim | 4096 |
| KV Heads (GQA) | 8 |
| Hidden Dim | 12288 |
| Head Dim | 128 |
| Vocab Size | 131,072 |
| Context Length | 128k |
| Attention Pattern | Ragged (128k,32k,32k,32k) |
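The parameter count in the table can be reproduced from the other rows under a few assumptions that the table itself does not state: a gated MLP with three weight matrices per layer, no bias terms, two normalization weight vectors per layer plus a final one, and separate (untied) input embedding and output projection matrices. The sketch below is a sanity check under those assumptions, not a description of the actual implementation.

```python
# Values taken from the architecture table.
DIM = 4096
LAYERS = 36
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
HIDDEN_DIM = 12288
VOCAB = 131072


def count_parameters() -> int:
    """Count weights under the assumptions named in the lead-in."""
    embedding = VOCAB * DIM            # token embedding matrix
    output_head = VOCAB * DIM          # assumption: not tied to the embedding

    attention = DIM * HEADS * HEAD_DIM             # query projection
    attention += 2 * DIM * KV_HEADS * HEAD_DIM     # key and value projections (GQA)
    attention += HEADS * HEAD_DIM * DIM            # output projection

    mlp = 3 * DIM * HIDDEN_DIM         # assumption: gate, up and down projections
    norms = 2 * DIM                    # assumption: two norm vectors per layer

    per_layer = attention + mlp + norms
    final_norm = DIM
    return embedding + output_head + LAYERS * per_layer + final_norm


if __name__ == "__main__":
    print(f"{count_parameters():,}")  # 8,019,808,256
```

Under these assumptions the total matches the table's 8,019,808,256 exactly, with roughly 6.9B parameters in the 36 transformer layers and about 1.07B in the two vocabulary-sized matrices.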
Benchmarks
Base Models
<u>Knowledge & Commonsense</u>
| Model | MMLU | AGIEval | Winogrande | Arc-c | TriviaQA |
|---|---|---|---|---|---|
| Mistral 7B Base | 62.5 | 42.5 | 74.2 | 67.9 | 62.5 |
| Llama 3.1 8B Base | 64.7 | 44.4 | 74.6 | 46.0 | 60.2 |
| Ministral 8B Base | <u>65.0</u> | <u>48.3</u> | <u>75.3</u> | <u>71.9</u> | <u>65.5</u> |
| Gemma 2 2B Base | 52.4 | 33.8 | 68.7 | 42.6 | 47.8 |
| Llama 3.2 3B Base | 56.2 | 37.4 | 59.6 | 43.1 | 50.7 |
| Ministral 3B Base | <u>60.9</u> | <u>42.1</u> | <u>72.7</u> | <u>64.2</u> | <u>56.7</u> |
<u>Code & Math</u>
| Model | HumanEval pass@1 | GSM8K maj@8 |
|---|---|---|
| Mistral 7B Base | 26.8 | 32.0 |
| Llama 3.1 8B Base | <u>37.8</u> | 42.2 |
| Ministral 8B Base | 34.8 | <u>64.5</u> |
| Gemma 2 2B Base | 20.1 | 35.5 |
| Llama 3.2 3B Base | 14.6 | 33.5 |
| Ministral 3B Base | <u>34.2</u> | <u>50.9</u> |
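For readers unfamiliar with the column headers, the sketch below shows the commonly used definitions of pass@k and maj@k. This card does not describe its evaluation harness, so these functions are an assumption about what the metrics mean in general, not a reproduction of how the scores above were computed.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_answer(answers: list[str]) -> str:
    """maj@k: the most frequent final answer among k sampled answers."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # With k=1, pass@k reduces to the fraction of passing samples.
    print(pass_at_k(n=10, c=3, k=1))
    # Eight sampled answers to one math problem; the majority vote is graded.
    print(majority_answer(["18", "18", "20", "18", "17", "18", "20", "18"]))
```

A benchmark score is then the average of the per-problem result over all problems.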
<u>Multilingual</u>
| Model | French MMLU | German MMLU | Spanish MMLU |
|---|---|---|---|
| Mistral 7B Base | 50.6 | 49.6 | 51.4 |
| Llama 3.1 8B Base | 50.8 | 52.8 | 54.6 |
| Ministral 8B Base | <u>57.5</u> | <u>57.4</u> | <u>59.6</u> |
| Gemma 2 2B Base | 41.0 | 40.1 | 41.7 |
| Llama 3.2 3B Base | 42.3 | 42.2 | 43.1 |
| Ministral 3B Base | <u>49.1</u> | <u>48.3</u> | <u>49.5</u> |
Instruct Models
<u>Chat/Arena (gpt-4o judge)</u>
| Model | MTBench | Arena Hard | WildBench |
|---|---|---|---|
| Mistral 7B Instruct v0.3 | 6.7 | 44.3 | 33.1 |
| Llama 3.1 8B Instruct | 7.5 | 62.4 | 37.0 |
| Gemma 2 9B Instruct | 7.6 | 68.7 | <u>43.8</u> |
| Ministral 8B Instruct | <u>8.3</u> | <u>70.9</u> | 41.3 |
| Gemma 2 2B Instruct | 7.5 | 51.7 | 32.5 |
| Llama 3.2 3B Instruct | 7.2 | 46.0 | 27.2 |
| Ministral 3B Instruct | <u>8.1</u> | <u>64.3</u> | <u>36.3</u> |
<u>Code & Math</u>
| Model | MBPP pass@1 | HumanEval pass@1 | Math maj@1 |
|---|---|---|---|
| Mistral 7B Instruct v0.3 | 50.2 | 38.4 | 13.2 |
| Gemma 2 9B Instruct | 68.5 | 67.7 | 47.4 |
| Llama 3.1 8B Instruct | 69.7 | 67.1 | 49.3 |
| Ministral 8B Instruct | <u>70.0</u> | <u>76.8</u> | <u>54.5</u> |
| Gemma 2 2B Instruct | 54.5 | 42.7 | 22.8 |
| Llama 3.2 3B Instruct | 64.6 | 61.0 | 38.4 |
| Ministral 3B Instruct | <u>67.7</u> | <u>77.4</u> | <u>51.7</u> |
<u>Function calling</u>
| Model | Internal bench |
|---|---|
| Mistral 7B Instruct v0.3 | 6.9 |
| Llama 3.1 8B Instruct | N/A |
| Gemma 2 9B Instruct | N/A |
| Ministral 8B Instruct | <u>31.6</u> |
| Gemma 2 2B Instruct | N/A |
| Llama 3.2 3B Instruct | N/A |
| Ministral 3B Instruct | <u>28.4</u> |
Usage Examples
vLLM (recommended)
We recommend using this model with the vLLM library to implement production-ready inference pipelines.
[!IMPORTANT] vLLM is currently capped at a 32k context size because interleaved attention kernels for paged attention are not yet implemented in vLLM. These kernels are being worked on, and this model card will be updated as soon as they are fully supported. To take advantage of the full 128k context size, we recommend Mistral Inference.
Installation
Make sure you install vLLM >= v0.6.4:
pip install --upgrade vllm
Also make sure you have mistral_common >= 1.4.4 installed:
pip install --upgrade mistral_common
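After installing, it can be handy to confirm that both packages meet the minimum versions stated above. The sketch below uses only the standard library. `parse_release`, `meets_minimum` and `check_requirements` are hypothetical helpers, and the version parsing is deliberately simple: it compares only the leading numeric components, so a pre-release such as `0.6.4rc1` is treated like `0.6.4`.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as stated in the installation notes.
REQUIREMENTS = {"vllm": "0.6.4", "mistral_common": "1.4.4"}


def parse_release(text: str) -> tuple[int, ...]:
    """Extract leading numeric components, e.g. '0.6.3.post1' -> (0, 6, 3)."""
    parts = []
    for piece in text.lstrip("v").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # stop at suffixes such as 'rc1' or 'post1'
            break
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    return parse_release(installed) >= parse_release(minimum)


def check_requirements(requirements: dict = REQUIREMENTS) -> dict:
    """Return a short status string per package."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```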
You can also make use of a ready-to-go Docker image.
Offline
from vllm import LLM
from vllm.sampling_params import SamplingParams
model_name = "mistralai/Ministral-8B-Instruct-2410"
sampling_params = SamplingParams(max_tokens=8192)
# Note that running Ministral 8B on a single GPU requires 24 GB of GPU RAM.
# To split the GPU requirement over multiple devices, pass e.g. `tensor_parallel_size=2` to `LLM(...)`.
llm = LLM(model=model_name, tokenizer_mode="mistral", config_format="mistral", load_format="mistral")
prompt = "Do we need to think for 10 seconds to find the answer of 1 + 1?"
messages = [
    {
        "role": "user",
        "content": prompt
    },
]
outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
# You don't need to think for 10 seconds to find the answer to 1 + 1. The answer is 2,
# and you can easily add these two numbers in your mind very quickly without any delay.
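The 24 GB figure in the comment above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes 2 bytes per value (bf16 or fp16) for both weights and the key/value cache, takes the layer and head counts from the architecture table, and ignores activations, framework overhead and any savings from sliding-window layers. It is a rough upper-bound style estimate, not a statement of what vLLM actually allocates.

```python
PARAMS = 8_019_808_256      # from the architecture table
LAYERS = 36
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2         # assumption: bf16/fp16 weights and cache
GIB = 1024 ** 3


def weight_bytes() -> int:
    """Memory needed to hold the weights at the assumed precision."""
    return PARAMS * BYTES_PER_VALUE


def kv_cache_bytes(tokens: int) -> int:
    """Key and value vectors for every layer and KV head, per cached token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * tokens


if __name__ == "__main__":
    print(f"weights:          {weight_bytes() / GIB:.2f} GiB")        # 14.94 GiB
    print(f"KV cache @ 32k:   {kv_cache_bytes(32_768) / GIB:.2f} GiB")  # 4.50 GiB
```

Under these assumptions, the weights plus one full 32k-token sequence come to a little under 20 GiB, which is consistent with the single-GPU requirement quoted above once runtime overhead is added.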
Server
You can also use Ministral-8B in a server/client setting.
- Spin up a server:
vllm serve mistralai/Ministral-8B-Instruct-2410 --tokenizer_mode mistral --config_format mistral --load_format mistral
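Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below is a minimal standard-library client. It assumes the server is reachable at `http://localhost:8000`, which is vLLM's default when no host or port is given; `build_chat_request` and `send_chat` are hypothetical helper names chosen for this example.

```python
import json
from urllib import request

MODEL = "mistralai/Ministral-8B-Instruct-2410"


def build_chat_request(prompt: str, model: str = MODEL, max_tokens: int = 256) -> dict:
    """Build the JSON body for a single-turn chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to the chat completions route and return the reply text.

    base_url is an assumption (default local address); adjust it to
    wherever the server is actually listening.
    """
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(send_chat("Do we need to think for 10 seconds to find the answer of 1 + 1?"))` should return a reply similar to the offline example.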
...