Mistral AI/Mixture of Experts

Mistral Small 4 119B

Capabilities: chat · coding · reasoning · math · vision · tool use · thinking
Parameters: 119B (6B active)
Context length: 256K
Benchmarks: 7
Quantizations: 6
HF downloads: 200K
Architecture: MoE
Released: 2026-03-16
Layers: 56
KV Heads: 8
Head Dim: 128
Family: mistral

Mistral Small 4 119B A6B

Mistral Small 4 is a powerful hybrid model capable of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families—Instruct, Reasoning (previously called Magistral), and Devstral—into a single model.

With its multimodal capabilities, efficient architecture, and flexible mode switching, it is a powerful general-purpose model for any task. In a latency-optimized setup, Mistral Small 4 achieves a 40% reduction in end-to-end completion time, and in a throughput-optimized setup, it handles 3x more requests per second compared to Mistral Small 3.


Key Features

Mistral Small 4 includes the following architectural choices:

  • MoE: 128 experts, 4 active.
  • 119B parameters, with 6.5B activated per token.
  • 256k context length.
  • Multimodal input: Accepts both text and image input, with text output.
  • Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).

Mistral Small 4 offers the following capabilities:

  • Reasoning Mode: Toggle between a fast instant-reply mode and a reasoning mode, boosting performance with test-time compute when requested.
  • Vision: Analyzes images and provides insights based on visual content, in addition to text.
  • Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
  • System Prompt: Strong adherence and support for system prompts.
  • Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
  • Speed-Optimized: Delivers best-in-class performance and speed.
  • Apache 2.0 License: Open-source license for both commercial and non-commercial use.
  • Large Context Window: Supports a 256k context window.

Recommended Settings

  • Reasoning Effort:
    • 'none' → Do not use reasoning
    • 'high' → Use reasoning (recommended for complex prompts and tasks)
  • Temperature: 0.7 for reasoning_effort="high"; between 0.0 and 0.7 for reasoning_effort="none", depending on the task.

Use Cases

Mistral Small 4 is designed for general chat assistants, coding, agentic tasks, and reasoning tasks (with reasoning mode toggled). Its multimodal capabilities also enable document and image understanding for data extraction and analysis.

Its capabilities are ideal for:

  • Developers interested in coding and agentic capabilities for SWE automation and codebase exploration.
  • Enterprises seeking general chat assistants, agents, and document understanding.
  • Researchers leveraging its math and research capabilities.

Mistral Small 4 is also well-suited for customization and fine-tuning for more specialized tasks.

Examples

  • General chat assistant
  • Document parsing and extraction
  • Coding agent
  • Research assistant
  • Customization & fine-tuning
  • And more...

Benchmarks

Comparison with internal models

Depending on your task, you can trigger reasoning via the per-request reasoning_effort parameter, using the values described under Recommended Settings above.

Comparing Reasoning Models

Comparison with other models

Mistral Small 4 with reasoning achieves competitive scores, matching or surpassing GPT-OSS 120B across all three benchmarks while generating significantly shorter outputs. On AA LCR, Mistral Small 4 scores 0.72 with just 1.6K characters, whereas Qwen models require 3.5-4x more output (5.8-6.1K) for comparable performance. On LiveCodeBench, Mistral Small 4 outperforms GPT-OSS 120B while producing 20% less output. This efficiency reduces latency and inference costs and improves user experience.

Usage

Mistral Small 4 is supported by multiple libraries for inference and fine-tuning. We thank all the contributors and maintainers who helped make this happen.

Inference

The model can be deployed with vLLM (recommended; see below), among other inference libraries.

For optimal performance, we recommend using the Mistral AI API when local serving falls short.

Fine-Tuning

Fine-tune the model via:

vLLM (Recommended)

We recommend using Mistral Small 4 with the vLLM library for production-ready inference.

Installation

[!Tip] Use our custom Docker image with fixes for tool calling and reasoning parsing in vLLM, and the latest Transformers version. We are working with the vLLM team to merge these fixes soon.

Custom Docker: Use the following Docker image: mistralllm/vllm-ms4:latest

docker pull mistralllm/vllm-ms4:latest
docker run -it mistralllm/vllm-ms4:latest

Manual Install: Alternatively, install vLLM from this PR: Add Mistral Guidance.

Note: This PR is expected to be merged into vllm main in the next 1-2 weeks (as of 16.03.2026). Track updates here.

  1. Clone vLLM:
    git clone --branch fix_mistral_parsing https://github.com/juliendenize/vllm.git
    
  2. Install with pre-compiled kernels (from inside the cloned repo):
    cd vllm
    VLLM_USE_PRECOMPILED=1 pip install --editable .
    
  3. Install transformers from main:
    uv pip install git+https://github.com/huggingface/transformers.git
    
    Ensure mistral_common >= 1.10.0 is installed:
    python -c "import mistral_common; print(mistral_common.__version__)"
    

Serve the Model

We recommend a server/client setup:

vllm serve mistralai/Mistral-Small-4-119B-2603 --max-model-len 262144 --tensor-parallel-size 2 --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --max-num-batched-tokens 16384 --max-num-seqs 128 \
  --gpu-memory-utilization 0.8

Ping the Server

Instruction Following

Mistral Small 4 can follow your instructions to the letter.

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

...
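Since the server above is launched with --tool-call-parser mistral and --enable-auto-tool-choice, you can also exercise native function calling. A minimal sketch of the request arguments follows; the get_weather tool and its schema are illustrative, not part of the model card.

```python
# Sketch: request arguments for native function calling against the vLLM
# server started above. The tool definition is illustrative.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request_kwargs = {
    "model": "mistralai/Mistral-Small-4-119B-2603",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}

# Send with an OpenAI-compatible client, e.g.:
#   client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
#   response = client.chat.completions.create(**request_kwargs)
```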

Quantizations & VRAM

Q3_K_M (3.5 bpw): 52.1 GB VRAM required • 90% quality
Q4_K_M (4.5 bpw): 66.9 GB VRAM required • 94% quality
Q5_K_M (5.5 bpw): 81.8 GB VRAM required • 96% quality
Q6_K (6.5 bpw): 96.7 GB VRAM required • 97% quality
Q8_0 (8 bpw): 119.0 GB VRAM required • 100% quality
FP16 (16 bpw): 238.0 GB VRAM required • 100% quality
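The VRAM figures above follow directly from the bits-per-weight of each quantization: weight size in GB ≈ parameters × bpw / 8 / 1e9. A quick sanity check using the 119B parameter count from this card (note that KV cache and activations need memory on top of the weights):

```python
# Check: quantized weight size in GB = parameters * bits_per_weight / 8 / 1e9.
# Uses the 119B parameter count from this card; KV cache and activation
# memory come on top of these figures.

PARAMS = 119e9

def weight_gb(bpw: float) -> float:
    return PARAMS * bpw / 8 / 1e9

for name, bpw in [("Q3_K_M", 3.5), ("Q4_K_M", 4.5), ("Q5_K_M", 5.5),
                  ("Q6_K", 6.5), ("Q8_0", 8.0), ("FP16", 16.0)]:
    print(f"{name}: {weight_gb(bpw):.1f} GB")
# Q4_K_M: 66.9 GB, FP16: 238.0 GB -- matching the table above.
```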

Benchmarks (7)

Arena Elo: 1480
IFEval: 84.0
BBH: 52.7
MMLU-PRO: 50.7
MATH: 49.5
GPQA: 24.9
MUSR: 17.2

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

NVIDIA RTX PRO 5000 72 GB Blackwell: 72 GB VRAM • 1340 GB/s
NVIDIA H100 SXM5 80GB: 80 GB VRAM • 3350 GB/s • $25000
NVIDIA H100 PCIe 80GB: 80 GB VRAM • 2000 GB/s • $25000
NVIDIA A100 SXM 80GB: 80 GB VRAM • 2039 GB/s • $10000
NVIDIA A100 PCIe 80GB: 80 GB VRAM • 1935 GB/s • $10000
NVIDIA A100 SXM4 80 GB: 80 GB VRAM • 2040 GB/s
NVIDIA A100 PCIe 80 GB: 80 GB VRAM • 1940 GB/s
NVIDIA A100X: 80 GB VRAM • 2040 GB/s
NVIDIA H100 PCIe 80 GB: 80 GB VRAM • 2040 GB/s
NVIDIA H100 SXM5 80 GB: 80 GB VRAM • 3360 GB/s
NVIDIA H100 CNX: 80 GB VRAM • 2040 GB/s
NVIDIA A800 PCIe 80 GB: 80 GB VRAM • 1940 GB/s
NVIDIA A800 SXM4 80 GB: 80 GB VRAM • 2040 GB/s
NVIDIA H800 PCIe 80 GB: 80 GB VRAM • 2040 GB/s
NVIDIA H800 SXM5: 80 GB VRAM • 3360 GB/s
NVIDIA RTX 6000D: 84 GB VRAM • 1570 GB/s
NVIDIA B200: 90 GB VRAM • 4100 GB/s
NVIDIA H100 NVL 94 GB: 94 GB VRAM • 3940 GB/s
NVIDIA H100 SXM5 94 GB: 94 GB VRAM • 3360 GB/s
NVIDIA RTX Pro 6000: 96 GB VRAM • 1792 GB/s • $8565
NVIDIA H100 PCIe 96 GB: 96 GB VRAM • 3360 GB/s
NVIDIA H100 SXM5 96 GB: 96 GB VRAM • 3360 GB/s
Intel Data Center GPU Max 1350: 96 GB VRAM • 2460 GB/s
NVIDIA RTX PRO 6000 Blackwell Server: 96 GB VRAM • 1790 GB/s
AMD Instinct MI300A: 120 GB VRAM • 5300 GB/s • $12000
Apple M4 Max (128GB): 128 GB VRAM • 546 GB/s • $3999
AMD Instinct MI250X: 128 GB VRAM • 3277 GB/s • $10000
Apple M1 Ultra (128GB): 128 GB VRAM • 800 GB/s • $4999
Apple M2 Ultra (128GB): 128 GB VRAM • 800 GB/s • $3999
AMD Radeon Instinct MI250: 128 GB VRAM • 3280 GB/s

Find the best GPU for Mistral Small 4 119B

Build Hardware for Mistral Small 4 119B