Mistral AI/Dense

Mistral-Large 123B

Tags: chat, thinking, tool use

Parameters: 123B
Context length: 128K
Benchmarks: 7
Quantizations: 4
Architecture: Dense
Released: 2024-07-24
Layers: 88
KV Heads: 8
Head Dim: 128
Family: mistral
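
The layer count, KV head count, and head dimension listed above are enough to estimate the key/value cache footprint, which comes on top of the model weights. A minimal sketch, assuming a 16-bit cache (2 bytes per element) and taking 128K to mean 131,072 tokens; both are assumptions, not figures stated on this page:

```python
def kv_cache_bytes(tokens, layers=88, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = key + value
    return tokens * per_token

per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(128 * 1024)  # assumed meaning of "128K"
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / 1024**3:.0f} GiB for a full 131,072-token context")
```

Under these assumptions the cache grows linearly with sequence length, so long-context workloads need noticeably more memory than the weights alone.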

Model Card for Mistral-Large-Instruct-2407

Mistral-Large-Instruct-2407 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities.

For more details about this model, please refer to our release blog post.

Key features

  • Multi-lingual by design: Dozens of languages supported, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch and Polish.
  • Proficient in coding: Trained on 80+ coding languages such as Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran.
  • Agent-centric: Best-in-class agentic capabilities with native function calling and JSON output.
  • Advanced Reasoning: State-of-the-art mathematical and reasoning capabilities.
  • Mistral Research License: Allows use and modification for research and non-commercial purposes.
  • Large Context: A large 128k context window.

Metrics

Base Pretrained Benchmarks

Benchmark    Score
MMLU         84.0%

Base Pretrained Multilingual Benchmarks (MMLU)

Benchmark     Score
French        82.8%
German        81.6%
Spanish       82.7%
Italian       82.7%
Dutch         80.7%
Portuguese    81.6%
Russian       79.0%
Korean        60.1%
Japanese      78.8%
Chinese       74.8%

Instruction Benchmarks

Benchmark     Score
MT Bench      8.63
Wild Bench    56.3
Arena Hard    73.2

Code & Reasoning Benchmarks

Benchmark          Score
Human Eval         92%
Human Eval Plus    87%
MBPP Base          80%
MBPP Plus          69%

Math Benchmarks

Benchmark                            Score
GSM8K                                93%
Math Instruct (0-shot, no CoT)       70%
Math Instruct (0-shot, CoT)          71.5%

Usage

The model can be used with two different frameworks: mistral_inference and Hugging Face transformers.

Mistral Inference

Install

It is recommended to use mistralai/Mistral-Large-Instruct-2407 with mistral-inference. For HF transformers code snippets, please keep scrolling.

pip install mistral_inference

Download

from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', 'Large')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Mistral-Large-Instruct-2407", allow_patterns=["params.json", "consolidated-*.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)
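
Because the download spans many large shards, it can help to confirm that everything matched by allow_patterns actually arrived before loading the model. A minimal sketch using only the standard library; the helper name missing_files is hypothetical and not part of huggingface_hub or mistral_inference:

```python
from pathlib import Path

# Exact file names requested in the download step above.
REQUIRED = ["params.json", "tokenizer.model.v3"]

def missing_files(model_dir):
    """Return the names or patterns from the download step that are absent in model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in REQUIRED if not (model_dir / name).is_file()]
    # The weights are sharded, so only check that at least one shard is present.
    if not any(model_dir.glob("consolidated-*.safetensors")):
        missing.append("consolidated-*.safetensors")
    return missing

if __name__ == "__main__":
    path = Path.home().joinpath("mistral_models", "Large")
    print(missing_files(path) or "all expected files present")
```

This only checks presence, not completeness or integrity of the shards.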

Chat

After installing mistral_inference, a mistral-chat CLI command should be available in your environment. Given the size of this model, you will need a node with several GPUs (more than 300 GB of combined VRAM). If you have 8 GPUs on your machine, you can chat with the model using:

torchrun --nproc-per-node 8 --no-python mistral-chat $HOME/mistral_models/Large --instruct --max_tokens 256 --temperature 0.7

For example, try a prompt like:

How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar.

Instruction following

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tokenizer.model.v3")
model = Transformer.from_folder(mistral_models_path)

prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar."

completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.7, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)

Function calling

from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest


tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tokenizer.model.v3")
model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
    ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.7, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)
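
When the model decides to call a tool, result holds the call rather than a natural-language answer. A minimal sketch of parsing and dispatching it, assuming the decoded text is a JSON list of objects with name and arguments keys, optionally preceded by a [TOOL_CALLS] marker. That output shape, the sample string, and the stub get_current_weather implementation are all assumptions for illustration; check them against what your installed versions actually print:

```python
import json

def parse_tool_calls(text):
    """Parse decoded model output into a list of {"name", "arguments"} dicts (assumed format)."""
    text = text.strip()
    if text.startswith("[TOOL_CALLS]"):  # assumed marker; strip it if present
        text = text[len("[TOOL_CALLS]"):].strip()
    calls = json.loads(text)
    return calls if isinstance(calls, list) else [calls]

def get_current_weather(location, format):
    # Hypothetical stub; a real implementation would query a weather service.
    return {"location": location, "unit": format, "temperature": None}

# Map tool names declared in the request to local Python callables.
TOOLS = {"get_current_weather": get_current_weather}

# Hypothetical sample output, not captured from a real run.
sample = '[{"name": "get_current_weather", "arguments": {"location": "Paris, France", "format": "celsius"}}]'

for call in parse_tool_calls(sample):
    print(TOOLS[call["name"]](**call["arguments"]))
```

Looking the function up in an explicit dictionary, rather than evaluating the name, keeps the model from invoking anything that was not deliberately exposed.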

Transformers

If you want to use Hugging Face transformers to generate text, you can do something like this.

from transformers import pipeline

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
chatbot = pipeline("text-generation", model="mistralai/Mistral-Large-Instruct-2407")
chatbot(messages)

Function calling with transformers

To use this example, you'll need transformers version 4.42.0 or higher. Please see the function calling guide in the transformers docs for more information.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-Large-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_current_weather(location: str, format: str):
    """
    Get the current weather

    Args:
        location: The city and state, e.g. San Francisco, CA
        format: The temperature unit to use. Infer this from the users location. (choices: ["celsius", "fahrenheit"])
    """
    pass

conversation = [{"role": "user", "content": "What's the weather like in Paris?"}]
tools = [get_current_weather]

# Format and tokenize the tool-use prompt
inputs = tokenizer.apply_chat_template(
    conversation,
    tools=tools,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

...

Quantizations & VRAM

Quantization    Bits per weight    VRAM required    Quality
Q4_K_M          4.5 bpw            69.7 GB          94%
Q6_K            6.5 bpw            100.4 GB         97%
Q8_0            8 bpw              123.5 GB         100%
FP16            16 bpw             246.5 GB         100%
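
The VRAM figures follow closely from the parameter count and the bits per weight (bpw). A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting weights only; how the page derives its exact numbers is not stated, so the comparison below is an observation, not the site's formula:

```python
PARAMS = 123e9  # parameter count from this page

def weight_gb(bits_per_weight, params=PARAMS):
    """Weights-only size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return params * bits_per_weight / 8 / 1e9

# (bits per weight, VRAM listed on this page in GB)
listed = {"Q4_K_M": (4.5, 69.7), "Q6_K": (6.5, 100.4), "Q8_0": (8, 123.5), "FP16": (16, 246.5)}

for name, (bpw, listed_gb) in listed.items():
    est = weight_gb(bpw)
    print(f"{name}: weights ~{est:.1f} GB, listed {listed_gb} GB, difference {listed_gb - est:.1f} GB")
```

The listed values sit roughly 0.5 GB above the weights-only estimate in every row, so they appear to exclude the KV cache, which grows with context length and must be budgeted separately.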

Benchmarks (7)

Benchmark    Score
Arena Elo    1267
IFEval       84.0
BBH          52.7
MMLU-PRO     50.7
MATH         49.5
GPQA         24.9
MUSR         17.2

Run with Ollama

$ ollama run mistral-large:123b

GPUs that can run this model

Listed for Q4_K_M quantization, sorted by VRAM in ascending order.

GPU                                     VRAM      Bandwidth    Vendor    Price
NVIDIA RTX PRO 5000 72 GB Blackwell     72 GB     1340 GB/s    NVIDIA
NVIDIA H100 SXM5 80GB                   80 GB     3350 GB/s    NVIDIA    $25000
NVIDIA H100 PCIe 80GB                   80 GB     2000 GB/s    NVIDIA    $25000
NVIDIA A100 SXM 80GB                    80 GB     2039 GB/s    NVIDIA    $10000
NVIDIA A100 PCIe 80GB                   80 GB     1935 GB/s    NVIDIA    $10000
NVIDIA A100 SXM4 80 GB                  80 GB     2040 GB/s    NVIDIA
NVIDIA A100 PCIe 80 GB                  80 GB     1940 GB/s    NVIDIA
NVIDIA A100X                            80 GB     2040 GB/s    NVIDIA
NVIDIA H100 PCIe 80 GB                  80 GB     2040 GB/s    NVIDIA
NVIDIA H100 SXM5 80 GB                  80 GB     3360 GB/s    NVIDIA
NVIDIA H100 CNX                         80 GB     2040 GB/s    NVIDIA
NVIDIA A800 PCIe 80 GB                  80 GB     1940 GB/s    NVIDIA
NVIDIA A800 SXM4 80 GB                  80 GB     2040 GB/s    NVIDIA
NVIDIA H800 PCIe 80 GB                  80 GB     2040 GB/s    NVIDIA
NVIDIA H800 SXM5                        80 GB     3360 GB/s    NVIDIA
NVIDIA RTX 6000D                        84 GB     1570 GB/s    NVIDIA
NVIDIA B200                             90 GB     4100 GB/s    NVIDIA
NVIDIA H100 NVL 94 GB                   94 GB     3940 GB/s    NVIDIA
NVIDIA H100 SXM5 94 GB                  94 GB     3360 GB/s    NVIDIA
RTX Pro 6000                            96 GB     1792 GB/s    NVIDIA    $8565
NVIDIA H100 PCIe 96 GB                  96 GB     3360 GB/s    NVIDIA
NVIDIA H100 SXM5 96 GB                  96 GB     3360 GB/s    NVIDIA
Intel Data Center GPU Max 1350          96 GB     2460 GB/s    Intel
NVIDIA RTX PRO 6000 Blackwell Server    96 GB     1790 GB/s    NVIDIA
AMD Instinct MI300A                     120 GB    5300 GB/s    AMD       $12000
Apple M4 Max (128GB)                    128 GB    546 GB/s     Apple     $3999
AMD Instinct MI250X                     128 GB    3277 GB/s    AMD       $10000
Apple M1 Ultra (128GB)                  128 GB    800 GB/s     Apple     $4999
Apple M2 Ultra (128GB)                  128 GB    800 GB/s     Apple     $3999
AMD Radeon Instinct MI250               128 GB    3280 GB/s    AMD
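
The bandwidth figures matter because single-stream token generation is usually limited by how fast the weights can be read from memory. A rough rule of thumb (an assumption here, not a figure from this page) is that decode speed is bounded above by memory bandwidth divided by model size, since every weight is read once per generated token. A minimal sketch for a dense model at the Q4_K_M size listed in the quantization section:

```python
MODEL_GB = 69.7  # Q4_K_M VRAM figure from the quantization section

def max_tokens_per_second(bandwidth_gb_s, model_gb=MODEL_GB):
    """Upper bound on decode speed if each token requires reading all weights once."""
    return bandwidth_gb_s / model_gb

# Bandwidth values copied from the GPU list above.
examples = [
    ("Apple M4 Max (128GB)", 546),
    ("NVIDIA A100 SXM 80GB", 2039),
    ("NVIDIA H100 SXM5 80GB", 3350),
]

for name, bandwidth in examples:
    print(f"{name}: at most ~{max_tokens_per_second(bandwidth):.0f} tokens/s")
```

Real throughput will be lower once compute, KV-cache reads, and framework overhead are included, so treat these numbers as a ceiling for comparing devices rather than as a prediction.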
