Qwen 1.5 110B (Alibaba, chat)

Parameters: 110B
Context length: 32K
Benchmarks: 7
Quantizations: 4
Architecture: Dense
Released: 2024-02-04
Layers: 80
KV Heads: 8
Head Dim: 128
Family: qwen

Qwen1.5-110B-Chat

Introduction

Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Compared with the previously released Qwen, the improvements include:

  • 9 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B dense models, and an MoE model of 14B with 2.7B activated;
  • Significant performance improvement in human preference for chat models;
  • Multilingual support of both base and chat models;
  • Stable support of 32K context length for models of all sizes;
  • No need for trust_remote_code.

For more details, please refer to our blog post and GitHub repo.

Model Details

Qwen1.5 is a series of decoder-only language models in several sizes. For each size, we release both the base language model and the aligned chat model. The models are based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and code. For this beta version, we have temporarily not included GQA (except for the 32B and 110B models) or the mixture of SWA and full attention.
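With only 8 KV heads against 80 layers and a 128-dim head, grouped query attention keeps the KV cache compact. A back-of-the-envelope sketch using the configuration above (our estimate, not an official figure; assumes fp16 cache entries at 2 bytes each):

```python
# Rough KV-cache footprint for Qwen1.5-110B, from the config listed
# above: 80 layers, 8 KV heads, head dim 128. Assumes fp16 (2 B) cache.

def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the separate key and value tensors per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()   # 327,680 bytes (~320 KiB)
full_context = per_token * 32 * 1024     # at the full 32K context
print(f"{per_token} B/token, {full_context / 2**30:.1f} GiB at 32K")
# -> 327680 B/token, 10.0 GiB at 32K
```

With the full 80 query heads cached instead (no GQA), the same arithmetic would give ten times that, which is why the 32B and 110B models adopt GQA first.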

Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements

The code for Qwen1.5 is included in the latest Hugging Face transformers, and we advise you to install transformers>=4.37.0, or you might encounter the following error:

KeyError: 'qwen2'
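That KeyError means the installed transformers predates the qwen2 model type. A minimal sketch of the version guard (the helper name and the naive x.y.z comparison are ours, for illustration only):

```python
# Guard against the KeyError above: Qwen2-style configs require
# transformers >= 4.37.0.

def needs_upgrade(installed: str, required: str = "4.37.0") -> bool:
    # Naive numeric comparison; adequate for plain x.y.z version strings
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) < to_tuple(required)

print(needs_upgrade("4.36.2"))  # True  -> would raise KeyError: 'qwen2'
print(needs_upgrade("4.40.0"))  # False
```

In practice, `pip install -U "transformers>=4.37.0"` resolves the error.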

Quickstart

The following snippet uses apply_chat_template to show how to load the tokenizer and model and how to generate content.

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-110B-Chat",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-110B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Tips

  • If you encounter code switching or other degraded outputs, we advise you to use the hyper-parameters provided in generation_config.json.

Quantizations & VRAM

Quantization   Bits per weight   VRAM required   Quality
Q4_K_M         4.5 bpw           62.4 GB         94%
Q6_K           6.5 bpw           89.9 GB         97%
Q8_0           8 bpw             110.5 GB        100%
FP16           16 bpw            220.5 GB        100%
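The weight figures above follow roughly from parameters x bits per weight / 8. A minimal sketch of that arithmetic (our estimate for the weights alone; actual usage adds KV cache, activations, and runtime overhead, and the table's exact figures include format-specific metadata):

```python
# Back-of-the-envelope VRAM for model weights alone:
# parameters * bits-per-weight / 8 bytes, reported in decimal GB.

def weight_vram_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8), ("FP16", 16)]:
    print(f"{name}: {weight_vram_gb(110e9, bpw):.1f} GB")
```

With the nominal 110B parameters this lands within a couple of GB of each table row; the small differences come from mixed-precision layers in the k-quants and from file metadata.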

Benchmarks (7)

IFEval: 59.4
BBH: 45.0
MMLU-PRO: 42.5
BigCodeBench: 35.0
MATH: 23.4
MUSR: 16.3
GPQA: 12.2

Run with Ollama

$ ollama run qwen:110b

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

Apple M1 Ultra (64GB)
64 GB VRAM • 800 GB/s
APPLE
$2499
Apple M2 Ultra (64GB)
64 GB VRAM • 800 GB/s
APPLE
$2999
Apple M4 Max (64GB)
64 GB VRAM • 546 GB/s
APPLE
$2899
Apple M2 Max (64GB)
64 GB VRAM • 400 GB/s
APPLE
$2299
Apple M3 Max (64GB)
64 GB VRAM • 300 GB/s
APPLE
$2799
Apple M4 Pro (64GB)
64 GB VRAM • 273 GB/s
APPLE
$2599
AMD Radeon Instinct MI200
64 GB VRAM • 1640 GB/s
AMD
AMD Radeon Instinct MI210
64 GB VRAM • 1640 GB/s
AMD
NVIDIA H100 SXM5 64 GB
64 GB VRAM • 2020 GB/s
NVIDIA
NVIDIA Jetson AGX Orin 64 GB
64 GB VRAM • 205 GB/s
NVIDIA
NVIDIA Jetson T4000
64 GB VRAM • 273 GB/s
NVIDIA
NVIDIA RTX PRO 5000 72 GB Blackwell
72 GB VRAM • 1340 GB/s
NVIDIA
NVIDIA H100 SXM5 80GB
80 GB VRAM • 3350 GB/s
NVIDIA
$25000
NVIDIA H100 PCIe 80GB
80 GB VRAM • 2000 GB/s
NVIDIA
$25000
NVIDIA A100 SXM 80GB
80 GB VRAM • 2039 GB/s
NVIDIA
$10000
NVIDIA A100 PCIe 80GB
80 GB VRAM • 1935 GB/s
NVIDIA
$10000
NVIDIA A100 SXM4 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA A100 PCIe 80 GB
80 GB VRAM • 1940 GB/s
NVIDIA
NVIDIA A100X
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H100 PCIe 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H100 SXM5 80 GB
80 GB VRAM • 3360 GB/s
NVIDIA
NVIDIA H100 CNX
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA A800 PCIe 80 GB
80 GB VRAM • 1940 GB/s
NVIDIA
NVIDIA A800 SXM4 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H800 PCIe 80 GB
80 GB VRAM • 2040 GB/s
NVIDIA
NVIDIA H800 SXM5
80 GB VRAM • 3360 GB/s
NVIDIA
NVIDIA RTX 6000D
84 GB VRAM • 1570 GB/s
NVIDIA
NVIDIA B200
90 GB VRAM • 4100 GB/s
NVIDIA
NVIDIA H100 NVL 94 GB
94 GB VRAM • 3940 GB/s
NVIDIA
NVIDIA H100 SXM5 94 GB
94 GB VRAM • 3360 GB/s
NVIDIA
