InternLM2.5 20B

A dense model from Shanghai AI Lab, covering chat, coding, reasoning, thinking, and tool use.

  • Parameters: 19.8B
  • Context length: 1024K
  • Architecture: Dense
  • Released: 2024-08-05
  • Layers: 48
  • KV heads: 8
  • Head dim: 128
  • Family: internlm (InternLM)
  • Benchmarks: 8
  • Quantizations: 4
  • HF downloads: 80K

💻 GitHub Repo · 🤔 Reporting Issues · 📜 Technical Report


Introduction

InternLM2.5 open-sources a 20-billion-parameter base model and a chat model tailored for practical scenarios. The models have the following characteristics:

  • Outstanding reasoning capability: state-of-the-art performance on math reasoning, surpassing models such as Llama3 and Gemma2-27B.

  • Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation has been released in MindSearch. InternLM2.5 also shows better tool-use capabilities in instruction following, tool selection, and reflection. See examples.

InternLM2.5-20B-Chat

Performance Evaluation

We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool OpenCompass. The evaluation covered five dimensions of capability: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Some of the results are shown below; visit the OpenCompass leaderboard for more evaluation results.

Benchmark           InternLM2.5-20B-Chat   Gemma2-27B-IT
MMLU (5-shot)       73.5                   75.0
CMMLU (5-shot)      79.7                   63.3
BBH (3-shot CoT)    76.3                   71.5
MATH (0-shot CoT)   64.7                   50.1
GPQA (0-shot)       33.3                   29.3
  • The evaluation results were obtained with OpenCompass; the evaluation configuration can be found in the configuration files provided by OpenCompass.
  • The numbers may shift across OpenCompass version iterations, so please refer to the latest OpenCompass evaluation results.

Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

Import from Transformers

To load the InternLM2.5 20B Chat model using Transformers, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-20b-chat", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-20b-chat", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
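Under the hood, `model.chat` formats the conversation with the InternLM2 chat template before calling `generate`. A minimal sketch of that ChatML-style layout follows; the `<|im_start|>`/`<|im_end|>` markers match the InternLM2 family's template, but treat the exact layout as an assumption and prefer `tokenizer.apply_chat_template` in real code:

```python
# Illustrative sketch of the ChatML-style prompt layout used by
# InternLM2-family chat models. The authoritative template lives in
# the tokenizer; use tokenizer.apply_chat_template in practice.
def build_prompt(messages):
    """messages: list of {"role": ..., "content": ...} dicts."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
])
print(prompt)
```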

The responses can be streamed using stream_chat:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-20b-chat"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    print(response[length:], flush=True, end="")
    length = len(response)
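The `response[length:]` slicing above is a small incremental-printing idiom: each iteration of `stream_chat` yields the full response generated so far, and only the unseen suffix is printed. Isolated as a helper (the list of cumulative snapshots stands in for the real `stream_chat` generator):

```python
def print_deltas(chunks):
    """Print only the newly generated suffix of each cumulative chunk.

    `chunks` yields progressively longer versions of the same response,
    as stream_chat does. Returns the concatenated deltas, which equal
    the final chunk.
    """
    seen = 0
    out = []
    for text in chunks:
        delta = text[seen:]
        print(delta, end="", flush=True)
        out.append(delta)
        seen = len(text)
    return "".join(out)

# Stand-in stream: cumulative snapshots of a growing response.
final = print_deltas(["He", "Hello", "Hello!"])
print()  # newline after the stream ends
```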

Deployment

llama.cpp

internlm/internlm2_5-20b-chat-gguf offers internlm2_5-20b-chat models in GGUF format in both half precision and various low-bit quantized versions, including q5_0, q5_k_m, q6_k, and q8_0.

LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

pip install lmdeploy

You can run batch inference locally with the following Python code:

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-20b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Or you can launch an OpenAI compatible server with the following command:

lmdeploy serve api_server internlm/internlm2_5-20b-chat --model-name internlm2_5-20b-chat --server-port 23333 

Then you can send a chat request to the server:

curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-20b-chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'
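The same request can be made from Python with the standard library. The payload below mirrors the curl call above; the actual POST is left commented out because it requires the api_server on port 23333 to be running:

```python
import json
import urllib.request

def build_payload(model, messages):
    """Assemble the JSON body for an OpenAI-compatible chat completion call."""
    return {"model": model, "messages": messages}

def chat_request(url, payload):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload(
    "internlm2_5-20b-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(json.dumps(payload, indent=2))

# With the api_server from above running:
#   reply = chat_request("http://localhost:23333/v1/chat/completions", payload)
#   print(reply["choices"][0]["message"]["content"])
```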

Find more details in the LMDeploy documentation.

vLLM

Launch OpenAI compatible server with vLLM>=0.3.2:

pip install vllm
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-20b-chat --served-model-name internlm2_5-20b-chat --trust-remote-code

Then you can send a chat request to the server:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-20b-chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'

Find more details in the vLLM documentation.

Quantizations & VRAM

Quantization   bpw    VRAM required   Quality
Q4_K_M         4.5    11.9 GB         94%
Q6_K           6.5    16.9 GB         97%
Q8_0           8.0    20.6 GB         100%
FP16           16.0   40.4 GB         100%
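The VRAM figures track parameter count times bits per weight, plus runtime overhead (KV cache, activations, CUDA context). A back-of-the-envelope check: each table entry sits roughly 0.8 GB above the raw weight size, consistent with a fixed overhead on top of the weights (the overhead interpretation is an assumption, not a published number):

```python
PARAMS = 19.8e9  # InternLM2.5 20B parameter count

def weight_gb(params, bpw):
    """Raw weight storage in decimal gigabytes for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

# Compare raw weight size against the table's "VRAM required" column.
for name, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.0), ("FP16", 16.0)]:
    print(f"{name}: ~{weight_gb(PARAMS, bpw):.1f} GB of weights")
```

This prints about 11.1, 16.1, 19.8, and 39.6 GB respectively, each roughly 0.8 GB under the table's figures.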

Benchmarks (8)

Benchmark   Score
Arena Elo   1164
IFEval      75.0
HumanEval   72.0
MATH        68.0
BBH         62.8
MMLU-PRO    52.0
MUSR        16.7
GPQA        9.5

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

GPU                               VRAM     Bandwidth   Price
NVIDIA RTX 5070                   12 GB    672 GB/s    $549
NVIDIA RTX 4070 Ti                12 GB    504 GB/s    $799
NVIDIA RTX 4070 SUPER             12 GB    504 GB/s    $599
NVIDIA RTX 4070                   12 GB    504 GB/s    $549
NVIDIA RTX 3080 Ti                12 GB    912 GB/s    $550
NVIDIA RTX 3080 12 GB             12 GB    912 GB/s    $599
NVIDIA RTX 3060 12 GB             12 GB    360 GB/s    $329
NVIDIA GeForce RTX 2060 12 GB     12 GB    336 GB/s    $140
NVIDIA RTX A2000 12 GB            12 GB    288 GB/s
NVIDIA Tesla K40 (c/d/m/s/st/t)   12 GB    288 GB/s
NVIDIA Tesla K80                  12 GB    241 GB/s
NVIDIA Tesla M40                  12 GB    288 GB/s
NVIDIA Tesla P100 PCIe 12 GB      12 GB    549 GB/s
AMD RX 7700 XT                    12 GB    432 GB/s    $449
AMD RX 6700 XT                    12 GB    384 GB/s    $344
AMD RX 6750 XT                    12 GB    432 GB/s    $299
AMD Radeon RX 6800M               12 GB    384 GB/s    $300
AMD Radeon RX 6850M XT            12 GB    432 GB/s
Intel Arc B580                    12 GB    456 GB/s    $249
