InternLM2.5 20B

A dense model from Shanghai AI Lab, covering chat, coding, reasoning, thinking, and tool use.

  • Parameters: 19.8B
  • Context length: 1024K
  • Architecture: Dense
  • Released: 2024-08-05
  • Layers: 48
  • KV heads: 8
  • Head dim: 128
  • Family: internlm (InternLM)
  • Benchmarks: 8
  • Quantizations: 4
  • HF downloads: 80K

💻 GitHub Repo · 🤔 Reporting Issues · 📜 Technical Report


Introduction

InternLM2.5 open-sources a 20-billion-parameter base model and a chat model tailored for practical scenarios. The models have the following characteristics:

  • Outstanding reasoning capability: state-of-the-art performance on math reasoning, surpassing models such as Llama3 and Gemma2-27B.

  • Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation has been released in MindSearch. InternLM2.5 also shows better tool-use capabilities in instruction following, tool selection, and reflection. See examples.

InternLM2.5-20B-Chat

Performance Evaluation

We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool OpenCompass. The evaluation covered five dimensions of capability: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Some of the results are shown below; visit the OpenCompass leaderboard for more evaluation results.

Benchmark           InternLM2.5-20B-Chat   Gemma2-27B-IT
MMLU (5-shot)       73.5                   75.0
CMMLU (5-shot)      79.7                   63.3
BBH (3-shot CoT)    76.3                   71.5
MATH (0-shot CoT)   64.7                   50.1
GPQA (0-shot)       33.3                   29.3
  • The evaluation results were obtained with OpenCompass; the evaluation configuration can be found in the configuration files provided by OpenCompass.
  • The numbers may shift across OpenCompass version iterations, so please refer to the latest OpenCompass evaluation results.

Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

Import from Transformers

To load the InternLM2.5 20B Chat model using Transformers, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-20b-chat", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-20b-chat", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
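Under the hood, `model.chat` formats the conversation with the InternLM2 chat template before calling `generate`. A minimal sketch of that ChatML-style layout follows; the `<|im_start|>`/`<|im_end|>` markers match the InternLM2 family's template, but treat the exact layout as an assumption and prefer `tokenizer.apply_chat_template` in real code:

```python
# Illustrative sketch of the ChatML-style prompt layout used by
# InternLM2-family chat models. The authoritative template lives in
# the tokenizer; use tokenizer.apply_chat_template in practice.
def build_prompt(messages):
    """messages: list of {"role": ..., "content": ...} dicts."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model completes it.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
])
print(prompt)
```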

The responses can be streamed using stream_chat:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-20b-chat"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    print(response[length:], flush=True, end="")
    length = len(response)
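The `response[length:]` slicing above is a small incremental-printing idiom: each iteration of `stream_chat` yields the full response generated so far, and only the unseen suffix is printed. Isolated as a helper (the list of cumulative snapshots stands in for the real `stream_chat` generator):

```python
def print_deltas(chunks):
    """Print only the newly generated suffix of each cumulative chunk.

    `chunks` yields progressively longer versions of the same response,
    as stream_chat does. Returns the concatenated deltas, which equal
    the final chunk.
    """
    seen = 0
    out = []
    for text in chunks:
        delta = text[seen:]
        print(delta, end="", flush=True)
        out.append(delta)
        seen = len(text)
    return "".join(out)

# Stand-in stream: cumulative snapshots of a growing response.
final = print_deltas(["He", "Hello", "Hello!"])
print()  # newline after the stream ends
```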

Deployment

llama.cpp

internlm/internlm2_5-20b-chat-gguf offers internlm2_5-20b-chat models in GGUF format in both half precision and various low-bit quantized versions, including q5_0, q5_k_m, q6_k, and q8_0.

LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.

pip install lmdeploy

You can run batch inference locally with the following Python code:

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-20b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

Or you can launch an OpenAI compatible server with the following command:

lmdeploy serve api_server internlm/internlm2_5-20b-chat --model-name internlm2_5-20b-chat --server-port 23333 

Then you can send a chat request to the server:

curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-20b-chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'
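The same request can be made from Python with the standard library. The payload below mirrors the curl call above; the actual POST is left commented out because it requires the api_server on port 23333 to be running:

```python
import json
import urllib.request

def build_payload(model, messages):
    """Assemble the JSON body for an OpenAI-compatible chat completion call."""
    return {"model": model, "messages": messages}

def chat_request(url, payload):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload(
    "internlm2_5-20b-chat",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(json.dumps(payload, indent=2))

# With the api_server from above running:
#   reply = chat_request("http://localhost:23333/v1/chat/completions", payload)
#   print(reply["choices"][0]["message"]["content"])
```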

Find more details in the LMDeploy documentation.

vLLM

Launch OpenAI compatible server with vLLM>=0.3.2:

pip install vllm
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-20b-chat --served-model-name internlm2_5-20b-chat --trust-remote-code

Then you can send a chat request to the server:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "internlm2_5-20b-chat",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce deep learning to me."}
    ]
    }'

Find more details in the vLLM documentation.

Quantizations & VRAM

Quantization   bpw    VRAM required   Quality
Q4_K_M         4.5    11.9 GB         94%
Q6_K           6.5    16.9 GB         97%
Q8_0           8.0    20.6 GB         100%
FP16           16.0   40.4 GB         100%
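The VRAM figures track parameter count times bits per weight, plus runtime overhead (KV cache, activations, CUDA context). A back-of-the-envelope check: each table entry sits roughly 0.8 GB above the raw weight size, consistent with a fixed overhead on top of the weights (the overhead interpretation is an assumption, not a published number):

```python
PARAMS = 19.8e9  # InternLM2.5 20B parameter count

def weight_gb(params, bpw):
    """Raw weight storage in decimal gigabytes for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

# Compare raw weight size against the table's "VRAM required" column.
for name, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.0), ("FP16", 16.0)]:
    print(f"{name}: ~{weight_gb(PARAMS, bpw):.1f} GB of weights")
```

This prints about 11.1, 16.1, 19.8, and 39.6 GB respectively, each roughly 0.8 GB under the table's figures.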

Benchmarks (8)

Benchmark   Score
Arena Elo   1164
IFEval      75.0
HumanEval   72.0
MATH        68.0
BBH         62.8
MMLU-PRO    52.0
MUSR        16.7
GPQA        9.5

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

GPU                               VRAM     Bandwidth   Price
NVIDIA RTX 5070                   12 GB    672 GB/s    $549
NVIDIA RTX 4070 Ti                12 GB    504 GB/s    $799
NVIDIA RTX 4070 SUPER             12 GB    504 GB/s    $599
NVIDIA RTX 4070                   12 GB    504 GB/s    $549
NVIDIA RTX 3080 Ti                12 GB    912 GB/s    $550
NVIDIA RTX 3080 12 GB             12 GB    912 GB/s    $599
NVIDIA RTX 3060 12 GB             12 GB    360 GB/s    $329
NVIDIA GeForce RTX 2060 12 GB     12 GB    336 GB/s    $140
NVIDIA RTX A2000 12 GB            12 GB    288 GB/s
NVIDIA Tesla K40 (c/d/m/s/st/t)   12 GB    288 GB/s
NVIDIA Tesla K80                  12 GB    241 GB/s
NVIDIA Tesla M40                  12 GB    288 GB/s
NVIDIA Tesla P100 PCIe 12 GB      12 GB    549 GB/s
AMD RX 7700 XT                    12 GB    432 GB/s    $449
AMD RX 6700 XT                    12 GB    384 GB/s    $344
AMD RX 6750 XT                    12 GB    432 GB/s    $299
AMD Radeon RX 6800M               12 GB    384 GB/s    $300
AMD Radeon RX 6850M XT            12 GB    432 GB/s
Intel Arc B580                    12 GB    456 GB/s    $249
