InternLM2.5 20B
Model Card
💻 GitHub Repo • 🤔 Reporting Issues • 📜 Technical Report
Introduction
InternLM2.5 has open-sourced a 20-billion-parameter base model and a chat model tailored for practical scenarios. The models have the following characteristics:
- Outstanding reasoning capability: state-of-the-art performance on math reasoning, surpassing models such as Llama3 and Gemma2-27B.
- Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation has been released in MindSearch. InternLM2.5 also shows better tool-use capabilities in instruction following, tool selection, and reflection. See examples.
InternLM2.5-20B-Chat
Performance Evaluation
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool OpenCompass. The evaluation covered five dimensions of capability: disciplinary, language, knowledge, reasoning, and comprehension competence. Some of the results are shown below; you can visit the OpenCompass leaderboard for more.
| Benchmark | InternLM2.5-20B-Chat | Gemma2-27B-IT |
|---|---|---|
| MMLU (5-shot) | 73.5 | 75.0 |
| CMMLU (5-shot) | 79.7 | 63.3 |
| BBH (3-shot CoT) | 76.3 | 71.5 |
| MATH (0-shot CoT) | 64.7 | 50.1 |
| GPQA (0-shot) | 33.3 | 29.3 |
- The evaluation results were obtained from OpenCompass; the evaluation configuration can be found in the configuration files provided by OpenCompass.
- The numbers may shift as OpenCompass iterates across versions, so please refer to the latest results on the OpenCompass leaderboard.
Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
Import from Transformers
To load the InternLM2.5 20B Chat model using Transformers, use the following code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-20b-chat", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is
# loaded in float32, which may cause an out-of-memory (OOM) error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-20b-chat", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
# Hello! How can I help you today?
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```
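If you prefer not to rely on the repository's remote-code `chat` helper, the standard `generate` API works too. A minimal sketch, reusing the `model` and `tokenizer` loaded above and assuming the tokenizer ships a chat template (chat checkpoints on Hugging Face generally do):

```python
# A minimal sketch (not the official example): plain `generate` with the
# tokenizer's chat template, reusing `model` and `tokenizer` from above.
messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```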
The responses can be streamed using `stream_chat`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "internlm/internlm2_5-20b-chat"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = model.eval()
length = 0
for response, history in model.stream_chat(tokenizer, "Hello", history=[]):
    # Print only the newly generated suffix on each iteration.
    print(response[length:], flush=True, end="")
    length = len(response)
```
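Alternatively, Transformers' built-in `TextStreamer` prints tokens as they are generated, without the remote-code `stream_chat` helper. A sketch under the same assumptions as above (chat template available; `model` and `tokenizer` already loaded):

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer.apply_chat_template([{"role": "user", "content": "Hello"}], add_generation_prompt=True, return_tensors="pt").to(model.device)
model.generate(inputs, streamer=streamer, max_new_tokens=256)
```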
Deployment
llama.cpp
internlm/internlm2_5-20b-chat-gguf offers internlm2_5-20b-chat models in GGUF format in both half precision and various low-bit quantized versions, including q5_0, q5_k_m, q6_k, and q8_0.
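For example, you could fetch one of the quantized files and chat with it from the command line. This is a sketch, not an official recipe: the file name below is illustrative, so check the repo's file listing, and older llama.cpp builds name the binary `main` rather than `llama-cli`.

```bash
# Download one quantized GGUF file (file name is illustrative; verify it in the repo)
huggingface-cli download internlm/internlm2_5-20b-chat-gguf internlm2_5-20b-chat-q5_k_m.gguf --local-dir .
# Chat with it in llama.cpp's interactive conversation mode
llama-cli -m internlm2_5-20b-chat-q5_k_m.gguf -cnv -p "You are a helpful assistant."
```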
LMDeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
```bash
pip install lmdeploy
```
You can run batch inference locally with the following Python code:
```python
import lmdeploy

pipe = lmdeploy.pipeline("internlm/internlm2_5-20b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
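Sampling behavior can be adjusted by passing a generation config to the pipeline call. A sketch based on LMDeploy's documented pipeline API; field names may differ across versions, so check the docs for your release:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm2_5-20b-chat")
# Cap output length and soften sampling (values are illustrative).
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
print(response)
```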
Or you can launch an OpenAI-compatible server with the following command:
```bash
lmdeploy serve api_server internlm/internlm2_5-20b-chat --model-name internlm2_5-20b-chat --server-port 23333
```
Then you can send a chat request to the server:
```bash
curl http://localhost:23333/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "internlm2_5-20b-chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce deep learning to me."}
        ]
    }'
```
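Because the server speaks the OpenAI protocol, the official `openai` Python client works as well as curl. A minimal sketch (the API key is a placeholder unless you configured keys on the server):

```python
from openai import OpenAI

# Point the client at the local LMDeploy server.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm2_5-20b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce deep learning to me."},
    ],
)
print(resp.choices[0].message.content)
```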
Find more details in the LMDeploy documentation.
vLLM
Launch an OpenAI-compatible server with vLLM >= 0.3.2:
```bash
pip install vllm
python -m vllm.entrypoints.openai.api_server --model internlm/internlm2_5-20b-chat --served-model-name internlm2_5-20b-chat --trust-remote-code
```
Then you can send a chat request to the server:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "internlm2_5-20b-chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Introduce deep learning to me."}
        ]
    }'
```
Find more details in the vLLM documentation.
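The `openai` client sketch shown in the LMDeploy section works here unchanged; just point `base_url` at http://localhost:8000/v1.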