OpenBMB/Dense

MiniCPM-V 2.6 8B

Tags: vision, chat, Thinking

  • Parameters: 8B
  • Context length: 8K
  • Benchmarks: 8
  • Quantizations: 4
  • HF downloads: 280K
  • Architecture: Dense
  • Released: 2024-08-06
  • Layers: 40
  • KV Heads: 8
  • Head Dim: 128
  • Family: other
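
The layer count, KV head count, head dimension and context length listed above are enough for a rough estimate of KV cache size. The sketch below uses the standard formula for grouped-query attention (one key and one value vector per layer, KV head and token). It assumes a 16-bit cache and that "8K" means 8192 tokens; neither is stated on this page, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading factor of 2 covers the separate key and value tensors.
    bytes_per_value=2 assumes an fp16/bf16 cache (an assumption, not a page fact).
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Figures as listed in the metadata above.
per_token = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, tokens=1)
full_context = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, tokens=8 * 1024)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_context / 1024**3:.2f} GiB at 8192 tokens")
```

Under these assumptions the cache grows linearly with the number of tokens, so image-heavy prompts cost memory in proportion to the visual tokens they produce.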
<h1>A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone</h1>

GitHub | Demo

News

  • [2025.01.14] 🔥🔥 We have open-sourced MiniCPM-o 2.6, which significantly improves on MiniCPM-V 2.6 and supports real-time speech-to-speech conversation and multimodal live streaming.

MiniCPM-V 2.6

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

  • 🔥 Leading Performance. MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.

  • 🖼️ Multi Image Understanding and In-context Learning. MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.

  • 🎬 Video Understanding. MiniCPM-V 2.6 can also accept video inputs, performing conversation and providing dense captions for spatial-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on Video-MME with/without subtitles.

  • 💪 Strong OCR Capability and Others. MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multiple languages, including English, Chinese, German, French, Italian, and Korean.

  • 🚀 Superior Efficiency. In addition to its compact size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as the iPad.

  • 💫 Easy Usage. MiniCPM-V 2.6 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with Gradio and (6) online web demo.

Evaluation

Single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench:

<sup>*</sup> We evaluate this benchmark using chain-of-thought prompting.

<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.

Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
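
As a worked instance of the token density definition above, the sketch below plugs in the two figures quoted in the feature list: a 1344x1344 image and 640 visual tokens. The function name `token_density` is a hypothetical helper. The "implied baseline" is only arithmetic on the "75% fewer" claim, not a measured number for any particular model.

```python
def token_density(width, height, visual_tokens):
    """Pixels encoded per visual token: (# pixels) / (# visual tokens)."""
    return (width * height) / visual_tokens


pixels = 1344 * 1344
density = token_density(1344, 1344, 640)
print(f"{pixels:,} pixels / 640 tokens = {density:.1f} pixels per token")

# If 640 tokens is 75% fewer than a typical model, the implied typical
# token count for the same image is 640 / (1 - 0.75).
baseline_tokens = 640 / (1 - 0.75)
print(f"implied baseline: {baseline_tokens:.0f} tokens")
```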

Multi-image results on Mantis Eval, BLINK Val, Mathverse mv, Sciverse mv, MIRB:

<sup>*</sup> We evaluate the officially released checkpoint by ourselves.

Video results on Video-MME and Video-ChatGPT:

<details> <summary>Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.</summary>
* denotes zero image shots and two additional text shots, following Flamingo.

<sup>+</sup> We evaluate the pretraining ckpt without SFT.

</details>

Examples


We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw, unedited screen recording on an iPad Pro.

Demo

Click here to try the Demo of MiniCPM-V 2.6.

Usage

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:

Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord

# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

# To stream the output, set both sampling=True and stream=True;
# model.chat then returns a generator instead of a string.
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Chat with multiple images

<details> <summary> Click to show Python code running MiniCPM-V 2.6 with multiple images input. </summary>
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
</details>
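
The feature list mentions video input and `decord` appears in the requirements, but the listings here only show still images. One common way to feed a video to an image-based chat interface is to sample a bounded number of frames and pass them as a list of images, in the same way as the multi-image example. The sketch below covers only the index selection. It is an assumption about a reasonable sampling policy, not the model's documented preprocessing, and `sample_frame_indices` and `max_frames=64` are hypothetical.

```python
def sample_frame_indices(total_frames, fps, max_frames=64):
    """Pick roughly one frame per second, thinned uniformly to max_frames.

    total_frames and fps would normally come from a video reader; here they
    are plain numbers so the function can be checked in isolation.
    """
    step = max(1, round(fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Take the midpoint of each of max_frames equal-width buckets.
        gap = len(indices) / max_frames
        indices = [indices[int(i * gap + gap / 2)] for i in range(max_frames)]
    return indices


short_clip = sample_frame_indices(total_frames=300, fps=30)
long_clip = sample_frame_indices(total_frames=30000, fps=30)
print(len(short_clip), len(long_clip))
```

Capping the frame count keeps the number of visual tokens, and therefore memory use, bounded regardless of video length.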

In-context few-shot learning

<details> <summary> Click to view Python code running MiniCPM-V 2.6 with few-shot input. </summary>
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

...
</details>
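
Few-shot prompting is usually expressed as alternating user and assistant turns, with the test query as the final user turn. The sketch below builds such a message list in the `{'role': ..., 'content': [...]}` shape used by the earlier examples. The format of the assistant turns is an assumption, `build_few_shot_msgs` is a hypothetical helper, and strings stand in for `PIL.Image` objects so the snippet runs without image files.

```python
def build_few_shot_msgs(examples, test_image, question):
    """Return alternating user/assistant turns followed by the test query.

    examples: list of (image, answer) pairs used as demonstrations.
    The assistant-turn layout is assumed, not confirmed by the listing above.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    msgs.append({'role': 'user', 'content': [test_image, question]})
    return msgs


# Placeholder strings stand in for the loaded images.
msgs = build_few_shot_msgs(
    examples=[('<image1>', '2023.08.04'), ('<image2>', '2007.04.24')],
    test_image='<image_test>',
    question='production date',
)
for m in msgs:
    print(m['role'], m['content'])
```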

Quantizations & VRAM

  • Q4_K_M (4.5 bpw): 5.5 GB VRAM required, 94% quality
  • Q6_K (6.5 bpw): 7.0 GB VRAM required, 97% quality
  • Q8_0 (8 bpw): 8.6 GB VRAM required, 100% quality
  • FP16 (16 bpw): 16.6 GB VRAM required, 100% quality
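
The bits-per-weight (bpw) figures give a quick estimate of weight storage: parameters × bpw / 8 bytes. The sketch below compares that estimate with the VRAM figures listed above. It assumes a nominal 8e9 parameters and 1 GB = 1e9 bytes. The remainder is whatever the listed figure adds on top of the weights, which this page does not break down.

```python
PARAMS = 8e9  # nominal "8B" parameter count (assumption: exactly 8e9)


def weight_gb(bits_per_weight, params=PARAMS):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# name -> (bits per weight, VRAM in GB as listed above)
listed = {
    "Q4_K_M": (4.5, 5.5),
    "Q6_K": (6.5, 7.0),
    "Q8_0": (8.0, 8.6),
    "FP16": (16.0, 16.6),
}

for name, (bpw, vram) in listed.items():
    weights = weight_gb(bpw)
    print(f"{name}: weights ~{weights:.1f} GB, listed {vram} GB, "
          f"remainder ~{vram - weights:.1f} GB")
```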

Benchmarks (8)

  • MMBench: 76.0
  • IFEval: 60.0
  • MMLU-PRO: 48.0
  • BBH: 45.0
  • MMMU: 45.0
  • GPQA: 30.0
  • MUSR: 12.0
  • MATH: 10.0

Run with Ollama

$ ollama run minicpm-v:8b

GPUs that can run this model

At Q4_K_M quantization. Sorted by minimum VRAM.

  • NVIDIA RTX 3050 6GB: 6 GB VRAM, 168 GB/s, $169
  • Intel Arc A380: 6 GB VRAM, 186 GB/s, $129
  • NVIDIA RTX 2060 6GB: 6 GB VRAM, 336 GB/s, $150
  • NVIDIA GTX 1660 SUPER: 6 GB VRAM, 336 GB/s, $150
  • NVIDIA GTX 1660 Ti: 6 GB VRAM, 288 GB/s, $140
  • NVIDIA GTX 1060 6GB: 6 GB VRAM, 192 GB/s, $80
  • NVIDIA Tesla C2070: 6 GB VRAM, 143 GB/s
  • NVIDIA Tesla C2075: 6 GB VRAM, 150 GB/s
  • NVIDIA Tesla C2090: 6 GB VRAM, 177 GB/s
  • NVIDIA Tesla M2070: 6 GB VRAM, 150 GB/s
  • NVIDIA Tesla M2070-Q: 6 GB VRAM, 150 GB/s
  • NVIDIA Tesla M2075: 6 GB VRAM, 150 GB/s
  • NVIDIA Tesla M2090: 6 GB VRAM, 177 GB/s
  • NVIDIA Tesla X2070: 6 GB VRAM, 177 GB/s
  • NVIDIA Tesla X2090: 6 GB VRAM, 177 GB/s
  • NVIDIA Tesla K20X: 6 GB VRAM, 250 GB/s
  • NVIDIA Tesla K20Xm: 6 GB VRAM, 250 GB/s
  • NVIDIA GeForce GTX 1060 6 GB: 6 GB VRAM, 192 GB/s
  • NVIDIA GeForce GTX 1060 6 GB 9Gbps: 6 GB VRAM, 217 GB/s
  • NVIDIA GeForce GTX 1060 6 GB GDDR5X: 6 GB VRAM, 192 GB/s
  • NVIDIA GeForce GTX 1060 6 GB GP104: 6 GB VRAM, 192 GB/s
  • NVIDIA GeForce GTX 1060 6 GB Rev. 2: 6 GB VRAM, 192 GB/s
  • NVIDIA GeForce GTX 1660: 6 GB VRAM, 192 GB/s
  • NVIDIA GeForce GTX 1660 SUPER: 6 GB VRAM, 336 GB/s
  • NVIDIA GeForce GTX 1660 Ti: 6 GB VRAM, 288 GB/s
  • NVIDIA GeForce RTX 2060: 6 GB VRAM, 336 GB/s, $140
  • NVIDIA GeForce RTX 2060 TU104: 6 GB VRAM, 336 GB/s, $140
  • AMD Radeon RX 5600 OEM: 6 GB VRAM, 288 GB/s
  • AMD Radeon RX 5600 XT: 6 GB VRAM, 288 GB/s, $90
  • AMD Radeon RX 5600M: 6 GB VRAM, 288 GB/s
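
The memory bandwidth figures above matter because token generation is typically memory-bound: producing each token requires reading roughly the whole model from VRAM once. A common rule of thumb, used here as an assumption rather than a measurement, is that decode speed cannot exceed bandwidth divided by model size. The sketch below applies it with the 5.5 GB Q4_K_M footprint listed on this page. It ignores compute limits, the vision encoder and prompt processing, so real throughput will be lower, and `max_decode_tokens_per_s` is a hypothetical helper.

```python
def max_decode_tokens_per_s(bandwidth_gb_s, model_gb=5.5):
    """Upper bound on generated tokens per second for a memory-bound decoder.

    model_gb defaults to the Q4_K_M VRAM figure listed on this page.
    """
    return bandwidth_gb_s / model_gb


# A few cards from the list above: name -> memory bandwidth in GB/s.
cards = {
    "NVIDIA RTX 3050 6GB": 168,
    "Intel Arc A380": 186,
    "NVIDIA RTX 2060 6GB": 336,
    "AMD Radeon RX 5600 XT": 288,
}

for name, bandwidth in cards.items():
    bound = max_decode_tokens_per_s(bandwidth)
    print(f"{name}: at most ~{bound:.1f} tokens/s")
```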
