Qwen2-VL 72B
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What’s New in Qwen2-VL?
Key Enhancements:
- SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding of videos over 20 minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobile phones, robots, etc.: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual support: to serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in many languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Model Architecture Updates:
- Naive Dynamic Resolution: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
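As an illustrative sketch of naive dynamic resolution (not the exact implementation), the number of visual tokens can be thought of as growing with the image's pixel grid rather than being fixed; the 28×28-pixel area per token used below is an assumption for illustration:

```python
import math

def visual_token_count(width: int, height: int, patch: int = 28) -> int:
    """Rough sketch: visual tokens for an image under dynamic resolution,
    assuming each token covers a patch x patch pixel area (28x28 here is
    an illustrative assumption, not the model's exact mapping)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

# A larger image maps to more tokens instead of both images being
# resized to one fixed shape.
print(visual_token_count(1092, 728))  # 39 * 26 = 1014
print(visual_token_count(364, 364))   # 13 * 13 = 169
```

The point of the sketch is only that token count scales with resolution, which is what distinguishes dynamic resolution from fixed-size resizing.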
We have three models with 2, 8 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our Blog and GitHub.
Evaluation
Image Benchmarks
| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Claude-3.5 Sonnet | GPT-4o | Qwen2-VL-72B |
| :--- | :---: | :---: | :---: | :---: |
| MMMU<sub>val</sub> | 58.3 | 68.3 | 69.1 | 64.5 |
| DocVQA<sub>test</sub> | 94.1 | 95.2 | 92.8 | 96.5 |
| InfoVQA<sub>test</sub> | 82.0 | - | - | 84.5 |
| ChartQA<sub>test</sub> | 88.4 | 90.8 | 85.7 | 88.3 |
| TextVQA<sub>val</sub> | 84.4 | - | - | 85.5 |
| OCRBench | 852 | 788 | 736 | 877 |
| MTVQA | 17.3 | 25.7 | 27.8 | 30.9 |
| VCR<sub>en easy</sub> | 84.67 | 63.85 | 91.55 | 91.93 |
| VCR<sub>zh easy</sub> | 22.09 | 1.0 | 14.87 | 65.37 |
| RealWorldQA | 72.2 | 60.1 | 75.4 | 77.8 |
| MME<sub>sum</sub> | 2414.7 | 1920.0 | 2328.7 | 2482.7 |
| MMBench-EN<sub>test</sub> | 86.5 | 79.7 | 83.4 | 86.5 |
| MMBench-CN<sub>test</sub> | 86.3 | 80.7 | 82.1 | 86.6 |
| MMBench-V1.1<sub>test</sub> | 85.5 | 78.5 | 82.2 | 85.9 |
| MMT-Bench<sub>test</sub> | 63.4 | - | 65.5 | 71.7 |
| MMStar | 67.1 | 62.2 | 63.9 | 68.3 |
| MMVet<sub>GPT-4-Turbo</sub> | 65.7 | 66.0 | 69.1 | 74.0 |
| HallBench<sub>avg</sub> | 55.2 | 49.9 | 55.0 | 58.1 |
| MathVista<sub>testmini</sub> | 67.5 | 67.7 | 63.8 | 70.5 |
| MathVision | 16.97 | - | 30.4 | 25.9 |
Video Benchmarks
| Benchmark | Previous SoTA<br><sup>(Open-source LVLM)</sup> | Gemini 1.5-Pro | GPT-4o | Qwen2-VL-72B |
| :--- | :---: | :---: | :---: | :---: |
| MVBench | 69.6 | - | - | 73.6 |
| PerceptionTest<sub>test</sub> | 66.9 | - | - | 68.0 |
| EgoSchema<sub>test</sub> | 62.0 | 63.2 | 72.2 | 77.9 |
| Video-MME<sub>(wo/w subs)</sub> | 66.3/69.6 | 75.0/81.3 | 71.9/77.2 | 71.2/77.8 |
Agent Benchmarks
| Category | Benchmark | Metric | Previous SoTA | GPT-4o | Qwen2-VL-72B |
| :--- | :--- | :---: | :---: | :---: | :---: |
| General | FnCall<sup>[1]</sup> | TM | - | 90.2 | 93.1 |
| | FnCall<sup>[1]</sup> | EM | - | 50.0 | 53.2 |
| Game | Number Line | SR | 89.4<sup>[2]</sup> | 91.5 | 100.0 |
| | BlackJack | SR | 40.2<sup>[2]</sup> | 34.5 | 42.6 |
| | EZPoint | SR | 50.0<sup>[2]</sup> | 85.5 | 100.0 |
| | Point24 | SR | 2.6<sup>[2]</sup> | 3.0 | 4.5 |
| Android | AITZ | TM | 83.0<sup>[3]</sup> | 70.0 | 89.6 |
| | AITZ | EM | 47.7<sup>[3]</sup> | 35.3 | 72.1 |
| AI2THOR | ALFRED<sub>valid-unseen</sub> | SR | 67.7<sup>[4]</sup> | - | 67.8 |
| | ALFRED<sub>valid-unseen</sub> | GC | 75.3<sup>[4]</sup> | - | 75.8 |
| VLN | R2R<sub>valid-unseen</sub> | SR | 79.0 | 43.7<sup>[5]</sup> | 51.7 |
| | REVERIE<sub>valid-unseen</sub> | SR | 61.0 | 31.6<sup>[5]</sup> | 31.0 |
SR, GC, TM, and EM are short for success rate, goal-condition success, type match, and exact match, respectively. ALFRED is supported by SAM<sup>[6]</sup>.
1. Self-Curated Function Call Benchmark by Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything
Multilingual Benchmarks
<table style="width:75%; text-align:center;">
  <tr>
    <th>Models</th> <td>AR</td> <td>DE</td> <td>FR</td> <td>IT</td> <td>JA</td> <td>KO</td> <td>RU</td> <td>TH</td> <td>VI</td> <td>AVG</td>
  </tr>
  <tr>
    <th align="left">Qwen2-VL-72B</th> <td>20.7</td> <td>36.5</td> <td>44.1</td> <td>42.8</td> <td>21.6</td> <td>37.4</td> <td>15.6</td> <td>17.7</td> <td>41.6</td> <td><b>30.9</b></td>
  </tr>
  <tr>
    <th align="left">GPT-4o</th> <td>20.2</td> <td>34.2</td> <td>41.2</td> <td>32.7</td> <td>20.0</td> <td>33.9</td> <td>11.5</td> <td>22.5</td> <td>34.2</td> <td>27.8</td>
  </tr>
  <tr>
    <th align="left">Claude3 Opus</th> <td>15.1</td> <td>33.4</td> <td>40.6</td> <td>34.4</td> <td>19.4</td> <td>27.2</td> <td>13.0</td> <td>19.5</td> <td>29.1</td> <td>25.7</td>
  </tr>
  <tr>
    <th align="left">Gemini Ultra</th> <td>14.7</td> <td>32.3</td> <td>40.0</td> <td>31.8</td> <td>12.3</td> <td>17.2</td> <td>11.8</td> <td>20.3</td> <td>28.6</td> <td>23.2</td>
  </tr>
</table>

Requirements
The code for Qwen2-VL is in the latest Hugging Face `transformers`, and we advise you to build from source with:

```shell
pip install git+https://github.com/huggingface/transformers
```

or you might encounter the following error:

```
KeyError: 'qwen2_vl'
```
Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently, including base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command:

```shell
pip install qwen-vl-utils
```
Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# savings, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-72B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
...
```
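The toolkit consumes conversations as a list of role/content messages whose content interleaves typed items (text, images, videos). A minimal sketch of that structure, with a placeholder image URL, looks like this:

```python
# Sketch of the message structure passed to the processor and to
# process_vision_info; the URL below is a placeholder, not a real asset.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# process_vision_info(messages) separates the image and video inputs
# for the processor; here we only inspect the structure itself.
image_items = [c for m in messages for c in m["content"] if c["type"] == "image"]
print(len(image_items))  # 1
```

Because each content item carries an explicit `type`, the same structure covers URLs, base64-encoded data, and interleaved images and videos in one conversation.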