Microsoft / Dense

Phi-4-multimodal 14B

Tags: chat, vision, reasoning, thinking, tool use

Parameters:      14B
Context length:  128K
Benchmarks:      8
Quantizations:   4
Architecture:    Dense
Released:        2025-01-07
Layers:          32
KV Heads:        8
Head Dim:        128
Family:          phi

πŸŽ‰ Phi-4: [mini-reasoning | reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx]

Model Summary

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that builds on the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. It underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization (DPO), and reinforcement learning from human feedback (RLHF) to support precise instruction adherence and safety measures. Each modality supports the following languages:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
  • Vision: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
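Getting started follows the standard Hugging Face transformers multimodal pattern. Below is a minimal text-plus-image inference sketch modeled on that pattern; the image URL is a placeholder, and the chat tags (`<|user|>`, `<|image_1|>`, `<|end|>`, `<|assistant|>`) should be verified against the official model card:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True, device_map="cuda"
)

# Phi-4 chat format; <|image_1|> marks where the image is injected.
prompt = "<|user|><|image_1|>Describe this chart in one sentence.<|end|><|assistant|>"
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)  # placeholder URL

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```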

πŸ“° Phi-4-multimodal Microsoft Blog

πŸ“– Phi-4-multimodal Technical Report

🏑 Phi Portal

πŸ‘©β€πŸ³ Phi Cookbook

πŸ–₯️ Try It on Azure, GitHub, Nvidia, Huggingface playgrounds

πŸ“± Huggingface Spaces: Thoughts Organizer, Stories Come Alive, Phine Speech Translator

Watch as Phi-4 Multimodal analyzes spoken language to help plan a trip to Seattle, demonstrating its advanced audio processing and recommendation capabilities.

See how Phi-4 Multimodal tackles complex mathematical problems through visual inputs, demonstrating its ability to process and solve equations presented in images.

Explore how Phi-4 Mini functions as an intelligent agent, showcasing its reasoning and task execution abilities in complex scenarios.

Intended Uses

Primary Use Cases

The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require:

  1. Memory/compute constrained environments
  2. Latency bound scenarios
  3. Strong reasoning (especially math and logic)
  4. Function and tool calling
  5. General image understanding
  6. Optical character recognition
  7. Chart and table understanding
  8. Multiple image comparison
  9. Multi-image or video clip summarization
  10. Speech recognition
  11. Speech translation
  12. Speech QA
  13. Speech summarization
  14. Audio understanding

The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
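For function and tool calling (item 4 above), tool definitions are supplied in the system turn and the model answers with a structured call. A hedged sketch of prompt construction: the `<|tool|>…<|/tool|>` wrapper is borrowed from the convention documented for Phi-4-mini-instruct and should be verified against the official card, and `get_weather` is a hypothetical function used only for illustration.

```python
import json

# Hypothetical tool schema; only the name/description/parameters shape is assumed.
tools = [{
    "name": "get_weather",  # hypothetical function
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string", "description": "City name"}},
}]

# Tool definitions embedded in the system turn between <|tool|> tags
# (assumed convention; check the official documentation).
prompt = (
    "<|system|>You are a helpful assistant with access to these tools."
    f"<|tool|>{json.dumps(tools)}<|/tool|><|end|>"
    "<|user|>What's the weather in Seattle?<|end|><|assistant|>"
)
# The model is expected to reply with a JSON tool call such as:
# [{"name": "get_weather", "arguments": {"city": "Seattle"}}]
```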

Use Case Considerations

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language and multimodal models, as well as performance differences across languages, when selecting use cases, and should evaluate and mitigate for accuracy, safety, and fairness before deploying within a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws and regulations (including but not limited to privacy and trade compliance laws) relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Release Notes

This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could talk to the Mini and Vision models only through a pipeline of two models: a speech-recognition model to transcribe audio to text, and a second model for the language or vision task. With that pipeline, the core model never receives the full breadth of the input: it cannot directly observe multiple speakers or background noise, nor jointly align speech, vision, and language information in the same representation space. With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, so all inputs and outputs are processed by the same neural network. The model employs a new architecture and a larger vocabulary for efficiency, adds multilingual and multimodal support, and uses better post-training techniques for instruction following and function calling; together with additional data, this leads to substantial gains on key multimodal capabilities. Phi-4-multimodal-instruct is expected to greatly benefit app developers across a wide range of use cases. The enthusiastic support for the Phi-4 series is greatly appreciated, and feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!

Model Quality


To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for the benchmark methodology). Refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. A high-level overview of model quality on representative speech and vision benchmarks follows.

Speech

Phi-4-multimodal-instruct was observed to:

  • Deliver strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large.
  • Rank number 1 on the Huggingface OpenASR leaderboard with a word error rate of 6.14%, versus 6.5% for the next-best model, as of March 04, 2025.
  • Be the first open-source model capable of speech summarization, with performance close to GPT-4o.
  • Trail closed models such as Gemini-1.5-Flash and GPT-4o-realtime-preview on the speech QA task; work is underway to improve this capability in the next iterations.

Speech Recognition (lower is better)

The performance of Phi-4-multimodal-instruct on the aggregated benchmark datasets:

The performance of Phi-4-multimodal-instruct on different languages, averaging the WERs of CommonVoice and FLEURS:
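Since both results report WER, a quick refresher: WER is the word-level edit distance between the hypothesis and reference transcripts, divided by the reference length. A minimal self-contained implementation for illustration (leaderboards apply text normalization and typically use a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance DP table over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.167
```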

Speech Translation (higher is better)

Translating from German, Spanish, French, Italian, Japanese, Portuguese, and Chinese to English:

Translating from English to German, Spanish, French, Italian, Japanese, Portuguese, and Chinese. Note that WhisperV3 does not support this capability:
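Both ASR and ST are driven entirely by the text instruction that accompanies the audio clip. A hedged sketch reusing the processor and model from the earlier snippet; the audio file name is hypothetical, and the `audios=[(waveform, sample_rate)]` argument follows the Hugging Face model card pattern:

```python
import soundfile as sf

audio, sr = sf.read("speech_de.wav")  # hypothetical German speech clip

# The instruction selects the task: transcription vs. translation.
asr_prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"
st_prompt = "<|user|><|audio_1|>Translate the audio clip into English.<|end|><|assistant|>"

inputs = processor(text=st_prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```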

Speech Summarization (higher is better)

Speech QA

MT-Bench scores are scaled by 10x to match the score range of MMMLU:

Audio Understanding

AIR-Bench scores are scaled by 10x to match the score range of MMAU:

Vision

Vision-Speech tasks

...

Quantizations & VRAM

Quant     Bits per weight    VRAM required    Quality
Q4_K_M    4.5                8.4 GB           94%
Q6_K      6.5                11.9 GB          97%
Q8_0      8.0                14.5 GB          100%
FP16      16.0               28.5 GB          100%
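The VRAM figures above are roughly model size times bits per weight, plus runtime overhead. A back-of-the-envelope estimator for comparison; the ~10% overhead factor is an assumption, not a published number:

```python
def estimate_vram_gb(params_b: float, bpw: float, overhead: float = 1.10) -> float:
    """Weights take params * bpw / 8 bytes; add ~10% (assumed) for KV cache and buffers."""
    weight_gib = params_b * 1e9 * bpw / 8 / 1024**3
    return weight_gib * overhead

for quant, bpw in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.0), ("FP16", 16.0)]:
    print(f"{quant}: ~{estimate_vram_gb(14, bpw):.1f} GB")  # ≈ 8.1 / 11.7 / 14.3 / 28.7 GB
```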

Benchmarks (8)

Arena Elo       1465
IFEval          64.2
BBH             49.4
MMLU-PRO        40.8
BigCodeBench    37.6
MATH            19.6
MUSR            13.1
GPQA            11.5

Run with Ollama

$ ollama run phi4:14b
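Once pulled, the model is also reachable through Ollama's local REST API (default port 11434). A minimal sketch:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi4:14b",
        "prompt": "Explain the 128K context window in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```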

GPUs that can run this model

At Q4_K_M quantization (8.4 GB), sorted by minimum VRAM.

GPU                            VRAM     Bandwidth    Vendor    Price
NVIDIA RTX 3080 10GB           10 GB    760 GB/s     NVIDIA    $429
Intel Arc B570                 10 GB    456 GB/s     Intel     $219
NVIDIA GeForce RTX 3080        10 GB    760 GB/s     NVIDIA    $699
AMD Radeon RX 6700             10 GB    320 GB/s     AMD       -
AMD Radeon RX 6700M            10 GB    320 GB/s     AMD       -
AMD Radeon RX 6750 GRE 10 GB   10 GB    320 GB/s     AMD       -
NVIDIA P102-101                10 GB    320 GB/s     NVIDIA    -
AMD Xbox Series X GPU          10 GB    560 GB/s     AMD       -
NVIDIA CMP 170HX 10 GB         10 GB    1560 GB/s    NVIDIA    -
NVIDIA CMP 50HX                10 GB    560 GB/s     NVIDIA    -
NVIDIA CMP 90HX                10 GB    760 GB/s     NVIDIA    -
AMD Xbox Series X 6nm GPU      10 GB    560 GB/s     AMD       -
NVIDIA RTX 2080 Ti             11 GB    616 GB/s     NVIDIA    $350
NVIDIA GTX 1080 Ti             11 GB    484 GB/s     NVIDIA    $200
NVIDIA GeForce GTX 1080 Ti     11 GB    484 GB/s     NVIDIA    -
NVIDIA GeForce RTX 2080 Ti     11 GB    616 GB/s     NVIDIA    $225
NVIDIA RTX 5070                12 GB    672 GB/s     NVIDIA    $549
NVIDIA RTX 4070 Ti             12 GB    504 GB/s     NVIDIA    $799
NVIDIA RTX 4070 SUPER          12 GB    504 GB/s     NVIDIA    $599
NVIDIA RTX 4070                12 GB    504 GB/s     NVIDIA    $549
NVIDIA RTX 3080 Ti             12 GB    912 GB/s     NVIDIA    $550
NVIDIA RTX 3080 12GB           12 GB    912 GB/s     NVIDIA    $599
NVIDIA RTX 3060 12GB           12 GB    360 GB/s     NVIDIA    $329
AMD RX 7700 XT                 12 GB    432 GB/s     AMD       $449
AMD RX 6700 XT                 12 GB    384 GB/s     AMD       $344
AMD RX 6750 XT                 12 GB    432 GB/s     AMD       $299
Intel Arc B580                 12 GB    456 GB/s     Intel     $249
NVIDIA Tesla K40c              12 GB    288 GB/s     NVIDIA    -
NVIDIA Tesla K40d              12 GB    288 GB/s     NVIDIA    -
NVIDIA Tesla K40m              12 GB    288 GB/s     NVIDIA    -
