FitMyLLM helps you find and run AI models on your own hardware. Enter your GPU — whether it's an NVIDIA RTX 4090, RTX 3090, RTX 3060, AMD RX 7900 XTX, or Apple M4 — and get instant recommendations for the best models that fit your VRAM, with speed estimates and ready-to-run Ollama commands.
Our database covers 330+ open-source LLMs including Llama 4, Qwen 3.5, DeepSeek R1, DeepSeek V3, Gemma 3, Phi-4, Mistral, and more. Each model includes benchmarks (MMLU-Pro, HumanEval, MATH, IFEval), VRAM requirements at every quantization level (Q4_K_M, Q5_K_M, Q6_K, Q8_0, FP16), and compatibility data for 1,720 GPUs.
Whether you need an AI for coding (Qwen 2.5 Coder, DeepSeek Coder), creative writing, chat, reasoning (DeepSeek R1), or document analysis with RAG — FitMyLLM finds the optimal model for your specific hardware in seconds.
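The ready-to-run commands look like this (an illustrative example only, assuming Ollama is already installed; the model tag shown is one of the coding models mentioned above):

```
ollama run qwen2.5-coder:7b
```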
Running LLMs locally requires GPU VRAM (video memory). The amount depends on model size and quantization: a 7B parameter model at Q4 quantization needs about 4GB VRAM, while a 70B model needs 40GB+. Modern GPUs like the RTX 4060 (8GB), RTX 4070 Ti (12GB), RTX 4080 (16GB), and RTX 4090 (24GB) can run increasingly powerful models.
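The sizing rule above can be sketched as a quick back-of-the-envelope calculation. This is an illustrative estimate, not FitMyLLM's exact formula; the 4.5 bits/weight figure for Q4_K_M and the 20% overhead factor are assumptions:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage plus ~20% for KV cache and runtime overhead.

    params_billions * bits_per_weight / 8 gives gigabytes of weights,
    since one billion parameters at 8 bits each is 1 GB.
    """
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

# 7B model at Q4_K_M (~4.5 bits/weight): roughly 4-5 GB
print(f"7B @ Q4:  {estimate_vram_gb(7, 4.5):.1f} GB")
# 70B model at Q4_K_M: roughly 45-50 GB, beyond any single consumer GPU
print(f"70B @ Q4: {estimate_vram_gb(70, 4.5):.1f} GB")
```

The same function shows why FP16 is rarely practical locally: at 16 bits/weight, even a 13B model needs about 30 GB.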
Speed depends on memory bandwidth, not just compute power: during token generation, every model weight must be read from VRAM for each new token. That's why the RTX 3090 (936 GB/s) still competes with the RTX 4090 (1,008 GB/s) for LLM inference. The new RTX 5090, with 1,792 GB/s of GDDR7 bandwidth, is the fastest consumer GPU for local AI.
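Because each generated token reads every weight once, memory bandwidth puts a hard ceiling on decode speed. A minimal sketch of that ceiling (the 70% efficiency factor is an assumption; real throughput varies by runtime and batch size):

```python
def estimate_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float, efficiency: float = 0.7) -> float:
    """Upper-bound decode speed: tokens/s <= bandwidth / model size.

    Real inference typically reaches 50-80% of this ceiling, so we
    apply an assumed efficiency factor.
    """
    return bandwidth_gb_s / model_size_gb * efficiency

# 7B model at Q4 (~4.5 GB) on an RTX 4090 (1,008 GB/s):
print(f"RTX 4090: ~{estimate_tokens_per_sec(1008, 4.5):.0f} tok/s")
# Same model on an RTX 3090 (936 GB/s) lands in the same ballpark,
# which is why the older card remains competitive for inference.
print(f"RTX 3090: ~{estimate_tokens_per_sec(936, 4.5):.0f} tok/s")
```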
Apple Silicon users benefit from unified memory — an M4 Max with 128GB can run 70B models that would require a $2,000+ GPU on PC. FitMyLLM supports all platforms: NVIDIA, AMD, Intel Arc, and Apple M1/M2/M3/M4 chips.
Side-by-side comparison of any set of models. Benchmark scores, VRAM usage, speed estimates, and radar charts. Compare Llama 4 vs Qwen 3.5, DeepSeek R1 vs Gemma 3, or any combination.
Every GPU ranked S-tier to F-tier for running local AI. Based on VRAM, bandwidth, and real model compatibility data — not opinions. Includes NVIDIA RTX, AMD Radeon, Intel Arc, and Apple Silicon.
Plan production LLM deployments with GPU sizing, P95 latency estimation, and cloud vs on-prem TCO analysis. Supports vLLM, TRT-LLM, and SGLang serving engines.