Questions & straight answers.
Everything you need to know about running LLMs locally.
A local LLM (Large Language Model) runs entirely on your own computer, without sending data to external servers. This gives you complete privacy, offline access, and zero API costs. Popular tools for running local LLMs include Ollama, llama.cpp, vLLM, LM Studio, KoboldCpp, and Jan.
Local LLMs offer: 1) Privacy — your data never leaves your machine, 2) No recurring costs — once downloaded, models are free to use forever, 3) Offline access — works without internet, 4) No rate limits — run as many queries as your hardware allows, 5) Customization — fine-tune models for your specific needs, 6) Sovereignty — no content policies or vendor lock-in.
The top open-source models in 2025 (Llama 4, Qwen 3.5, DeepSeek R1/V3, Gemma 3) match or exceed GPT-4o on many benchmarks. For coding, Qwen 3.5 and DeepSeek V3.1 are competitive with Claude. For everyday tasks like chat, writing, and coding assistance, modern local models are excellent. The quality depends heavily on model size — a 70B model significantly outperforms a 7B model.
Quantization reduces model precision to save memory. A model in FP16 (16-bit) takes 2 bytes per parameter, while Q4 (4-bit) takes only ~0.5 bytes — a 4x reduction. This lets you run larger models on limited hardware. Quality tradeoff: Q8 is nearly lossless (~0.5% degradation), Q4_K_M has minimal quality loss (~6%), Q3 shows noticeable degradation. For most users, Q4_K_M is the sweet spot of size vs quality.
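As a quick sketch of that arithmetic (bytes per parameter follow the bits-divided-by-eight rule; real quantized files deviate slightly because some tensors are kept at higher precision):

```python
# Approximate bytes per parameter at each precision (bits / 8).
BYTES_PER_PARAM = {
    "FP16": 2.0,    # 16 bits
    "Q8":   1.0,    # 8 bits
    "Q4":   0.5,    # 4 bits (Q4_K_M files run slightly larger in practice)
    "Q3":   0.375,  # 3 bits
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Weight size in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"7B @ {quant:<4} = {model_size_gb(7, quant):4.1f} GB")
# FP16 = 14.0, Q8 = 7.0, Q4 = 3.5, Q3 = 2.6 GB
```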
MoE models like Mixtral, DeepSeek V3, and Qwen 3 MoE activate only a subset of parameters per token. For example, DeepSeek V3 has 671B total parameters but only uses ~37B per token, giving 70B-like quality with much faster inference. The tradeoff: you still need VRAM for ALL parameters. MoE models are ideal if you have enough VRAM but want faster speed.
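A short sketch of why MoE changes the calculus: memory must hold every expert, but the per-token bandwidth cost scales only with the active parameters. The parameter counts are the DeepSeek V3 figures above; the Q4 byte width and the 1,000 GB/s device are assumptions for illustration:

```python
BYTES_Q4 = 0.5  # assumed ~4-bit quantization (see the rule of thumb below)

def moe_footprint(total_b: float, active_b: float, bandwidth_gb_s: float):
    memory_gb = total_b * BYTES_Q4           # ALL experts must stay resident
    read_per_token_gb = active_b * BYTES_Q4  # only active experts are read per token
    tok_s = bandwidth_gb_s / read_per_token_gb
    return memory_gb, tok_s

mem, speed = moe_footprint(total_b=671, active_b=37, bandwidth_gb_s=1000)
print(f"VRAM needed ≈ {mem:.0f} GB, but ≈ {speed:.0f} tok/s on a 1,000 GB/s device")
# ≈ 336 GB of memory, yet per-token reads behave like a ~19 GB dense model
```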
Rule of thumb: VRAM (GB) ≈ Parameters (B) × Bits / 8. For example, a 7B model at Q4 needs ~3.5GB VRAM. Practical guide: 8GB VRAM → up to 7B models, 12GB → up to 13B models, 16GB → up to 14B models, 24GB → up to 30B models at Q4, 48GB+ → 70B models. You can also offload to system RAM, but inference is much slower.
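The rule of thumb translates directly into a few lines of Python; a minimal sketch covering the common size classes at Q4 (weights only, which is why the practical guide above leaves extra headroom for KV cache and runtime overhead):

```python
def vram_needed_gb(params_billion: float, bits: int) -> float:
    """VRAM (GB) ≈ Parameters (B) × Bits / 8, counting weights only."""
    return params_billion * bits / 8

for size_b in (7, 13, 30, 70):
    print(f"{size_b:>3}B @ Q4 ≈ {vram_needed_gb(size_b, 4):5.1f} GB of weights")
# 7B ≈ 3.5, 13B ≈ 6.5, 30B ≈ 15.0, 70B ≈ 35.0 GB, before KV cache and overhead
```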
Token generation is memory-bandwidth bound, not compute-bound. For each token generated, the entire model must be read from memory. Your speed (tok/s) ≈ Bandwidth (GB/s) / Model Size (GB). This is why the RTX 3090 and 4090 have similar LLM performance — their memory bandwidth is comparable. The RTX 5090 is faster because it has 1,792 GB/s bandwidth vs 1,008 GB/s for the 4090.
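Put as a sketch, using the bandwidth figures from this answer (plus the 3090's 936 GB/s) and a 7B model at Q4 (~3.5 GB) as the assumed workload:

```python
def estimate_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Decode speed ≈ bandwidth / bytes read per token (the whole model)."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 3.5  # 7B at Q4, from the rule of thumb above
for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{name}: ~{estimate_tok_s(bw, MODEL_GB):.0f} tok/s (theoretical upper bound)")
# 3090 ≈ 267, 4090 ≈ 288, 5090 ≈ 512; real numbers land below this ceiling
```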
Yes! AMD GPUs work well with llama.cpp (via ROCm or Vulkan) and Ollama. The RX 7900 XTX (24GB) and new RX 9070 XT are popular choices. Performance is typically 80-90% of equivalent NVIDIA GPUs. The main limitation is software support — some inference engines have better NVIDIA optimization, but the gap is closing.
Absolutely! Apple Silicon (M1-M4) is excellent for local LLMs. The unified memory architecture means the GPU can use most of system RAM as VRAM (macOS reserves a portion for the OS) — an M4 Max with 128GB RAM can run 70B models at high-quality quantization (Q8). Metal Performance Shaders provide good acceleration. Ollama and llama.cpp have native Apple Silicon support. The M4 Ultra with 384GB unified memory can even run 400B+ models.
CPU inference works but is 5-20x slower than GPU — typically 1-15 tok/s vs 30-100+ on GPU. Viable for: 1) Testing models before investing in a GPU, 2) Running models too large for any available GPU, 3) Servers with lots of RAM but no GPU. Modern CPUs with AVX-512 (e.g. AMD Zen 4/Zen 5) or AMX (Intel Xeon Sapphire Rapids and later) provide significant speedups. Apple Silicon unified memory offers a middle ground.
One big GPU is almost always better. Multi-GPU adds complexity and overhead (70-90% efficiency). However, multi-GPU is necessary for models that exceed single-GPU VRAM (e.g., 70B at Q8 needs ~72GB). With NVLink: ~90% efficiency. Without (PCIe only): ~70-85%. For most users, a single RTX 4090 (24GB) or 5090 (32GB) is the best choice.
Top coding models (2025): 1) Qwen 2.5 Coder 32B — best overall for code, 2) DeepSeek R1 Distill 32B — excellent reasoning for debugging, 3) Qwen 3.5 — strong all-around with code, 4) Llama 3.3 70B — good for large VRAM setups. Look for HumanEval and BigCodeBench scores. For limited VRAM, Qwen 2.5 Coder 7B is surprisingly capable.
Best general assistants (2025): 1) Qwen 3.5 9B — excellent quality/size ratio, 2) Llama 3.3 70B — top-tier for 70B class, 3) Gemma 3 12B/27B — strong reasoning, 4) MiMo 7B — impressive for its size. Look for "Instruct" variants for chat. For reasoning tasks, DeepSeek R1 distills are hard to beat.
Context length determines how much text the model can process at once. 4K tokens ≈ 3,000 words. Guidelines: Chat — 4-8K is enough. Coding — 16-32K for larger codebases. Document analysis — 32K+. RAG — 8-16K per query. Some models support 128K+ (Qwen 3.5 up to 1M), but longer context uses proportionally more VRAM for KV cache.
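For a feel of where that extra VRAM goes, here is a sketch of the standard KV-cache size calculation; the layer, head, and dimension counts are assumptions roughly typical of a 7-8B model with grouped-query attention, not the numbers of any specific model:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache = 2 (keys + values) × layers × kv_heads × head_dim × tokens × bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Assumed 7-8B-class architecture with grouped-query attention, FP16 cache:
for ctx in (8_192, 32_768, 131_072):
    gb = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx:>7} tokens ≈ {gb:.1f} GB of KV cache")
# ≈ 1.1 GB at 8K, 4.3 GB at 32K, 17.2 GB at 128K, all on top of the model weights
```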
Key benchmarks explained: MMLU-PRO — general knowledge and reasoning, MATH — mathematical problem solving, HumanEval/MBPP — code generation ability, IFEval — instruction following accuracy, BBH — complex reasoning tasks, GPQA — graduate-level science questions, BigCodeBench — real-world coding tasks. All scores are 0-100 (higher is better). Our tool weights benchmarks differently per use case.
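As a rough sketch of what per-use-case weighting means in practice (the weights and scores below are illustrative placeholders, not the tool's actual configuration):

```python
# Hypothetical weights, for illustration only; not the tool's real configuration.
USE_CASE_WEIGHTS = {
    "coding":  {"HumanEval": 0.4, "BigCodeBench": 0.4, "IFEval": 0.2},
    "general": {"MMLU-PRO": 0.4, "IFEval": 0.3, "BBH": 0.3},
}

def weighted_score(benchmarks: dict[str, float], use_case: str) -> float:
    """Weighted average of 0-100 benchmark scores for one use case."""
    weights = USE_CASE_WEIGHTS[use_case]
    return sum(benchmarks.get(name, 0) * w for name, w in weights.items())

example = {"HumanEval": 88, "BigCodeBench": 45, "IFEval": 79}
print(f"coding score: {weighted_score(example, 'coding'):.1f}")  # 69.0
```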
Self-hosting typically pays off when: 1) You have >100 daily users (API costs add up fast), 2) You process sensitive data (privacy/compliance requirements), 3) You need predictable costs (no per-token billing), 4) You want to avoid round trips to a third-party API for lower, more predictable latency. Our Enterprise planner calculates the break-even point — typically 3-8 months depending on usage volume.
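A simplified sketch of the break-even arithmetic; the dollar figures are placeholders, not the planner's real inputs:

```python
def months_to_break_even(hardware_cost: float,
                         monthly_api_cost: float,
                         monthly_selfhost_opex: float) -> float:
    """Months until the upfront hardware spend is paid back by API savings."""
    monthly_savings = monthly_api_cost - monthly_selfhost_opex
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays off at this volume
    return hardware_cost / monthly_savings

# Placeholder numbers: a $2,400 GPU vs. $600/month in API spend
# and $150/month in electricity and upkeep.
print(f"{months_to_break_even(2400, 600, 150):.1f} months")  # 5.3 months
```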
Our Total Cost of Ownership (TCO) estimate includes: 1) Hardware upfront (GPU purchase), 2) Electricity (at $0.12/kWh), 3) Cooling overhead (PUE 1.4 = 40% extra), 4) Maintenance (8% of hardware cost per year), 5) Network costs. This is compared against cloud GPU rental (RunPod, AWS, etc.) and pay-per-use APIs (OpenAI, Anthropic, Google) to find the cheapest option.
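As a sketch of how those components combine into a first-year figure, using the rates listed above ($0.12/kWh, PUE 1.4, 8% maintenance); the hardware price, power draw, uptime, and network cost are assumed for illustration:

```python
def annual_self_host_cost(hardware_cost: float,
                          avg_power_watts: float,
                          hours_per_day: float = 24,
                          kwh_price: float = 0.12,         # electricity rate above
                          pue: float = 1.4,                # cooling overhead
                          maintenance_rate: float = 0.08,  # of hardware cost per year
                          network_per_year: float = 120.0) -> float:
    """First-year TCO: hardware + electricity (incl. cooling) + maintenance + network."""
    kwh_per_year = avg_power_watts / 1000 * hours_per_day * 365
    electricity = kwh_per_year * pue * kwh_price
    maintenance = hardware_cost * maintenance_rate
    return hardware_cost + electricity + maintenance + network_per_year

# Assumed example: a $2,000 GPU drawing 350W around the clock.
print(f"${annual_self_host_cost(2000, 350):,.0f} in year one")  # ≈ $2,795
```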
We scrape retail prices daily from Newegg and Amazon for 66 consumer GPUs. Prices are validated: any single-day swing >50% is rejected to filter scraping errors. MSRP from official sources is shown as fallback when no scraped data is available. For datacenter GPUs (H100, A100), we use list prices which may differ from negotiated enterprise pricing.
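The swing check itself is simple; a sketch of what such a guard might look like (function and parameter names are illustrative, not the pipeline's actual schema):

```python
def is_plausible_price(previous_price: float, new_price: float,
                       max_daily_swing: float = 0.50) -> bool:
    """Reject any single-day price change larger than 50% as a likely scraping error."""
    if previous_price <= 0:
        return new_price > 0  # no history yet; accept any positive price
    change = abs(new_price - previous_price) / previous_price
    return change <= max_daily_swing

print(is_plausible_price(1599.0, 1499.0))  # True: a normal discount
print(is_plausible_price(1599.0, 159.0))   # False: almost certainly a scrape error
```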
Our estimates are based on the theoretical formula: tok/s = bandwidth / model_size. Real-world performance varies ±20% depending on: inference engine (Ollama vs vLLM vs llama.cpp), batch size, context length, and system load. We provide community benchmarks alongside theoretical estimates — real-world data from users with the same hardware for validation.
Model data: scraped weekly from HuggingFace with benchmarks from Open LLM Leaderboard, EvalPlus, and BigCodeBench. GPU data: 121 GPUs with specs from official datasheets. Prices: scraped from Newegg/Amazon, stored in Supabase for historical tracking. All data is validated by automated guardrails before any update is committed.
We maintain a curated list of 121 GPUs (NVIDIA, AMD, Intel, Apple Silicon). If yours is missing, use "Manual Entry" to input your VRAM and bandwidth specs. We add new GPUs when they become widely available — the list is updated automatically via workflows.
Automatically via GitHub Actions: Models & benchmarks — every Monday. GPU retail prices — every Wednesday. GPU price history — daily. Cloud/API pricing — 1st and 15th of each month. All updates go through data validation and create a Pull Request for manual review before merging.