Questions & straight answers.
Everything you need to know about running LLMs locally.
A local LLM (Large Language Model) runs entirely on your own computer, without sending data to external servers. This gives you complete privacy, offline access, and zero API costs. Popular tools for running local LLMs include Ollama, llama.cpp, vLLM, LM Studio, KoboldCpp, and Jan.
Local LLMs offer: 1) Privacy — your data never leaves your machine, 2) No recurring costs — once downloaded, models are free to use forever, 3) Offline access — works without internet, 4) No rate limits — run as many queries as your hardware allows, 5) Customization — fine-tune models for your specific needs, 6) Sovereignty — no content policies or vendor lock-in.
The top open-source models in 2025 (Llama 4, Qwen 3.5, DeepSeek R1/V3, Gemma 3) match or exceed GPT-4o on many benchmarks. For coding, Qwen 3.5 and DeepSeek V3.1 are competitive with Claude. For everyday tasks like chat, writing, and coding assistance, modern local models are excellent. The quality depends heavily on model size — a 70B model significantly outperforms a 7B model.
Quantization reduces model precision to save memory. A model in FP16 (16-bit) takes 2 bytes per parameter, while Q4 (4-bit) takes only ~0.5 bytes — a 4x reduction. This lets you run larger models on limited hardware. Quality tradeoff: Q8 is nearly lossless (~0.5% degradation), Q4_K_M has minimal quality loss (~6%), Q3 shows noticeable degradation. For most users, Q4_K_M is the sweet spot of size vs quality.
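As a quick sketch of that arithmetic (bytes per parameter follow the bits-divided-by-eight rule; real quantized files deviate slightly because some tensors are kept at higher precision):

```python
# Approximate bytes per parameter at each precision (bits / 8).
BYTES_PER_PARAM = {
    "FP16": 2.0,    # 16 bits
    "Q8":   1.0,    # 8 bits
    "Q4":   0.5,    # 4 bits (Q4_K_M files run slightly larger in practice)
    "Q3":   0.375,  # 3 bits
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Weight size in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in BYTES_PER_PARAM:
    print(f"7B @ {quant:<4} = {model_size_gb(7, quant):4.1f} GB")
# FP16 = 14.0, Q8 = 7.0, Q4 = 3.5, Q3 = 2.6 GB
```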
MoE models like Mixtral, DeepSeek V3, and Qwen 3 MoE activate only a subset of parameters per token. For example, DeepSeek V3 has 671B total parameters but only uses ~37B per token, giving 70B-like quality with much faster inference. The tradeoff: you still need VRAM for ALL parameters. MoE models are ideal if you have enough VRAM but want faster speed.
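A short sketch of why MoE changes the calculus: memory must hold every expert, but the per-token bandwidth cost scales only with the active parameters. The parameter counts are the DeepSeek V3 figures above; the Q4 byte width and the 1,000 GB/s device are assumptions for illustration:

```python
BYTES_Q4 = 0.5  # assumed ~4-bit quantization (see the rule of thumb below)

def moe_footprint(total_b: float, active_b: float, bandwidth_gb_s: float):
    memory_gb = total_b * BYTES_Q4           # ALL experts must stay resident
    read_per_token_gb = active_b * BYTES_Q4  # only active experts are read per token
    tok_s = bandwidth_gb_s / read_per_token_gb
    return memory_gb, tok_s

mem, speed = moe_footprint(total_b=671, active_b=37, bandwidth_gb_s=1000)
print(f"VRAM needed ≈ {mem:.0f} GB, but ≈ {speed:.0f} tok/s on a 1,000 GB/s device")
# ≈ 336 GB of memory, yet per-token reads behave like a ~19 GB dense model
```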
Rule of thumb: VRAM (GB) ≈ Parameters (B) × Bits / 8. For example, a 7B model at Q4 needs ~3.5GB VRAM. Practical guide: 8GB VRAM → up to 7B models, 12GB → up to 13B models, 16GB → up to 14B models, 24GB → up to 30B models at Q4, 48GB+ → 70B models. You can also offload to system RAM, but inference is much slower.
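The rule of thumb translates directly into a few lines of Python; a minimal sketch covering the common size classes at Q4 (weights only, which is why the practical guide above leaves extra headroom for KV cache and runtime overhead):

```python
def vram_needed_gb(params_billion: float, bits: int) -> float:
    """VRAM (GB) ≈ Parameters (B) × Bits / 8, counting weights only."""
    return params_billion * bits / 8

for size_b in (7, 13, 30, 70):
    print(f"{size_b:>3}B @ Q4 ≈ {vram_needed_gb(size_b, 4):5.1f} GB of weights")
# 7B ≈ 3.5, 13B ≈ 6.5, 30B ≈ 15.0, 70B ≈ 35.0 GB, before KV cache and overhead
```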
Token generation is memory-bandwidth bound, not compute-bound. For each token generated, the entire model must be read from memory. Your speed (tok/s) ≈ Bandwidth (GB/s) / Model Size (GB). This is why the RTX 3090 and 4090 have similar LLM performance — their memory bandwidth is comparable. The RTX 5090 is faster because it has 1,792 GB/s bandwidth vs 1,008 GB/s for the 4090.
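Put as a sketch, using the bandwidth figures from this answer (plus the 3090's 936 GB/s) and a 7B model at Q4 (~3.5 GB) as the assumed workload:

```python
def estimate_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Decode speed ≈ bandwidth / bytes read per token (the whole model)."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 3.5  # 7B at Q4, from the rule of thumb above
for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
    print(f"{name}: ~{estimate_tok_s(bw, MODEL_GB):.0f} tok/s (theoretical upper bound)")
# 3090 ≈ 267, 4090 ≈ 288, 5090 ≈ 512; real numbers land below this ceiling
```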
Yes! AMD GPUs work well with llama.cpp (via ROCm or Vulkan) and Ollama. The RX 7900 XTX (24GB) and new RX 9070 XT are popular choices. Performance is typically 80-90% of equivalent NVIDIA GPUs. The main limitation is software support — some inference engines have better NVIDIA optimization, but the gap is closing.
Absolutely! Apple Silicon (M1-M4) is excellent for local LLMs. The unified memory architecture means the GPU can use most of system RAM as VRAM (macOS reserves a portion for the OS) — an M4 Max with 128GB RAM can run 70B models at high-quality quantization (Q8). Metal Performance Shaders provide good acceleration. Ollama and llama.cpp have native Apple Silicon support. The M4 Ultra with 384GB unified memory can even run 400B+ models.
CPU inference works but is 5-20x slower than GPU — typically 1-15 tok/s vs 30-100+ on GPU. Viable for: 1) Testing models before investing in a GPU, 2) Running models too large for any available GPU, 3) Servers with lots of RAM but no GPU. Modern CPUs with AVX-512 (e.g. AMD Zen 4/Zen 5) or AMX (Intel Xeon Sapphire Rapids and later) provide significant speedups. Apple Silicon unified memory offers a middle ground.
One big GPU is almost always better. Multi-GPU adds complexity and overhead (70-90% efficiency). However, multi-GPU is necessary for models that exceed single-GPU VRAM (e.g., 70B at Q8 needs ~72GB). With NVLink: ~90% efficiency. Without (PCIe only): ~70-85%. For most users, a single RTX 4090 (24GB) or 5090 (32GB) is the best choice.
Top coding models (2025): 1) Qwen 2.5 Coder 32B — best overall for code, 2) DeepSeek R1 Distill 32B — excellent reasoning for debugging, 3) Qwen 3.5 — strong all-around with code, 4) Llama 3.3 70B — good for large VRAM setups. Look for HumanEval and BigCodeBench scores. For limited VRAM, Qwen 2.5 Coder 7B is surprisingly capable.
Best general assistants (2025): 1) Qwen 3.5 9B — excellent quality/size ratio, 2) Llama 3.3 70B — top-tier for 70B class, 3) Gemma 3 12B/27B — strong reasoning, 4) MiMo 7B — impressive for its size. Look for "Instruct" variants for chat. For reasoning tasks, DeepSeek R1 distills are hard to beat.
Context length determines how much text the model can process at once. 4K tokens ≈ 3,000 words. Guidelines: Chat — 4-8K is enough. Coding — 16-32K for larger codebases. Document analysis — 32K+. RAG — 8-16K per query. Some models support 128K+ (Qwen 3.5 up to 1M), but longer context uses proportionally more VRAM for KV cache.
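For a feel of where that extra VRAM goes, here is a sketch of the standard KV-cache size calculation; the layer, head, and dimension counts are assumptions roughly typical of a 7-8B model with grouped-query attention, not the numbers of any specific model:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache = 2 (keys + values) × layers × kv_heads × head_dim × tokens × bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Assumed 7-8B-class architecture with grouped-query attention, FP16 cache:
for ctx in (8_192, 32_768, 131_072):
    gb = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx:>7} tokens ≈ {gb:.1f} GB of KV cache")
# ≈ 1.1 GB at 8K, 4.3 GB at 32K, 17.2 GB at 128K, all on top of the model weights
```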
Key benchmarks explained: MMLU-PRO — general knowledge and reasoning, MATH — mathematical problem solving, HumanEval/MBPP — code generation ability, IFEval — instruction following accuracy, BBH — complex reasoning tasks, GPQA — graduate-level science questions, BigCodeBench — real-world coding tasks. All scores are 0-100 (higher is better). Our tool weights benchmarks differently per use case.
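As a rough sketch of what per-use-case weighting means in practice (the weights and scores below are illustrative placeholders, not the tool's actual configuration):

```python
# Hypothetical weights, for illustration only; not the tool's real configuration.
USE_CASE_WEIGHTS = {
    "coding":  {"HumanEval": 0.4, "BigCodeBench": 0.4, "IFEval": 0.2},
    "general": {"MMLU-PRO": 0.4, "IFEval": 0.3, "BBH": 0.3},
}

def weighted_score(benchmarks: dict[str, float], use_case: str) -> float:
    """Weighted average of 0-100 benchmark scores for one use case."""
    weights = USE_CASE_WEIGHTS[use_case]
    return sum(benchmarks.get(name, 0) * w for name, w in weights.items())

example = {"HumanEval": 88, "BigCodeBench": 45, "IFEval": 79}
print(f"coding score: {weighted_score(example, 'coding'):.1f}")  # 69.0
```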
Self-hosting typically pays off when: 1) You have >100 daily users (API costs add up fast), 2) You process sensitive data (privacy/compliance requirements), 3) You need predictable costs (no per-token billing), 4) You want to avoid round trips to a third-party API for lower, more predictable latency. Our Enterprise planner calculates the break-even point — typically 3-8 months depending on usage volume.
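A simplified sketch of the break-even arithmetic; the dollar figures are placeholders, not the planner's real inputs:

```python
def months_to_break_even(hardware_cost: float,
                         monthly_api_cost: float,
                         monthly_selfhost_opex: float) -> float:
    """Months until the upfront hardware spend is paid back by API savings."""
    monthly_savings = monthly_api_cost - monthly_selfhost_opex
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays off at this volume
    return hardware_cost / monthly_savings

# Placeholder numbers: a $2,400 GPU vs. $600/month in API spend
# and $150/month in electricity and upkeep.
print(f"{months_to_break_even(2400, 600, 150):.1f} months")  # 5.3 months
```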
Our Total Cost of Ownership (TCO) estimate includes: 1) Hardware upfront (GPU purchase), 2) Electricity (at $0.12/kWh), 3) Cooling overhead (PUE 1.4 = 40% extra), 4) Maintenance (8% of hardware cost per year), 5) Network costs. This is compared against cloud GPU rental (RunPod, AWS, etc.) and pay-per-use APIs (OpenAI, Anthropic, Google) to find the cheapest option.
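As a sketch of how those components combine into a first-year figure, using the rates listed above ($0.12/kWh, PUE 1.4, 8% maintenance); the hardware price, power draw, uptime, and network cost are assumed for illustration:

```python
def annual_self_host_cost(hardware_cost: float,
                          avg_power_watts: float,
                          hours_per_day: float = 24,
                          kwh_price: float = 0.12,         # electricity rate above
                          pue: float = 1.4,                # cooling overhead
                          maintenance_rate: float = 0.08,  # of hardware cost per year
                          network_per_year: float = 120.0) -> float:
    """First-year TCO: hardware + electricity (incl. cooling) + maintenance + network."""
    kwh_per_year = avg_power_watts / 1000 * hours_per_day * 365
    electricity = kwh_per_year * pue * kwh_price
    maintenance = hardware_cost * maintenance_rate
    return hardware_cost + electricity + maintenance + network_per_year

# Assumed example: a $2,000 GPU drawing 350W around the clock.
print(f"${annual_self_host_cost(2000, 350):,.0f} in year one")  # ≈ $2,795
```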
We scrape retail prices daily from Newegg and Amazon for 66 consumer GPUs. Prices are validated: any single-day swing >50% is rejected to filter scraping errors. MSRP from official sources is shown as fallback when no scraped data is available. For datacenter GPUs (H100, A100), we use list prices which may differ from negotiated enterprise pricing.
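The swing check itself is simple; a sketch of what such a guard might look like (function and parameter names are illustrative, not the pipeline's actual schema):

```python
def is_plausible_price(previous_price: float, new_price: float,
                       max_daily_swing: float = 0.50) -> bool:
    """Reject any single-day price change larger than 50% as a likely scraping error."""
    if previous_price <= 0:
        return new_price > 0  # no history yet; accept any positive price
    change = abs(new_price - previous_price) / previous_price
    return change <= max_daily_swing

print(is_plausible_price(1599.0, 1499.0))  # True: a normal discount
print(is_plausible_price(1599.0, 159.0))   # False: almost certainly a scrape error
```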
Our estimates are based on the theoretical formula: tok/s = bandwidth / model_size. Real-world performance varies ±20% depending on: inference engine (Ollama vs vLLM vs llama.cpp), batch size, context length, and system load. We provide community benchmarks alongside theoretical estimates — real-world data from users with the same hardware for validation.
Model data: scraped weekly from HuggingFace with benchmarks from Open LLM Leaderboard, EvalPlus, and BigCodeBench. GPU data: 121 GPUs with specs from official datasheets. Prices: scraped from Newegg/Amazon, stored in Supabase for historical tracking. All data is validated by automated guardrails before any update is committed.
We maintain a curated list of 121 GPUs (NVIDIA, AMD, Intel, Apple Silicon). If yours is missing, use "Manual Entry" to input your VRAM and bandwidth specs. We add new GPUs when they become widely available — the list is updated automatically via workflows.
Automatically via GitHub Actions: Models & benchmarks — every Monday. GPU retail prices — every Wednesday. GPU price history — daily. Cloud/API pricing — 1st and 15th of each month. All updates go through data validation and create a Pull Request for manual review before merging.