Why Is My Local LLM So Slow? Every Fix That Actually Works
Getting 2 tok/s instead of 30? Here are the real causes and real fixes, from GPU offloading to context length to wrong inference engine.
The #1 Cause: Your Model Is Running on CPU
If you're getting 2-10 tokens/second on a GPU that should do 30-100+, the model is almost certainly running on CPU instead of GPU. This is the most common issue on r/LocalLLaMA and the fix is simple:
- **Ollama**: GPU offload should be automatic. Run `ollama ps` to check. If it says "CPU", your GPU drivers aren't installed or CUDA isn't working.
- **llama.cpp**: Add `-ngl 999` to offload all layers to the GPU. Without this flag, everything runs on CPU by default.
- **LM Studio**: Check the "GPU Offload" slider and set it to maximum.
**How to verify:** Run `nvidia-smi` while the model is generating. If GPU utilization is 0%, the model isn't using the GPU.
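If you'd rather script the check than eyeball it, here's a minimal Python sketch. The function names are my own; it assumes `nvidia-smi` is on your PATH and uses its standard CSV query flags:

```python
import subprocess

def parse_gpu_utilization(csv_output: str) -> int:
    """Parse the output of `nvidia-smi --query-gpu=utilization.gpu
    --format=csv,noheader,nounits`, e.g. "87" -> 87.
    With multiple GPUs there is one value per line; we take the first."""
    return int(csv_output.strip().splitlines()[0])

def read_gpu_utilization() -> int:
    """Query the first GPU's current utilization (requires nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_utilization(out)
```

Call `read_gpu_utilization()` in a loop while the model is generating; a steady 0% means the model never touched the GPU.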
Cause #2: Context Length Is Eating Your VRAM
Your model loaded fine, but after 20 messages it crashes or slows to a crawl. This is the KV cache problem.
Every token in the conversation history takes VRAM. A 7B model at Q4 uses ~3.5 GB for weights, but the KV cache for 32K context adds another ~4.5 GB. After enough back-and-forth, VRAM fills up and the model starts offloading to CPU.
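To see where a figure like that ~4.5 GB comes from, here's an illustrative back-of-the-envelope calculation. The architecture numbers assume a Mistral-7B-style model (32 layers, grouped-query attention with 8 KV heads, head dimension 128, fp16 cache); your model's config will differ:

```python
def kv_cache_bytes(n_layers: int, ctx_len: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys + values (the factor of 2), stored for
    every layer, for every token in the context window."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

# Mistral-7B-style model with a full 32K context and an fp16 cache
size = kv_cache_bytes(n_layers=32, ctx_len=32768, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Note the size grows linearly with context length, which is why dropping from 32K to 4K context frees so much VRAM.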
Fix:
- Reduce context: `ollama run model --num-ctx 4096` or `-c 4096` in llama.cpp
- Use a smaller quantization to leave more room for the KV cache
- Clear the conversation and start fresh
- For long documents, use RAG instead of stuffing everything in context
Cause #3: Wrong Quantization for Your VRAM
If the model barely fits in VRAM, performance tanks because there's no room for KV cache and the system starts swapping to RAM.
Rule of thumb: The model should use at most 80% of your VRAM at idle. The remaining 20% is for KV cache, CUDA context, and overhead.
| VRAM | Max model size (80% rule) | Recommended |
|---|---|---|
| 8 GB | ~6.4 GB | 7B Q4 (3.5 GB) — plenty of room |
| 12 GB | ~9.6 GB | 14B Q4 (8 GB) — tight but works |
| 16 GB | ~12.8 GB | 14B Q6 (10 GB) — comfortable |
| 24 GB | ~19.2 GB | 32B Q4 (18 GB) — good fit |
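The "Max model size" column is just the 80% rule applied to each card; a quick sketch (the helper name is my own):

```python
def max_model_gb(vram_gb: float, headroom: float = 0.20) -> float:
    """Largest model file that still leaves `headroom` of VRAM free
    for the KV cache, CUDA context, and other overhead."""
    return vram_gb * (1 - headroom)

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> model up to ~{max_model_gb(vram):.1f} GB")
```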
Cause #4: Ollama vs llama.cpp vs vLLM
Not all inference engines are equal:
- Ollama — easiest setup, good performance, best for single-user. Wraps llama.cpp internally.
- llama.cpp — most control, slightly faster than Ollama (no wrapper overhead). Best for tweaking settings.
- vLLM — designed for multi-user serving with continuous batching. Overkill for personal use but essential for production.
- LM Studio — GUI, easy to use, uses llama.cpp underneath. Good for beginners.
For single-user: Ollama or llama.cpp. For serving to multiple users: vLLM. The speed difference between Ollama and raw llama.cpp is typically <5%.
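Whichever engine you pick, measure your actual speed rather than guessing. Ollama's `/api/generate` final response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds); a small sketch that turns those into tok/s:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's /api/generate final response."""
    return eval_count / (eval_duration_ns / 1e9)

# Example response fields: 120 tokens generated in 4 seconds
resp = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(tokens_per_second(resp["eval_count"], resp["eval_duration"]))  # 30.0
```

llama.cpp prints an equivalent tok/s figure in its timing summary, so there you can read it off directly.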
8 Settings That Fix Most Speed Issues
From XDA Developers' research on the most impactful but rarely-changed settings:
- **Flash Attention**: enable it with `-fa` in llama.cpp. Reduces VRAM usage and speeds up long-context generation.
- **Context length**: don't use more than you need. 4K is enough for chat.
- **Batch size**: higher means faster prefill. `-b 512` or more in llama.cpp.
- **Thread count**: for the CPU-side work, set it to your P-core count, not your total thread count.
- **mmap**: memory mapping speeds up model loading. It's on by default in llama.cpp, so just avoid `--no-mmap`.
- **GPU layers**: always offload all layers with `-ngl 999`.
- **KV cache quantization**: `-ctk q8_0` halves KV cache VRAM with minimal quality loss.
- **Prompt caching**: reuse previous prompt processing for follow-up questions.
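The llama.cpp flags from this list can be combined into a single invocation; here's a sketch that assembles the command. The binary name and model path are placeholders, and the thread count of 8 assumes a CPU with 8 P-cores:

```python
settings = {
    "-c": "4096",    # context length: no more than you need
    "-b": "512",     # batch size: faster prefill
    "-t": "8",       # threads: P-core count, not total threads
    "-ngl": "999",   # offload all layers to the GPU
    "-ctk": "q8_0",  # quantize the K cache to halve its VRAM
}

cmd = ["llama-cli", "-m", "model.gguf", "-fa"]  # -fa enables Flash Attention
for flag, value in settings.items():
    cmd += [flag, value]

print(" ".join(cmd))
```

Paste the printed command into a terminal, or pass `cmd` straight to `subprocess.run` if you're scripting the server.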
References & Further Reading
- [1] XDA Developers (2026). *8 local LLM settings most people never touch*
- [2] InsiderLLM (2026). *Local AI Troubleshooting Guide*
- [3] Lyx (2026). *Context Kills VRAM*
- [4] LocalLLM.in (2026). *Ollama VRAM Requirements Guide*
Find the best model for your hardware
Use FitMyLLM to get personalized recommendations based on your GPU, use case, and speed requirements.
Try FitMyLLM. Get weekly updates on new models, GPU deals, and benchmark results.