Why Is My Local LLM So Slow? Every Fix That Actually Works
Getting 2 tok/s instead of 30? Here are the real causes and real fixes, from GPU offloading to context length to wrong inference engine.
The #1 Cause: Your Model Is Running on CPU
If you're getting 2-10 tokens/second on a GPU that should do 30-100+, the model is almost certainly running on CPU instead of GPU. This is the most common issue on r/LocalLLaMA and the fix is simple:
- **Ollama**: GPU offload should be automatic. Run `ollama ps` to check. If it says "CPU", your GPU drivers aren't installed or CUDA isn't working.
- **llama.cpp**: Add `-ngl 999` to offload all layers to the GPU. Without this flag, everything runs on CPU by default.
- **LM Studio**: Check the "GPU Offload" slider and set it to maximum.
**How to verify:** Run `nvidia-smi` while the model is generating. If GPU utilization is 0%, the model isn't using the GPU.
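If you'd rather script the check than eyeball it, here's a minimal Python sketch. The function names are my own; it assumes `nvidia-smi` is on your PATH and uses its standard CSV query flags:

```python
import subprocess

def parse_gpu_utilization(csv_output: str) -> int:
    """Parse the output of `nvidia-smi --query-gpu=utilization.gpu
    --format=csv,noheader,nounits`, e.g. "87" -> 87.
    With multiple GPUs there is one value per line; we take the first."""
    return int(csv_output.strip().splitlines()[0])

def read_gpu_utilization() -> int:
    """Query the first GPU's current utilization (requires nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_utilization(out)
```

Call `read_gpu_utilization()` in a loop while the model is generating; a steady 0% means the model never touched the GPU.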
Cause #2: Context Length Is Eating Your VRAM
Your model loaded fine, but after 20 messages it crashes or slows to a crawl. This is the KV cache problem.
Every token in the conversation history takes VRAM. A 7B model at Q4 uses ~3.5 GB for weights, but the KV cache for 32K context adds another ~4.5 GB. After enough back-and-forth, VRAM fills up and the model starts offloading to CPU.
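To see where a figure like that ~4.5 GB comes from, here's an illustrative back-of-the-envelope calculation. The architecture numbers assume a Mistral-7B-style model (32 layers, grouped-query attention with 8 KV heads, head dimension 128, fp16 cache); your model's config will differ:

```python
def kv_cache_bytes(n_layers: int, ctx_len: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys + values (the factor of 2), stored for
    every layer, for every token in the context window."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem

# Mistral-7B-style model with a full 32K context and an fp16 cache
size = kv_cache_bytes(n_layers=32, ctx_len=32768, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Note the size grows linearly with context length, which is why dropping from 32K to 4K context frees so much VRAM.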
Fix:
- Reduce context: `ollama run model --num-ctx 4096` or `-c 4096` in llama.cpp
- Use a smaller quantization to leave more room for the KV cache
- Clear the conversation and start fresh
- For long documents, use RAG instead of stuffing everything in context
Cause #3: Wrong Quantization for Your VRAM
If the model barely fits in VRAM, performance tanks because there's no room for KV cache and the system starts swapping to RAM.
Rule of thumb: The model should use at most 80% of your VRAM at idle. The remaining 20% is for KV cache, CUDA context, and overhead.
| VRAM | Max model size (80% rule) | Recommended |
|---|---|---|
| 8 GB | ~6.4 GB | 7B Q4 (3.5 GB) — plenty of room |
| 12 GB | ~9.6 GB | 14B Q4 (8 GB) — tight but works |
| 16 GB | ~12.8 GB | 14B Q6 (10 GB) — comfortable |
| 24 GB | ~19.2 GB | 32B Q4 (18 GB) — good fit |
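The "Max model size" column is just the 80% rule applied to each card; a quick sketch (the helper name is my own):

```python
def max_model_gb(vram_gb: float, headroom: float = 0.20) -> float:
    """Largest model file that still leaves `headroom` of VRAM free
    for the KV cache, CUDA context, and other overhead."""
    return vram_gb * (1 - headroom)

for vram in (8, 12, 16, 24):
    print(f"{vram} GB VRAM -> model up to ~{max_model_gb(vram):.1f} GB")
```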
Cause #4: Ollama vs llama.cpp vs vLLM
Not all inference engines are equal:
- Ollama — easiest setup, good performance, best for single-user. Wraps llama.cpp internally.
- llama.cpp — most control, slightly faster than Ollama (no wrapper overhead). Best for tweaking settings.
- vLLM — designed for multi-user serving with continuous batching. Overkill for personal use but essential for production.
- LM Studio — GUI, easy to use, uses llama.cpp underneath. Good for beginners.
For single-user: Ollama or llama.cpp. For serving to multiple users: vLLM. The speed difference between Ollama and raw llama.cpp is typically <5%.
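Whichever engine you pick, measure your actual speed rather than guessing. Ollama's `/api/generate` final response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds); a small sketch that turns those into tok/s:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's /api/generate final response."""
    return eval_count / (eval_duration_ns / 1e9)

# Example response fields: 120 tokens generated in 4 seconds
resp = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(tokens_per_second(resp["eval_count"], resp["eval_duration"]))  # 30.0
```

llama.cpp prints an equivalent tok/s figure in its timing summary, so there you can read it off directly.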
8 Settings That Fix Most Speed Issues
From XDA Developers' research on the most impactful but rarely-changed settings:
- **Flash Attention**: enable it with `-fa` in llama.cpp. Reduces VRAM usage and speeds up long-context generation.
- **Context length**: don't use more than you need. 4K is enough for chat.
- **Batch size**: higher means faster prefill. `-b 512` or more in llama.cpp.
- **Thread count**: for the CPU-side work, set it to your P-core count, not your total thread count.
- **mmap**: memory mapping speeds up model loading. It's on by default in llama.cpp, so just avoid `--no-mmap`.
- **GPU layers**: always offload all layers with `-ngl 999`.
- **KV cache quantization**: `-ctk q8_0` halves KV cache VRAM with minimal quality loss.
- **Prompt caching**: reuse previous prompt processing for follow-up questions.
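The llama.cpp flags from this list can be combined into a single invocation; here's a sketch that assembles the command. The binary name and model path are placeholders, and the thread count of 8 assumes a CPU with 8 P-cores:

```python
settings = {
    "-c": "4096",    # context length: no more than you need
    "-b": "512",     # batch size: faster prefill
    "-t": "8",       # threads: P-core count, not total threads
    "-ngl": "999",   # offload all layers to the GPU
    "-ctk": "q8_0",  # quantize the K cache to halve its VRAM
}

cmd = ["llama-cli", "-m", "model.gguf", "-fa"]  # -fa enables Flash Attention
for flag, value in settings.items():
    cmd += [flag, value]

print(" ".join(cmd))
```

Paste the printed command into a terminal, or pass `cmd` straight to `subprocess.run` if you're scripting the server.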
References & Further Reading
- [1] XDA Developers (2026). *8 local LLM settings most people never touch*
- [2] InsiderLLM (2026). *Local AI Troubleshooting Guide*
- [3] Lyx (2026). *Context Kills VRAM*
- [4] LocalLLM.in (2026). *Ollama VRAM Requirements Guide*
Find the best model for your hardware
Use FitMyLLM to get personalized recommendations based on your GPU, use case, and speed requirements.
Try FitMyLLM. Get weekly updates on new models, GPU deals, and benchmark results.