Deep Dive · 10 min read · 2026-03-12

RTX 3090 vs 4090 vs 5090: Which One Should You Actually Buy for Local AI?

The most asked question on r/LocalLLaMA, answered with real numbers. Spoiler: the 3090 is still shockingly competitive.

The Question Everyone Asks

Every week on r/LocalLLaMA, someone asks: "Should I buy a used 3090, a 4090, or wait for the 5090?" It's the most common hardware question in the local AI community, and the answer isn't as obvious as you'd think.

The short answer: it depends on your budget and which models you want to run. But the long answer involves understanding why LLM inference is fundamentally different from gaming or training.

The Numbers That Actually Matter

| Spec | RTX 3090 | RTX 4090 | RTX 5090 |
|---|---|---|---|
| VRAM | 24 GB GDDR6X | 24 GB GDDR6X | 32 GB GDDR7 |
| Bandwidth | 936 GB/s | 1,008 GB/s | 1,792 GB/s |
| FP16 TFLOPS | 71 | 165 | 209 |
| Price | ~$800-900 (used) | ~$1,600 (new) | ~$2,000 (new) |
| 7B Q4 tok/s | ~267 | ~288 | ~512 |
| 70B Q4 tok/s | ~27 | ~29 | ~51 |

Notice something shocking? The 3090 and 4090 are within 8% of each other for LLM inference. The 4090 has 2.3x the compute (TFLOPS) but only 8% more bandwidth. Generating each token means streaming essentially every model weight out of VRAM once, so token generation is bandwidth-bound, and all that extra compute sits idle.

The 5090 is the real upgrade: 78% more bandwidth thanks to GDDR7 and a wider bus, plus 8 GB more VRAM (32 vs 24), which gets a single card much closer to holding a 70B model at 4-bit quantization.
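
In fact, the throughput figures in the table line up almost exactly with a naive bandwidth model. Here's a back-of-envelope sketch, assuming decode is purely bandwidth-bound and ignoring KV-cache reads and kernel overhead:

```python
# Decode throughput back-of-envelope: every generated token streams all
# model weights from VRAM once, so tok/s ~= bandwidth / model size.

GPUS = {  # memory bandwidth in GB/s, from the spec table above
    "RTX 3090": 936,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
}

def model_size_gb(params_b: float, bits: float) -> float:
    """Weight footprint in GB: params (billions) * bits per weight / 8."""
    return params_b * bits / 8  # 7B at Q4 -> 3.5 GB, 70B at Q4 -> 35 GB

for name, bw in GPUS.items():
    tps_7b = bw / model_size_gb(7, 4)
    tps_70b = bw / model_size_gb(70, 4)
    print(f"{name}: 7B Q4 ~{tps_7b:.0f} tok/s, 70B Q4 ~{tps_70b:.0f} tok/s")

# RTX 3090: 7B Q4 ~267 tok/s, 70B Q4 ~27 tok/s
# RTX 4090: 7B Q4 ~288 tok/s, 70B Q4 ~29 tok/s
# RTX 5090: 7B Q4 ~512 tok/s, 70B Q4 ~51 tok/s
```

Real measurements land somewhat below this ceiling once attention and dequantization overhead enter the picture, but the ranking, and the 3090/4090 near-tie, hold up.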

The Real-World Decision

Buy a used RTX 3090 ($800-900) if:

  • You're just starting with local AI and want to experiment
  • You mainly run 7B-32B models (Qwen 3, Llama 8B, Mistral)
  • Budget is a priority — you get 92% of 4090 LLM performance for 55% of the price
  • You already have a PSU that can handle its 350 W TDP, or you're willing to power-limit the card (see the sketch below)
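
On that last point: a popular r/LocalLLaMA tweak is to power-limit the 3090, since bandwidth-bound inference typically loses only a few percent of speed at lower caps. nvidia-smi -pl 280 does it from a shell; here's a minimal Python sketch via the nvidia-ml-py bindings (the 280 W figure is illustrative, not a tuned recommendation):

```python
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU in the system

# NVML takes milliwatts. 280 W is an illustrative cap for a 350 W card;
# setting limits requires root/admin privileges.
pynvml.nvmlDeviceSetPowerManagementLimit(gpu, 280_000)

limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(gpu) / 1000
print(f"power limit is now {limit_w:.0f} W")
pynvml.nvmlShutdown()
```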

Buy an RTX 4090 ($1,600) if:

  • You also game or do creative work (2.3x the compute of the 3090)
  • You want new hardware with warranty
  • You need compute capability 8.9 features (native FP8, better FlashAttention support)

Buy an RTX 5090 ($2,000) if:

  • You run 70B models regularly and want interactive speed (50+ tok/s)
  • You need 32 GB VRAM for larger models or longer context (see the sizing sketch after this list)
  • You want the best single-GPU performance available
  • You plan to keep it for 3+ years
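
To sanity-check whether a given model actually fits a given card, here's a rough sizing sketch. The layer counts and attention dimensions are generic Llama-style assumptions rather than exact figures for any particular model, and real runtimes add their own overhead:

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (params in billions)."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(ctx: int, layers: int, kv_heads: int = 8, head_dim: int = 128) -> float:
    """FP16 KV cache for grouped-query attention:
    2 tensors (K and V) * layers * kv_heads * head_dim * 2 bytes, per token."""
    return ctx * 2 * layers * kv_heads * head_dim * 2 / 1e9

# Q4-family quants average roughly 4.5 bits/weight; layer counts are rough
for name, params_b, layers in [("7B", 7, 32), ("32B", 32, 64), ("70B", 70, 80)]:
    need = weights_gb(params_b, 4.5) + kv_cache_gb(8192, layers) + 1.5  # + runtime overhead
    print(f"{name} at ~Q4 with 8K context: ~{need:.0f} GB")

# 7B:  ~7 GB  -> trivial on any of the three cards
# 32B: ~22 GB -> fits a 24 GB card
# 70B: ~44 GB -> needs a tighter (~3-bit) quant, CPU offload, or a second GPU
```

That last line is worth noting: even 32 GB needs a tighter quant or some offloading for a dense 70B, which is exactly why the extra VRAM matters so much.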

The r/LocalLLaMA consensus: "Used 3090 is the best value in AI hardware, period. The 5090 is the best performance. The 4090 is stuck in the middle."

What About Apple Silicon?

If you need to run 70B+ models at full quality (Q8 or FP16), no NVIDIA consumer card has enough VRAM. A Mac Studio M4 Ultra with 192 GB unified memory can run models that would require 4x RTX 5090s.

The tradeoff: Apple Silicon bandwidth is lower (~546 GB/s on the M4 Max, and still well below the 5090's 1,792 GB/s even on Ultra-class chips), so tokens per second are lower. But it works, which is more than any single NVIDIA card can say for 70B FP16.

For most people, though, a 70B model at Q4 on a single 5090 (a tight squeeze into 32 GB, ~51 tok/s at the bandwidth ceiling) is more practical than the same model at FP16 on Apple Silicon (140 GB, ~15 tok/s). And Q4 typically retains roughly 94% of FP16 quality.
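
Flipping the earlier sizing formula around gives the largest model a given memory pool can hold. This is a weights-only ceiling with a flat reserve for the OS and runtime; the reserve value is an assumption, and the 192 GB figure is the article's:

```python
def max_params_b(mem_gb: float, bits_per_weight: float, reserve_gb: float = 4) -> float:
    """Largest parameter count (in billions) whose weights fit in mem_gb."""
    return (mem_gb - reserve_gb) * 8 / bits_per_weight

for device, mem in [("RTX 5090 (32 GB)", 32), ("M4 Ultra (192 GB)", 192)]:
    print(f"{device}: ~{max_params_b(mem, 16):.0f}B at FP16, "
          f"~{max_params_b(mem, 4):.0f}B at Q4")

# RTX 5090 (32 GB): ~14B at FP16, ~56B at Q4
# M4 Ultra (192 GB): ~94B at FP16, ~376B at Q4
```

That asymmetry is the whole Apple pitch: you trade generation speed for the ability to load models no single consumer GPU can hold.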

