How the numbers work.
How we calculate performance estimates, costs, and recommendations.
Model Size Estimation
The memory footprint of a model depends on its parameter count and quantization level; a sketch of the estimate follows the table below.
Quantization levels:
| Level | Bits/Weight | Quality | 7B VRAM |
|---|---|---|---|
| Q4_K_M | 4.5 | 94% | ~4.5 GB |
| Q6_K | 6.5 | 97% | ~6.3 GB |
| Q8_0 | 8.0 | 99.5% | ~7.7 GB |
| FP16 | 16.0 | 100% | ~14.8 GB |
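As a rough sketch, weight memory is parameters × bits-per-weight ÷ 8, plus runtime overhead for embeddings and buffers; the flat ~10% overhead below is an assumption for illustration, not the exact term we use:

```python
def model_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.10) -> float:
    """Rough VRAM needed for model weights, in GB.

    params_b: parameter count in billions (e.g. 7 for a 7B model)
    bits_per_weight: effective bits per weight for the quantization level
    overhead: assumed ~10% extra for embeddings, buffers, etc.
    """
    return params_b * bits_per_weight / 8 * overhead

print(round(model_vram_gb(7, 6.5), 1))  # ~6.3 GB for 7B Q6_K
print(round(model_vram_gb(7, 8.0), 1))  # ~7.7 GB for 7B Q8_0
```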
Token Generation Speed
Token generation (decode) is memory-bandwidth bound: each new token requires streaming essentially all of the model weights from memory, so speed scales with memory bandwidth divided by model size.
The efficiency factor varies by GPU tier, calibrated against real llama.cpp benchmarks from XiongjieDai/GPU-Benchmarks and llama.cpp CUDA benchmarks:
| GPU Tier | Efficiency | Examples |
|---|---|---|
| Entry-level (<400 GB/s) | 0.53 | RTX 4060 Ti, RTX 3060 |
| Mid-range (400-700 GB/s) | 0.47 | RTX 4070, RTX 5070, RTX 4080 |
| High-end (700-1200 GB/s) | 0.45 | RTX 4090, RTX 3090, RTX 5080 |
| Ultra (>1200 GB/s) | 0.36 | RTX 5090, RTX Pro 6000 |
| Apple Silicon | 0.40-0.70 | M1-M4 (varies by chip tier) |
Entry-level GPUs show higher efficiency because small models fit entirely in L2 cache. Datacenter GPUs show lower single-user efficiency because they are optimized for batched inference.
Other GPU vendors: AMD GPUs (Vulkan backend) achieve ~0.50 efficiency. Intel Arc GPUs achieve ~0.18 due to immature llama.cpp support.
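In sketch form, decode speed is usable bandwidth divided by the bytes read per generated token, which we approximate here with the quantized weight size (real kernels also read the KV cache, so this is an approximation):

```python
def decode_tok_s(bandwidth_gb_s: float, model_gb: float, efficiency: float) -> float:
    """Approximate single-user decode speed in tokens/second.

    Each generated token streams (roughly) all model weights from VRAM,
    so tok/s is close to usable bandwidth divided by model size.
    """
    return bandwidth_gb_s / model_gb * efficiency

# RTX 4090 (1008 GB/s) running an 8B Q4 model (~4.5 GB of weights), high-end tier 0.45
print(round(decode_tok_s(1008, 4.5, 0.45)))  # ~101 tok/s at the 16K-context baseline
```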
Context length scaling: Longer contexts slow down decode due to KV cache attention overhead. Our base efficiency is calibrated at 16K context. We apply a log-linear scaling factor:
| Context | Factor | RTX 4090 8B Q4 |
|---|---|---|
| 4K | 1.40 | ~141 tok/s |
| 16K | 1.00 | ~101 tok/s |
| 32K | 0.78 | ~79 tok/s |
| 65K | 0.56 | ~57 tok/s |
| 131K | 0.34 | ~34 tok/s |
Calibrated from Hardware Corner RTX 4090 benchmarks (Qwen3 8B Q4, March 2026). Factor clamped to [0.25, 1.40].
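The published factors are consistent with a slope of roughly 0.22 per doubling of context relative to the 16K baseline; the sketch below reproduces the table, but the exact constant is inferred from it rather than quoted from our code:

```python
import math

def context_factor(context_tokens: int, baseline: int = 16384, slope: float = 0.22) -> float:
    """Log-linear decode slowdown vs. the 16K baseline, clamped to [0.25, 1.40]."""
    factor = 1.0 - slope * math.log2(context_tokens / baseline)
    return max(0.25, min(1.40, factor))

for ctx in (4096, 16384, 32768, 65536, 131072):
    print(ctx, round(context_factor(ctx), 2))  # 1.4, 1.0, 0.78, 0.56, 0.34
```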
Examples:
| GPU | Bandwidth | 70B Q4 (35GB) | 7B Q4 (3.5GB) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~13 tok/s | ~101 tok/s |
| RTX 5090 | 1,792 GB/s | ~18 tok/s | ~143 tok/s |
| RTX 3060 | 360 GB/s | — | ~42 tok/s |
Prefill Speed (Prompt Processing)
Processing the input prompt (prefill) is compute-bound: prompt tokens are processed in parallel, so throughput depends on the GPU's compute (TFLOPS) rather than its memory bandwidth.
Real-world utilization is much lower than theoretical due to attention O(n²) complexity, memory bandwidth contention, and kernel overhead:
- With tensor cores (consumer GPU): ~4%
- Without tensor cores: ~1.5%
- Datacenter (>500 TFLOPS): ~6%
Estimates are capped at realistic maximums: 500 tok/s (consumer), 1,000 tok/s (>100 TFLOPS), 3,000 tok/s (datacenter). Prefill speed determines Time to First Token (TTFT), the delay before generation starts.
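A sketch of the prefill estimate, assuming the usual ~2 FLOPs per parameter per token for a forward pass; the FLOP estimate and the example GPU numbers are assumptions, while the caps come from the text above:

```python
def prefill_tok_s(tflops: float, params_b: float, utilization: float, cap: float) -> float:
    """Approximate prompt-processing speed in tokens/second.

    Assumes ~2 FLOPs per parameter per token, scaled by the real-world
    utilization fraction, then capped at a realistic maximum.
    """
    flops_per_token = 2 * params_b * 1e9
    return min(tflops * 1e12 * utilization / flops_per_token, cap)

# Hypothetical consumer GPU with ~165 tensor-core TFLOPS running an 8B model
print(round(prefill_tok_s(165, 8, 0.04, 500)))  # ~412 tok/s, under the 500 tok/s cap
```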
KV Cache
The KV cache stores attention keys and values for every token in the context, so it grows with context length and with the number of concurrent users.
For enterprise deployments with multiple concurrent users, KV cache is multiplied by the number of users per replica. This is often the limiting factor for how many users a GPU can serve.
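A simplified sketch of the cache size (real sizes depend on the attention variant, see Limitations); the Llama-style layer and head counts in the example are assumptions:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, users: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: 2x for keys and values, FP16 by default,
    multiplied by context length and concurrent users per replica."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens * users / 1e9

# Assumed Llama-style 8B with GQA: 32 layers, 8 KV heads, head_dim 128
print(round(kv_cache_gb(32, 8, 128, context_tokens=16384), 2))            # ~2.15 GB, one user
print(round(kv_cache_gb(32, 8, 128, context_tokens=16384, users=8), 2))   # ~17.2 GB, 8 users
```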
Enterprise TCO Calculation
The Enterprise deployment planner compares three options: pay-per-use APIs, cloud GPU rental, and on-premise hardware.
On-Premise Running Costs
Break-Even Calculation
The interactive break-even chart lets you drag a time slider to visualize when self-hosting becomes cheaper than API or cloud GPU rental.
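The computation behind the chart can be sketched as follows; every number in the example is a hypothetical placeholder (the planner uses real hardware, power, cloud, and API pricing), and the running-cost model here counts only electricity:

```python
def onprem_monthly(power_kw: float, hours_per_day: float, kwh_price: float) -> float:
    """Electricity-only running cost per month for self-hosted hardware."""
    return power_kw * hours_per_day * 30 * kwh_price

def break_even_months(hw_cost: float, onprem_month: float, alternative_month: float) -> float:
    """Months until cumulative on-prem spend drops below the alternative."""
    savings = alternative_month - onprem_month
    return float("inf") if savings <= 0 else hw_cost / savings

# Hypothetical: $4,000 of hardware, 0.45 kW at the wall, 8 h/day, $0.30/kWh,
# $600/month of equivalent API usage, $1.80/hour cloud GPU rental for the same hours
running = onprem_monthly(0.45, 8, 0.30)                           # ~$32/month
print(round(break_even_months(4000, running, 600), 1))            # ~7.0 months vs. the API
print(round(break_even_months(4000, running, 1.80 * 8 * 30), 1))  # ~10.0 months vs. cloud rental
```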
Multi-GPU & Tensor Parallelism
With NVLink
- Effective VRAM: 95%
- Bandwidth scaling: 90%
- Best for: datacenter GPUs (H100, A100)
Without NVLink (PCIe)
- Effective VRAM: 85%
- Bandwidth bonus: ~30% per GPU
- Best for: consumer GPUs (RTX 3090, 4090)
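A sketch of how the two modes are applied; the constants are the ones listed above, but reading the PCIe "~30% per GPU" as each additional card contributing ~30% of one card's bandwidth is an interpretation, not a quote from our code:

```python
def multi_gpu_effective(vram_gb: float, bandwidth_gb_s: float, n_gpus: int, nvlink: bool):
    """Effective pooled VRAM and aggregate bandwidth under tensor parallelism."""
    if nvlink:
        eff_vram = vram_gb * n_gpus * 0.95       # 95% of pooled VRAM usable
        eff_bw = bandwidth_gb_s * n_gpus * 0.90  # 90% bandwidth scaling
    else:
        eff_vram = vram_gb * n_gpus * 0.85                   # 85% over PCIe
        eff_bw = bandwidth_gb_s * (1 + 0.30 * (n_gpus - 1))  # ~30% bonus per extra GPU
    return eff_vram, eff_bw

# Two RTX 4090s (24 GB, 1008 GB/s each) without NVLink
vram, bw = multi_gpu_effective(24, 1008, 2, nvlink=False)
print(round(vram, 1), round(bw))  # 40.8 GB effective, ~1310 GB/s
```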
Scoring System
We use a quality-first multiplicative model. Unlike additive systems (quality + speed + quant), multiplication ensures quality always dominates rankings; speed only penalizes truly slow configurations. A sketch of the full combination appears at the end of this section.
Why multiplicative? With additive scoring, a tiny fast model can outscore a much smarter but slower model. Multiplicative scoring means a fast 2B model can never beat a 9B just because it generates tokens faster — speed matters only when it becomes a usability problem.
1. Quality Score
Benchmarks are normalized to their random baseline (following HuggingFace Open LLM Leaderboard v2), so benchmarks with different chance levels stay comparable: GPQA 37% barely beats 25% random guessing, while IFEval 37% is genuinely meaningful.
| Benchmark | Random Baseline | Why |
|---|---|---|
| IFEval | 0% | Generative (no correct guess) |
| MMLU-PRO | 10% | 10-choice multiple choice |
| BBH | 25% | ~25% avg across subtasks |
| GPQA | 25% | 4-choice multiple choice |
| MUSR | 30% | ~30% avg across subtasks |
| MATH | 0% | Generative exact match |
| HumanEval | 0% | Generative code completion |
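In code, the normalization is the standard rescaling used by the leaderboard, sketched here:

```python
def normalize_benchmark(raw_score: float, random_baseline: float) -> float:
    """Rescale so random-chance performance maps to 0 and a perfect score to 100."""
    normalized = (raw_score - random_baseline) / (100 - random_baseline) * 100
    return max(0.0, normalized)

print(round(normalize_benchmark(37, 25), 1))  # GPQA 37% -> 16.0 (barely above chance)
print(round(normalize_benchmark(37, 0), 1))   # IFEval 37% -> 37.0 (genuinely meaningful)
```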
2. Arena Elo Blend
For models with LMSYS Chatbot Arena ratings (6M+ human preference votes), we blend 70% benchmarks + 30% Arena Elo. When benchmark coverage is poor, Arena weight increases up to 50% to rescue models with strong real-world quality but incomplete benchmark data.
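As an illustration only: the Elo-to-score mapping and the exact coverage-driven weight shift below are placeholder assumptions, not our production constants:

```python
def blended_quality(benchmark_score: float, arena_elo: float | None,
                    benchmark_coverage: float) -> float:
    """Blend normalized benchmark quality with Chatbot Arena Elo.

    benchmark_coverage: fraction of tracked benchmarks the model has (0-1).
    Arena weight starts at 30% and grows toward 50% as coverage drops.
    """
    if arena_elo is None:
        return benchmark_score
    elo_score = max(0.0, min(100.0, (arena_elo - 1000) / 4))  # assumed Elo -> 0-100 mapping
    arena_weight = 0.30 + 0.20 * (1.0 - benchmark_coverage)   # 0.30 .. 0.50
    return (1 - arena_weight) * benchmark_score + arena_weight * elo_score

print(round(blended_quality(62, 1280, benchmark_coverage=1.0), 1))  # 70/30 blend -> 64.4
```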
3. Speed Viability
Speed is a non-linear multiplier (0 → 1.0), not a competing score. Once speed is comfortable, more speed barely matters:
| Speed | Interactive | Batch | Feel |
|---|---|---|---|
| 3 tok/s | 0.45 | 0.60 | Painful |
| 8 tok/s | 0.65 | 0.76 | Slow but usable |
| 20 tok/s | 0.85 | 0.94 | Comfortable |
| 40 tok/s | 0.95 | 0.98 | Fast |
| 60+ tok/s | 0.99 | 1.00 | Instant |
Interactive = chat, creative, roleplay. Batch = coding, reasoning, agentic, embedding.
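The exact curve isn't reproduced here; a piecewise-linear interpolation over the anchor points in the table is a reasonable sketch (the floor below 3 tok/s is a simplification):

```python
def speed_viability(tok_s: float, interactive: bool = True) -> float:
    """Piecewise-linear interpolation over the anchor points in the table above."""
    anchors = [3, 8, 20, 40, 60]
    values = [0.45, 0.65, 0.85, 0.95, 0.99] if interactive else [0.60, 0.76, 0.94, 0.98, 1.00]
    if tok_s <= anchors[0]:
        return values[0]
    if tok_s >= anchors[-1]:
        return values[-1]
    for x0, x1, y0, y1 in zip(anchors, anchors[1:], values, values[1:]):
        if tok_s <= x1:
            return y0 + (y1 - y0) * (tok_s - x0) / (x1 - x0)
    return values[-1]

print(round(speed_viability(13), 2))  # ~0.73: between "slow but usable" and "comfortable"
```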
4. Quantization & Coverage
Quantization is applied as a natural quality multiplier (Q4 = 0.94×, Q8 = 0.995×, FP16 = 1.0×), not a separate competing component.
Coverage penalty: models with fewer benchmarks are penalized (for example, 2 of 4 benchmarks gives a 57.5% multiplier without Arena data, 70% with it). Arena Elo reduces the penalty because it independently validates quality.
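Putting the components together, a simplified sketch of the multiplicative ranking (the multipliers in the example come from the tables above; the real scorer has more detail):

```python
def final_score(quality: float, speed_multiplier: float,
                quant_factor: float, coverage_penalty: float) -> float:
    """Quality-first multiplicative score: quality dominates, the other
    factors can only scale it down."""
    return quality * speed_multiplier * quant_factor * coverage_penalty

# A smarter-but-slower 9B (~20 tok/s) vs. a fast 2B (~90 tok/s), both Q4, full coverage
print(round(final_score(68, 0.85, 0.94, 1.0), 1))  # ~54.3 (the 9B still wins)
print(round(final_score(45, 0.99, 0.94, 1.0), 1))  # ~41.9
```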
Data Sources
- Model benchmarks: Open LLM Leaderboard, EvalPlus, BigCodeBench (updated weekly)
- GPU specifications: Official NVIDIA, AMD, Apple, Intel spec sheets (121 GPUs, 68 CPUs)
- GPU prices: MSRP for current-gen GPUs. Discontinued GPUs show no price (outdated MSRP would be misleading)
- Cloud GPU pricing: RunPod, Vast.ai, Lambda, AWS, GCP, Azure (updated bi-monthly)
- API pricing: OpenAI, Anthropic, Google (manual verification recommended)
- VRAM formulas: Based on llama.cpp memory estimation, calibrated against real measurements
- KV cache scaling: vLLM/PagedAttention paper (Kwon et al., 2023)
- Tensor parallelism: Megatron-LM efficiency scaling (Shoeybi et al., 2020)
Data Validation
All automated data updates pass through validation guardrails (sketched after the list):
- GPU prices must be within the $80–$100K range
- Price swings >50% from the previous value are rejected
- Model benchmark scores must be 0–100
- Duplicate model IDs are detected and deduplicated
- A model-count guardrail prevents accidental data wipes
- All validation failures block the update; no bad data is committed
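As a sketch, the price guardrails amount to checks like these (the function name and shape are illustrative, not our actual pipeline code):

```python
def validate_gpu_price(new_price: float, previous_price: float | None) -> bool:
    """Guardrails for an automated GPU price update, using the thresholds listed above."""
    if not (80 <= new_price <= 100_000):
        return False
    if previous_price and abs(new_price - previous_price) / previous_price > 0.50:
        return False
    return True

print(validate_gpu_price(1999, 1599))  # True: within range, swing < 50%
print(validate_gpu_price(79, 1599))    # False: below the $80 floor
```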
Limitations
- KV cache formula is simplified; actual size depends on architecture (GQA, MQA, MLA)
- Speed estimates assume single-user inference; batched serving is different
- MoE models: we load all parameters but only measure active parameter speed
- Real performance varies ±20% by inference engine (Ollama vs vLLM vs llama.cpp)
- Prefill utilization (4%) is calibrated against RTX 4090; may differ for other hardware classes
- Multi-GPU efficiency differs between Find Models (PCIe model) and Build for Model / Enterprise (tensor parallelism model)
- API prices are updated bi-monthly and may become stale between updates