Methodology
How we calculate performance estimates, costs, and recommendations.
Model Size Estimation
The memory footprint of a model depends on its parameter count and quantization level:
Quantization levels:
| Level | Bits/Weight | Quality | 7B VRAM |
|---|---|---|---|
| Q4_K_M | 4.5 | 94% | ~4.5 GB |
| Q6_K | 6.5 | 97% | ~6.3 GB |
| Q8_0 | 8.0 | 99.5% | ~7.7 GB |
| FP16 | 16.0 | 100% | ~14.8 GB |
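The table above follows from a simple size estimate: parameter count times bits per weight, divided by 8, plus runtime overhead. A minimal sketch — the ~10% overhead factor is an assumption for illustration, not a value taken from this page:

```python
def model_vram_gb(params_b: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Estimate VRAM for model weights: params x bits/8, plus runtime overhead."""
    weights_gb = params_b * bits_per_weight / 8  # raw weight bytes in GB
    return weights_gb * overhead

# A 7B model at each quantization level from the table
for level, bits in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.0), ("FP16", 16.0)]:
    print(f"{level}: ~{model_vram_gb(7, bits):.1f} GB")
```

The raw weight size for 7B at Q4_K_M is about 3.9 GB; the gap to the ~4.5 GB in the table is runtime overhead (activations, buffers), which the fixed factor here only approximates.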
Token Generation Speed
Token generation (decode) is memory-bandwidth bound:
Examples:
| GPU | Bandwidth | 70B Q4 (35GB) | 7B Q4 (3.5GB) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~29 tok/s | ~288 tok/s |
| RTX 5090 | 1,792 GB/s | ~51 tok/s | ~512 tok/s |
| H100 80GB | 3,350 GB/s | ~96 tok/s | ~957 tok/s |
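The figures above are consistent with the bandwidth-bound model: every generated token streams the full set of weights through the GPU once, so decode speed is roughly bandwidth divided by model size. A sketch of that calculation:

```python
def decode_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Decode is memory-bandwidth bound: each token reads every weight once,
    so tok/s ~= memory bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

print(round(decode_tok_per_s(1008, 35)))   # RTX 4090, 70B Q4 -> 29
print(round(decode_tok_per_s(3350, 3.5)))  # H100, 7B Q4 -> 957
```

This is an upper bound for single-user inference; real engines lose some throughput to kernel overhead and the KV cache reads discussed below.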
Prefill Speed (Prompt Processing)
Processing the input prompt is compute-bound:
Real-world utilization is far below the theoretical peak due to O(n²) attention complexity, memory-bandwidth contention, and kernel launch overhead:
- With tensor cores (consumer GPU): ~4%
- Without tensor cores: ~1.5%
- Datacenter (>500 TFLOPS): ~6%
Prefill speed is capped at realistic maximums: 500 tok/s (consumer), 1,000 tok/s (GPUs above 100 TFLOPS), 3,000 tok/s (datacenter). Prefill time determines Time to First Token (TTFT) — how long before generation starts.
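Compute-bound prefill can be sketched as peak TFLOPS, scaled by the utilization factor, divided by the ~2 FLOPs each parameter costs per token. The TFLOPS figure below is illustrative, not a quoted spec:

```python
def prefill_tok_per_s(peak_tflops: float, params_b: float,
                      utilization: float, cap: float) -> float:
    """Prefill is compute-bound: ~2 FLOPs per parameter per token,
    achieved at a small fraction of theoretical peak throughput."""
    raw = peak_tflops * 1e12 * utilization / (2 * params_b * 1e9)
    return min(raw, cap)  # clamp to the realistic per-class maximum

def ttft_seconds(prompt_tokens: int, prefill_speed: float) -> float:
    """Time to First Token: how long prompt processing takes."""
    return prompt_tokens / prefill_speed

# Consumer GPU (~165 dense FP16 TFLOPS assumed), 70B model, 4% utilization
speed = prefill_tok_per_s(165, 70, 0.04, cap=500)
print(f"~{speed:.0f} tok/s prefill, TTFT {ttft_seconds(2048, speed):.1f}s for a 2K prompt")
```

The cap matters mostly for small models on fast hardware, where the raw estimate would otherwise exceed what any engine actually sustains.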
KV Cache
The KV cache stores attention states and grows with context length and the number of concurrent users.
For enterprise deployments with multiple concurrent users, KV cache is multiplied by the number of users per replica. This is often the limiting factor for how many users a GPU can serve.
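The per-user growth can be sketched with the standard KV-cache size formula: two tensors (K and V) per layer per token, sized by the number of KV heads and head dimension. The model geometry below is an assumed Llama-style example, not data from this page:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, users: int = 1,
                bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer per token, FP16 by default,
    multiplied by concurrent users for enterprise serving."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * users / 1e9

# Assumed 8B-class geometry: 32 layers, 8 KV heads (GQA), head_dim 128
print(f"{kv_cache_gb(32, 8, 128, 8192):.2f} GB per user at 8K context")
```

Because the total scales linearly with users, a GPU's leftover VRAM after weights divided by this per-user figure gives the rough user ceiling per replica.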
Enterprise TCO Calculation
The Enterprise deployment planner compares three options: pay-per-use APIs, cloud GPU rental, and on-premise hardware.
On-Premise Running Costs
Break-Even Calculation
The interactive break-even chart lets you drag a time slider to visualize when self-hosting becomes cheaper than API or cloud GPU rental.
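The break-even point reduces to hardware cost divided by monthly savings. A minimal sketch — the dollar figures are hypothetical, chosen only to show the shape of the calculation:

```python
def break_even_months(hardware_cost: float, monthly_running_cost: float,
                      monthly_alternative_cost: float) -> float:
    """Months until cumulative self-hosting cost (hardware + running)
    drops below the cumulative API / cloud-rental cost."""
    monthly_savings = monthly_alternative_cost - monthly_running_cost
    if monthly_savings <= 0:
        return float("inf")  # self-hosting never pays off
    return hardware_cost / monthly_savings

# Hypothetical: $30K server, $400/mo power + ops, vs $2,500/mo API spend
print(f"{break_even_months(30_000, 400, 2_500):.1f} months")
```

If running costs exceed the alternative, there is no break-even point, which is why the function returns infinity rather than a negative month count.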
Multi-GPU & Tensor Parallelism
With NVLink
- Effective VRAM: 95%
- Bandwidth scaling: 90%
- Best for: datacenter GPUs (H100, A100)
Without NVLink (PCIe)
- Effective VRAM: 85%
- Bandwidth gain: ~30% per additional GPU
- Best for: consumer GPUs (RTX 3090, 4090)
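The two efficiency profiles above can be sketched as one function. Reading "~30% per additional GPU" as a linear bandwidth gain over a single card is an interpretation on my part, not stated explicitly on this page:

```python
def multi_gpu_effective(vram_gb: float, bandwidth_gb_s: float,
                        num_gpus: int, nvlink: bool) -> tuple[float, float]:
    """Effective pooled VRAM and bandwidth under tensor parallelism,
    using the NVLink / PCIe efficiency factors above."""
    if nvlink:
        eff_vram = vram_gb * num_gpus * 0.95   # 95% of pooled VRAM usable
        eff_bw = bandwidth_gb_s * num_gpus * 0.90  # 90% bandwidth scaling
    else:
        eff_vram = vram_gb * num_gpus * 0.85   # PCIe: 85% of pooled VRAM
        eff_bw = bandwidth_gb_s * (1 + 0.30 * (num_gpus - 1))  # ~30%/extra GPU
    return eff_vram, eff_bw

# 2x RTX 4090 (24 GB, 1,008 GB/s each) over PCIe
vram, bw = multi_gpu_effective(24, 1008, 2, nvlink=False)
print(f"~{vram:.0f} GB usable, ~{bw:.0f} GB/s effective")
```

The asymmetry is the point: PCIe pools VRAM almost as well as NVLink but scales bandwidth far worse, which is why decode speed gains from adding consumer GPUs are modest.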
Scoring System
Each model score combines three factors: benchmark quality, speed, and quantization fidelity — weighted by use case:
| Use Case | Quality | Speed | Quant |
|---|---|---|---|
| Chat | 45% | 30% | 25% |
| Coding | 55% | 25% | 20% |
| Reasoning | 60% | 15% | 25% |
| Creative | 40% | 35% | 25% |
| Vision | 50% | 25% | 25% |
| Roleplay | 35% | 35% | 30% |
| Embedding | 70% | 10% | 20% |
Quantization score is mapped linearly from quality in [0.88–1.0] to [0–100]: Q4 = 50, Q6 = 75, Q8 = 96, FP16 = 100.
Coverage penalty: Models with fewer benchmarks get a confidence penalty (1 of 4 benchmarks = 77.5% of potential score, 4 of 4 = 100%).
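The scoring pipeline above can be sketched end to end. The linear coverage ramp (70% floor plus 30% scaled by benchmark count) is inferred from the two data points given (1 of 4 → 77.5%, 4 of 4 → 100%) and is an assumption about the exact formula:

```python
USE_CASE_WEIGHTS = {  # (quality, speed, quant) weights from the table above
    "chat": (0.45, 0.30, 0.25),
    "coding": (0.55, 0.25, 0.20),
    "reasoning": (0.60, 0.15, 0.25),
}

def quant_score(quality: float) -> float:
    """Linear remap of quantization quality [0.88, 1.0] -> [0, 100]."""
    return max(0.0, min(100.0, (quality - 0.88) / 0.12 * 100))

def coverage_factor(benchmarks_present: int, total: int = 4) -> float:
    """Assumed linear confidence ramp: 1 of 4 -> 77.5%, 4 of 4 -> 100%."""
    return 0.70 + 0.30 * (benchmarks_present / total)

def model_score(use_case: str, quality: float, speed: float,
                quant_quality: float, benchmarks_present: int) -> float:
    wq, ws, wz = USE_CASE_WEIGHTS[use_case]
    raw = wq * quality + ws * speed + wz * quant_score(quant_quality)
    return raw * coverage_factor(benchmarks_present)

print(round(quant_score(0.94)))   # Q4_K_M -> 50
print(round(quant_score(0.995)))  # Q8_0 -> 96
```

Note that the remap reproduces the published anchor points exactly (0.94 → 50, 0.97 → 75), which is what motivates reading "remapped" as linear.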
Data Sources
- Model benchmarks: Open LLM Leaderboard, EvalPlus, BigCodeBench (updated weekly)
- GPU specifications: Official NVIDIA, AMD, Apple, Intel spec sheets (121 GPUs, 68 CPUs)
- GPU prices: MSRP for current-gen GPUs. Discontinued GPUs show no price (outdated MSRP would be misleading)
- Cloud GPU pricing: RunPod, Vast.ai, Lambda, AWS, GCP, Azure (updated bi-monthly)
- API pricing: OpenAI, Anthropic, Google (manual verification recommended)
- VRAM formulas: Based on llama.cpp memory estimation, calibrated against real measurements
- KV cache scaling: vLLM/PagedAttention paper (Kwon et al., 2023)
- Tensor parallelism: Megatron-LM efficiency scaling (Shoeybi et al., 2020)
Data Validation
All automated data updates pass through validation guardrails:
- GPU prices must be within the $80–$100K range
- Price swings >50% from the previous value are rejected
- Model benchmark scores must be 0–100
- Duplicate model IDs are detected and deduplicated
- A model-count guardrail prevents accidental data wipes
- Any validation failure blocks the update — no bad data is committed
Limitations
- The KV cache formula is simplified; actual size depends on architecture (GQA, MQA, MLA)
- Speed estimates assume single-user inference; batched serving behaves differently
- MoE models: VRAM is sized for all parameters, but speed is estimated from the active parameters only
- Real performance varies ±20% by inference engine (Ollama vs vLLM vs llama.cpp)
- Prefill utilization (4%) is calibrated against the RTX 4090 and may differ for other hardware classes
- Multi-GPU efficiency differs between Find Models (PCIe model) and Build for Model / Enterprise (tensor parallelism model)
- API prices are updated bi-monthly and may become stale between updates