How the numbers work.
How we calculate performance estimates, costs, and recommendations.
Model Size Estimation
The memory footprint of a model depends on its parameter count and quantization level; a sketch of the estimate follows the table below.
Quantization levels:
| Level | Bits/Weight | Quality | 7B VRAM |
|---|---|---|---|
| Q4_K_M | 4.5 | 94% | ~4.5 GB |
| Q6_K | 6.5 | 97% | ~6.3 GB |
| Q8_0 | 8.0 | 99.5% | ~7.7 GB |
| FP16 | 16.0 | 100% | ~14.8 GB |
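As a rough sketch, weight memory is parameters × bits-per-weight ÷ 8, plus runtime overhead for embeddings and buffers; the flat ~10% overhead below is an assumption for illustration, not the exact term we use:

```python
def model_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.10) -> float:
    """Rough VRAM needed for model weights, in GB.

    params_b: parameter count in billions (e.g. 7 for a 7B model)
    bits_per_weight: effective bits per weight for the quantization level
    overhead: assumed ~10% extra for embeddings, buffers, etc.
    """
    return params_b * bits_per_weight / 8 * overhead

print(round(model_vram_gb(7, 6.5), 1))  # ~6.3 GB for 7B Q6_K
print(round(model_vram_gb(7, 8.0), 1))  # ~7.7 GB for 7B Q8_0
```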
Token Generation Speed
Token generation (decode) is memory-bandwidth bound: each new token requires streaming essentially all of the model weights from memory, so speed scales with memory bandwidth divided by model size.
The efficiency factor varies by GPU tier, calibrated against real llama.cpp benchmarks from XiongjieDai/GPU-Benchmarks and llama.cpp CUDA benchmarks:
| GPU Tier | Efficiency | Examples |
|---|---|---|
| Entry-level (<400 GB/s) | 0.53 | RTX 4060 Ti, RTX 3060 |
| Mid-range (400-700 GB/s) | 0.47 | RTX 4070, RTX 5070, RTX 4080 |
| High-end (700-1200 GB/s) | 0.45 | RTX 4090, RTX 3090, RTX 5080 |
| Ultra (>1200 GB/s) | 0.36 | RTX 5090, RTX Pro 6000 |
| Apple Silicon | 0.40-0.70 | M1-M4 (varies by chip tier) |
Entry-level GPUs show higher efficiency because small models fit entirely in L2 cache. Datacenter GPUs show lower single-user efficiency because they are optimized for batched inference.
Other GPU vendors: AMD GPUs (Vulkan backend) achieve ~0.50 efficiency. Intel Arc GPUs achieve ~0.18 due to immature llama.cpp support.
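In sketch form, decode speed is usable bandwidth divided by the bytes read per generated token, which we approximate here with the quantized weight size (real kernels also read the KV cache, so this is an approximation):

```python
def decode_tok_s(bandwidth_gb_s: float, model_gb: float, efficiency: float) -> float:
    """Approximate single-user decode speed in tokens/second.

    Each generated token streams (roughly) all model weights from VRAM,
    so tok/s is close to usable bandwidth divided by model size.
    """
    return bandwidth_gb_s / model_gb * efficiency

# RTX 4090 (1008 GB/s) running an 8B Q4 model (~4.5 GB of weights), high-end tier 0.45
print(round(decode_tok_s(1008, 4.5, 0.45)))  # ~101 tok/s at the 16K-context baseline
```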
Context length scaling: Longer contexts slow down decode due to KV cache attention overhead. Our base efficiency is calibrated at 16K context. We apply a log-linear scaling factor:
| Context | Factor | RTX 4090 8B Q4 |
|---|---|---|
| 4K | 1.40 | ~141 tok/s |
| 16K | 1.00 | ~101 tok/s |
| 32K | 0.78 | ~79 tok/s |
| 65K | 0.56 | ~57 tok/s |
| 131K | 0.34 | ~34 tok/s |
Calibrated from Hardware Corner RTX 4090 benchmarks (Qwen3 8B Q4, March 2026). Factor clamped to [0.25, 1.40].
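The published factors are consistent with a slope of roughly 0.22 per doubling of context relative to the 16K baseline; the sketch below reproduces the table, but the exact constant is inferred from it rather than quoted from our code:

```python
import math

def context_factor(context_tokens: int, baseline: int = 16384, slope: float = 0.22) -> float:
    """Log-linear decode slowdown vs. the 16K baseline, clamped to [0.25, 1.40]."""
    factor = 1.0 - slope * math.log2(context_tokens / baseline)
    return max(0.25, min(1.40, factor))

for ctx in (4096, 16384, 32768, 65536, 131072):
    print(ctx, round(context_factor(ctx), 2))  # 1.4, 1.0, 0.78, 0.56, 0.34
```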
Examples:
| GPU | Bandwidth | 70B Q4 (35GB) | 7B Q4 (3.5GB) |
|---|---|---|---|
| RTX 4090 | 1,008 GB/s | ~13 tok/s | ~101 tok/s |
| RTX 5090 | 1,792 GB/s | ~18 tok/s | ~143 tok/s |
| RTX 3060 | 360 GB/s | — | ~42 tok/s |
Prefill Speed (Prompt Processing)
Processing the input prompt (prefill) is compute-bound: prompt tokens are processed in parallel, so throughput depends on the GPU's compute (TFLOPS) rather than its memory bandwidth.
Real-world utilization is much lower than theoretical due to attention O(n²) complexity, memory bandwidth contention, and kernel overhead:
- With tensor cores (consumer GPU): ~4%
- Without tensor cores: ~1.5%
- Datacenter (>500 TFLOPS): ~6%
Estimates are capped at realistic maximums: 500 tok/s (consumer), 1,000 tok/s (>100 TFLOPS), 3,000 tok/s (datacenter). Prefill speed determines Time to First Token (TTFT), the delay before generation starts.
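A sketch of the prefill estimate, assuming the usual ~2 FLOPs per parameter per token for a forward pass; the FLOP estimate and the example GPU numbers are assumptions, while the caps come from the text above:

```python
def prefill_tok_s(tflops: float, params_b: float, utilization: float, cap: float) -> float:
    """Approximate prompt-processing speed in tokens/second.

    Assumes ~2 FLOPs per parameter per token, scaled by the real-world
    utilization fraction, then capped at a realistic maximum.
    """
    flops_per_token = 2 * params_b * 1e9
    return min(tflops * 1e12 * utilization / flops_per_token, cap)

# Hypothetical consumer GPU with ~165 tensor-core TFLOPS running an 8B model
print(round(prefill_tok_s(165, 8, 0.04, 500)))  # ~412 tok/s, under the 500 tok/s cap
```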
KV Cache
The KV cache stores attention keys and values for every token in the context, so it grows with context length and with the number of concurrent users.
For enterprise deployments with multiple concurrent users, KV cache is multiplied by the number of users per replica. This is often the limiting factor for how many users a GPU can serve.
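A simplified sketch of the cache size (real sizes depend on the attention variant, see Limitations); the Llama-style layer and head counts in the example are assumptions:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, users: int = 1, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: 2x for keys and values, FP16 by default,
    multiplied by context length and concurrent users per replica."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens * users / 1e9

# Assumed Llama-style 8B with GQA: 32 layers, 8 KV heads, head_dim 128
print(round(kv_cache_gb(32, 8, 128, context_tokens=16384), 2))            # ~2.15 GB, one user
print(round(kv_cache_gb(32, 8, 128, context_tokens=16384, users=8), 2))   # ~17.2 GB, 8 users
```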
Enterprise TCO Calculation
The Enterprise deployment planner compares three options: pay-per-use APIs, cloud GPU rental, and on-premise hardware.
On-Premise Running Costs
Break-Even Calculation
The interactive break-even chart lets you drag a time slider to visualize when self-hosting becomes cheaper than API or cloud GPU rental.
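The computation behind the chart can be sketched as follows; every number in the example is a hypothetical placeholder (the planner uses real hardware, power, cloud, and API pricing), and the running-cost model here counts only electricity:

```python
def onprem_monthly(power_kw: float, hours_per_day: float, kwh_price: float) -> float:
    """Electricity-only running cost per month for self-hosted hardware."""
    return power_kw * hours_per_day * 30 * kwh_price

def break_even_months(hw_cost: float, onprem_month: float, alternative_month: float) -> float:
    """Months until cumulative on-prem spend drops below the alternative."""
    savings = alternative_month - onprem_month
    return float("inf") if savings <= 0 else hw_cost / savings

# Hypothetical: $4,000 of hardware, 0.45 kW at the wall, 8 h/day, $0.30/kWh,
# $600/month of equivalent API usage, $1.80/hour cloud GPU rental for the same hours
running = onprem_monthly(0.45, 8, 0.30)                           # ~$32/month
print(round(break_even_months(4000, running, 600), 1))            # ~7.0 months vs. the API
print(round(break_even_months(4000, running, 1.80 * 8 * 30), 1))  # ~10.0 months vs. cloud rental
```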
Multi-GPU & Tensor Parallelism
With NVLink
- Effective VRAM: 95%
- Bandwidth scaling: 90%
- Best for: datacenter GPUs (H100, A100)
Without NVLink (PCIe)
- Effective VRAM: 85%
- Bandwidth bonus: ~30% per GPU
- Best for: consumer GPUs (RTX 3090, 4090)
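A sketch of how the two modes are applied; the constants are the ones listed above, but reading the PCIe "~30% per GPU" as each additional card contributing ~30% of one card's bandwidth is an interpretation, not a quote from our code:

```python
def multi_gpu_effective(vram_gb: float, bandwidth_gb_s: float, n_gpus: int, nvlink: bool):
    """Effective pooled VRAM and aggregate bandwidth under tensor parallelism."""
    if nvlink:
        eff_vram = vram_gb * n_gpus * 0.95       # 95% of pooled VRAM usable
        eff_bw = bandwidth_gb_s * n_gpus * 0.90  # 90% bandwidth scaling
    else:
        eff_vram = vram_gb * n_gpus * 0.85                   # 85% over PCIe
        eff_bw = bandwidth_gb_s * (1 + 0.30 * (n_gpus - 1))  # ~30% bonus per extra GPU
    return eff_vram, eff_bw

# Two RTX 4090s (24 GB, 1008 GB/s each) without NVLink
vram, bw = multi_gpu_effective(24, 1008, 2, nvlink=False)
print(round(vram, 1), round(bw))  # 40.8 GB effective, ~1310 GB/s
```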
Scoring System
We use a quality-first multiplicative model. Unlike additive systems (quality + speed + quant), multiplication ensures quality always dominates rankings; speed only penalizes truly slow configurations. A sketch of the full combination appears at the end of this section.
Why multiplicative? With additive scoring, a tiny fast model can outscore a much smarter but slower model. Multiplicative scoring means a fast 2B model can never beat a 9B just because it generates tokens faster — speed matters only when it becomes a usability problem.
1. Quality Score
Benchmarks are normalized to their random baseline (following HuggingFace Open LLM Leaderboard v2), so benchmarks with different chance levels stay comparable: GPQA 37% barely beats 25% random guessing, while IFEval 37% is genuinely meaningful.
| Benchmark | Random Baseline | Why |
|---|---|---|
| IFEval | 0% | Generative (no correct guess) |
| MMLU-PRO | 10% | 10-choice multiple choice |
| BBH | 25% | ~25% avg across subtasks |
| GPQA | 25% | 4-choice multiple choice |
| MUSR | 30% | ~30% avg across subtasks |
| MATH | 0% | Generative exact match |
| HumanEval | 0% | Generative code completion |
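In code, the normalization is the standard rescaling used by the leaderboard, sketched here:

```python
def normalize_benchmark(raw_score: float, random_baseline: float) -> float:
    """Rescale so random-chance performance maps to 0 and a perfect score to 100."""
    normalized = (raw_score - random_baseline) / (100 - random_baseline) * 100
    return max(0.0, normalized)

print(round(normalize_benchmark(37, 25), 1))  # GPQA 37% -> 16.0 (barely above chance)
print(round(normalize_benchmark(37, 0), 1))   # IFEval 37% -> 37.0 (genuinely meaningful)
```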
2. Arena Elo Blend
For models with LMSYS Chatbot Arena ratings (6M+ human preference votes), we blend 70% benchmarks + 30% Arena Elo. When benchmark coverage is poor, Arena weight increases up to 50% to rescue models with strong real-world quality but incomplete benchmark data.
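As an illustration only: the Elo-to-score mapping and the exact coverage-driven weight shift below are placeholder assumptions, not our production constants:

```python
def blended_quality(benchmark_score: float, arena_elo: float | None,
                    benchmark_coverage: float) -> float:
    """Blend normalized benchmark quality with Chatbot Arena Elo.

    benchmark_coverage: fraction of tracked benchmarks the model has (0-1).
    Arena weight starts at 30% and grows toward 50% as coverage drops.
    """
    if arena_elo is None:
        return benchmark_score
    elo_score = max(0.0, min(100.0, (arena_elo - 1000) / 4))  # assumed Elo -> 0-100 mapping
    arena_weight = 0.30 + 0.20 * (1.0 - benchmark_coverage)   # 0.30 .. 0.50
    return (1 - arena_weight) * benchmark_score + arena_weight * elo_score

print(round(blended_quality(62, 1280, benchmark_coverage=1.0), 1))  # 70/30 blend -> 64.4
```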
3. Speed Viability
Speed is a non-linear multiplier (0 → 1.0), not a competing score. Once speed is comfortable, more speed barely matters:
| Speed | Interactive | Batch | Feel |
|---|---|---|---|
| 3 tok/s | 0.45 | 0.60 | Painful |
| 8 tok/s | 0.65 | 0.76 | Slow but usable |
| 20 tok/s | 0.85 | 0.94 | Comfortable |
| 40 tok/s | 0.95 | 0.98 | Fast |
| 60+ tok/s | 0.99 | 1.00 | Instant |
Interactive = chat, creative, roleplay. Batch = coding, reasoning, agentic, embedding.
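The exact curve isn't reproduced here; a piecewise-linear interpolation over the anchor points in the table is a reasonable sketch (the floor below 3 tok/s is a simplification):

```python
def speed_viability(tok_s: float, interactive: bool = True) -> float:
    """Piecewise-linear interpolation over the anchor points in the table above."""
    anchors = [3, 8, 20, 40, 60]
    values = [0.45, 0.65, 0.85, 0.95, 0.99] if interactive else [0.60, 0.76, 0.94, 0.98, 1.00]
    if tok_s <= anchors[0]:
        return values[0]
    if tok_s >= anchors[-1]:
        return values[-1]
    for x0, x1, y0, y1 in zip(anchors, anchors[1:], values, values[1:]):
        if tok_s <= x1:
            return y0 + (y1 - y0) * (tok_s - x0) / (x1 - x0)
    return values[-1]

print(round(speed_viability(13), 2))  # ~0.73: between "slow but usable" and "comfortable"
```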
4. Quantization & Coverage
Quantization is applied as a natural quality multiplier (Q4 = 0.94×, Q8 = 0.995×, FP16 = 1.0×), not a separate competing component.
Coverage penalty: models with fewer benchmarks are penalized (for example, 2 of 4 benchmarks gives a 57.5% multiplier without Arena data, 70% with it). Arena Elo reduces the penalty because it independently validates quality.
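Putting the components together, a simplified sketch of the multiplicative ranking (the multipliers in the example come from the tables above; the real scorer has more detail):

```python
def final_score(quality: float, speed_multiplier: float,
                quant_factor: float, coverage_penalty: float) -> float:
    """Quality-first multiplicative score: quality dominates, the other
    factors can only scale it down."""
    return quality * speed_multiplier * quant_factor * coverage_penalty

# A smarter-but-slower 9B (~20 tok/s) vs. a fast 2B (~90 tok/s), both Q4, full coverage
print(round(final_score(68, 0.85, 0.94, 1.0), 1))  # ~54.3 (the 9B still wins)
print(round(final_score(45, 0.99, 0.94, 1.0), 1))  # ~41.9
```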
Data Sources
- Model benchmarks: Open LLM Leaderboard, EvalPlus, BigCodeBench (updated weekly)
- GPU specifications: Official NVIDIA, AMD, Apple, Intel spec sheets (121 GPUs, 68 CPUs)
- GPU prices: MSRP for current-gen GPUs. Discontinued GPUs show no price (outdated MSRP would be misleading)
- Cloud GPU pricing: RunPod, Vast.ai, Lambda, AWS, GCP, Azure (updated bi-monthly)
- API pricing: OpenAI, Anthropic, Google (manual verification recommended)
- VRAM formulas: Based on llama.cpp memory estimation, calibrated against real measurements
- KV cache scaling: vLLM/PagedAttention paper (Kwon et al., 2023)
- Tensor parallelism: Megatron-LM efficiency scaling (Shoeybi et al., 2020)
Data Validation
All automated data updates pass through validation guardrails (sketched after the list):
- GPU prices must be within the $80–$100K range
- Price swings >50% from the previous value are rejected
- Model benchmark scores must be 0–100
- Duplicate model IDs are detected and deduplicated
- A model-count guardrail prevents accidental data wipes
- All validation failures block the update; no bad data is committed
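As a sketch, the price guardrails amount to checks like these (the function name and shape are illustrative, not our actual pipeline code):

```python
def validate_gpu_price(new_price: float, previous_price: float | None) -> bool:
    """Guardrails for an automated GPU price update, using the thresholds listed above."""
    if not (80 <= new_price <= 100_000):
        return False
    if previous_price and abs(new_price - previous_price) / previous_price > 0.50:
        return False
    return True

print(validate_gpu_price(1999, 1599))  # True: within range, swing < 50%
print(validate_gpu_price(79, 1599))    # False: below the $80 floor
```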
Limitations
- KV cache formula is simplified; actual size depends on architecture (GQA, MQA, MLA)
- Speed estimates assume single-user inference; batched serving is different
- MoE models: we load all parameters but only measure active parameter speed
- Real performance varies ±20% by inference engine (Ollama vs vLLM vs llama.cpp)
- Prefill utilization (4%) is calibrated against RTX 4090; may differ for other hardware classes
- Multi-GPU efficiency differs between Find Models (PCIe model) and Build for Model / Enterprise (tensor parallelism model)
- API prices are updated bi-monthly and may become stale between updates