▸ METHODOLOGY

How the numbers work.

How we calculate performance estimates, costs, and recommendations.

Model Size Estimation

The memory footprint of a model depends on its parameter count and quantization level:

VRAM_MB = parameters_B × bits_per_weight / 8 × 1024 + overhead

Quantization levels:

Level    Bits/Weight   Quality   7B VRAM
Q4_K_M   4.5           94%       ~4.5 GB
Q6_K     6.5           97%       ~6.3 GB
Q8_0     8.0           99.5%     ~7.7 GB
FP16     16.0          100%      ~14.8 GB
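
As a TypeScript sketch of the formula above (the flat overhead allowance here is an illustrative placeholder, not the calculator's exact value):

  // Weight footprint estimate from parameter count and quantization level.
  // overheadMB is an assumed flat allowance for buffers, not the exact value we use.
  function estimateVramMB(paramsB: number, bitsPerWeight: number, overheadMB = 500): number {
    return paramsB * (bitsPerWeight / 8) * 1024 + overheadMB;
  }

  // 7B at Q4_K_M (4.5 bits/weight) lands near the ~4.5 GB shown above
  console.log(estimateVramMB(7, 4.5).toFixed(0) + " MB"); // ≈ 4532 MB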

Token Generation Speed

Token generation (decode) is memory-bandwidth bound:

tokens_per_sec = memory_bandwidth_GBps / model_size_GB × efficiency

The efficiency factor varies by GPU tier, calibrated against real llama.cpp benchmarks from XiongjieDai/GPU-Benchmarks and llama.cpp CUDA benchmarks:

GPU Tier                   Efficiency   Examples
Entry-level (<400 GB/s)    0.53         RTX 4060 Ti, RTX 3060
Mid-range (400-700 GB/s)   0.47         RTX 4070, RTX 5070, RTX 4080
High-end (700-1200 GB/s)   0.45         RTX 4090, RTX 3090, RTX 5080
Ultra (>1200 GB/s)         0.36         RTX 5090, RTX Pro 6000
Apple Silicon              0.40-0.70    M1-M4 (varies by chip tier)

Entry-level GPUs show higher efficiency because small models fit entirely in L2 cache. Datacenter GPUs show lower single-user efficiency because they are optimized for batched inference.

Other GPU vendors: AMD GPUs (Vulkan backend) achieve ~0.50 efficiency. Intel Arc GPUs achieve ~0.18 due to immature llama.cpp support.
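
A sketch of the decode estimate with the NVIDIA tiers above encoded as a lookup (the tier boundaries follow the table; the function itself is illustrative):

  // Decode is bandwidth-bound: bandwidth / model size, scaled by tier efficiency.
  function tierEfficiency(bandwidthGBps: number): number {
    if (bandwidthGBps < 400) return 0.53;   // entry-level
    if (bandwidthGBps < 700) return 0.47;   // mid-range
    if (bandwidthGBps <= 1200) return 0.45; // high-end
    return 0.36;                            // ultra
  }

  function decodeTokensPerSec(bandwidthGBps: number, modelSizeGB: number): number {
    return (bandwidthGBps / modelSizeGB) * tierEfficiency(bandwidthGBps);
  }

  // RTX 4090 (1,008 GB/s) with a ~4.5 GB 7B Q4 model
  console.log(decodeTokensPerSec(1008, 4.5).toFixed(0)); // ≈ 101 tok/s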

Context length scaling: Longer contexts slow down decode due to KV cache attention overhead. Our base efficiency is calibrated at 16K context. We apply a log-linear scaling factor:

context_factor = 1.0 - 0.22 × log2(context / 16384)

Context   Factor   RTX 4090, 8B Q4
4K        1.40     ~141 tok/s
16K       1.00     ~101 tok/s
32K       0.78     ~79 tok/s
65K       0.56     ~57 tok/s
131K      0.34     ~34 tok/s

Calibrated from Hardware Corner RTX 4090 benchmarks (Qwen3 8B Q4, March 2026). Factor clamped to [0.25, 1.40].
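
As a sketch, with the published clamp applied:

  // Context-length factor, calibrated at a 16K baseline, clamped to [0.25, 1.40].
  function contextFactor(contextTokens: number, baselineTokens = 16384): number {
    const factor = 1.0 - 0.22 * Math.log2(contextTokens / baselineTokens);
    return Math.min(1.4, Math.max(0.25, factor));
  }

  console.log(contextFactor(4096).toFixed(2));   // 1.40 (clamped from 1.44)
  console.log(contextFactor(32768).toFixed(2));  // 0.78
  console.log(contextFactor(131072).toFixed(2)); // 0.34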

Examples:

GPU        Bandwidth    70B Q4 (~35 GB)   7B Q4 (~4.5 GB)
RTX 4090   1,008 GB/s   ~13 tok/s         ~101 tok/s
RTX 5090   1,792 GB/s   ~18 tok/s         ~143 tok/s
RTX 3060   360 GB/s     does not fit      ~42 tok/s

Prefill Speed (Prompt Processing)

Processing the input prompt is compute-bound:

prefill_tok/s = (FP16_TFLOPS × 10^12 × utilization) / (params_B × 2 × 10^9)

Real-world utilization is much lower than theoretical due to attention O(n²) complexity, memory bandwidth contention, and kernel overhead:

  • With tensor cores (consumer GPU): ~4%
  • Without tensor cores: ~1.5%
  • Datacenter (>500 TFLOPS): ~6%

Prefill is capped at realistic maximums: 500 tok/s (consumer), 1,000 tok/s (>100 TFLOPS), 3,000 tok/s (datacenter). This determines Time to First Token (TTFT) — how long before generation starts.
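
A sketch of the prefill estimate; the utilization figure and caps come from the text above, while the TFLOPS figure in the example is an approximate assumption:

  // Prefill is compute-bound: theoretical FLOPs divided by ~2 FLOPs per parameter per token,
  // discounted by real-world utilization and capped at a realistic maximum.
  function prefillTokensPerSec(fp16Tflops: number, paramsB: number, utilization = 0.04): number {
    const raw = (fp16Tflops * 1e12 * utilization) / (paramsB * 2 * 1e9);
    const cap = fp16Tflops > 500 ? 3000 : fp16Tflops > 100 ? 1000 : 500;
    return Math.min(raw, cap);
  }

  // e.g. a consumer GPU with ~165 FP16 tensor TFLOPS running an 8B model
  console.log(prefillTokensPerSec(165, 8).toFixed(0)); // ≈ 413 tok/s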

KV Cache

The KV cache stores attention states and grows with context length and concurrent users:

kv_cache_per_user = 0.5 × √(params_B / 7) × extra_context_K

For enterprise deployments with multiple concurrent users, KV cache is multiplied by the number of users per replica. This is often the limiting factor for how many users a GPU can serve.
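
A sketch of the per-user estimate and the per-replica multiplication (the result is treated as GB here, which is an assumption about the formula's units):

  // Per-user KV cache from the formula above, multiplied across concurrent users.
  function kvCachePerUser(paramsB: number, extraContextK: number): number {
    return 0.5 * Math.sqrt(paramsB / 7) * extraContextK;
  }

  function kvCachePerReplica(paramsB: number, extraContextK: number, usersPerReplica: number): number {
    return kvCachePerUser(paramsB, extraContextK) * usersPerReplica;
  }

  // 70B model, 8K of extra context, 4 concurrent users per replica
  console.log(kvCachePerReplica(70, 8, 4).toFixed(1)); // ≈ 50.6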

Enterprise TCO Calculation

The Enterprise deployment planner compares three options: pay-per-use APIs, cloud GPU rental, and on-premise hardware.

On-Premise Running Costs

electricity = power_kW × hours/month × $0.12/kWh
cooling = electricity × 0.4 (PUE 1.4)
maintenance = hardware_cost × 8% / 12
network = $20 base + $50/replica
total_monthly = electricity + cooling + maintenance + network
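
A sketch of the monthly total (the 730 hours/month default is an assumption for an always-on deployment):

  // Monthly on-premise running cost in USD.
  function onPremMonthlyCost(powerKW: number, hardwareCost: number, replicas: number, hoursPerMonth = 730): number {
    const electricity = powerKW * hoursPerMonth * 0.12; // $0.12/kWh
    const cooling = electricity * 0.4;                  // PUE 1.4
    const maintenance = (hardwareCost * 0.08) / 12;     // 8% of hardware cost per year
    const network = 20 + 50 * replicas;                 // $20 base + $50/replica
    return electricity + cooling + maintenance + network;
  }

  // 1.5 kW draw, $30,000 of hardware, 2 replicas
  console.log(onPremMonthlyCost(1.5, 30000, 2).toFixed(0)); // ≈ $504/month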

Break-Even Calculation

break_even_months = hardware_cost / (cloud_monthly - onprem_monthly)

The interactive break-even chart lets you drag a time slider to visualize when self-hosting becomes cheaper than API or cloud GPU rental.
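
The underlying calculation is a one-liner; a sketch with a guard for the case where self-hosting never catches up:

  // Months until the hardware purchase pays for itself versus a recurring alternative.
  function breakEvenMonths(hardwareCost: number, cloudMonthly: number, onpremMonthly: number): number {
    const monthlySavings = cloudMonthly - onpremMonthly;
    return monthlySavings > 0 ? hardwareCost / monthlySavings : Infinity;
  }

  // $30,000 of hardware vs. $3,000/month cloud and $500/month running costs
  console.log(breakEvenMonths(30000, 3000, 500)); // 12 months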

Multi-GPU & Tensor Parallelism

With NVLink

  • Effective VRAM: 95%
  • Bandwidth scaling: 90%
  • Best for: datacenter GPUs (H100, A100)

Without NVLink (PCIe)

  • Effective VRAM: 85%
  • Bandwidth bonus: ~30% per GPU
  • Best for: consumer GPUs (RTX 3090, 4090)
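
A sketch of how the VRAM side of these two profiles could be applied (names are illustrative):

  // Usable pooled VRAM across GPUs, discounted by the interconnect profile.
  function effectiveVramGB(vramPerGpuGB: number, gpuCount: number, nvlink: boolean): number {
    const vramEfficiency = nvlink ? 0.95 : 0.85; // NVLink vs. PCIe profile above
    return vramPerGpuGB * gpuCount * vramEfficiency;
  }

  // 2× RTX 4090 over PCIe: ~40.8 GB usable out of 48 GB
  console.log(effectiveVramGB(24, 2, false).toFixed(1));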

Scoring System

We use a quality-first multiplicative model. Unlike additive systems (quality + speed + quant), multiplication ensures quality always dominates rankings — speed only penalizes truly slow configurations:

score = qualityScore × speedViability × quantAdjust × 1.4

Why multiplicative? With additive scoring, a tiny fast model can outscore a much smarter but slower model. Multiplicative scoring means a fast 2B model can never beat a 9B just because it generates tokens faster — speed matters only when it becomes a usability problem.
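
A sketch of the top-level score with hypothetical component values, showing why the larger model still wins (the sub-sections below describe how each component is derived):

  // Quality-first multiplicative score: speed and quant only scale quality, never outvote it.
  function modelScore(qualityScore: number, speedViability: number, quantAdjust: number): number {
    return qualityScore * speedViability * quantAdjust * 1.4;
  }

  // Hypothetical 9B (quality 62, comfortable speed) vs. 2B (quality 38, instant speed), both Q4:
  console.log(modelScore(62, 0.85, 0.94).toFixed(1)); // ≈ 69.4
  console.log(modelScore(38, 0.99, 0.94).toFixed(1)); // ≈ 49.5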

1. Quality Score

Benchmarks are normalized to random baseline (following HuggingFace Open LLM Leaderboard v2). This prevents benchmarks with different chance levels from being compared directly — GPQA 37% barely beats 25% random, but IFEval 37% is meaningful:

normalized = (raw − random_baseline) / (100 − random_baseline) × 100

Benchmark   Random Baseline   Why
IFEval      0%                Generative (no correct guess)
MMLU-PRO    10%               10-choice multiple choice
BBH         25%               ~25% avg across subtasks
GPQA        25%               4-choice multiple choice
MUSR        30%               ~30% avg across subtasks
MATH        0%                Generative exact match
HumanEval   0%                Generative code completion
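
As a sketch, reproducing the GPQA vs. IFEval example from above:

  // Normalize a raw benchmark score against its random-guess baseline.
  function normalizeScore(raw: number, randomBaseline: number): number {
    return ((raw - randomBaseline) / (100 - randomBaseline)) * 100;
  }

  console.log(normalizeScore(37, 25).toFixed(1)); // GPQA: 16.0 (barely above chance)
  console.log(normalizeScore(37, 0).toFixed(1));  // IFEval: 37.0 (a meaningful score)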

2. Arena Elo Blend

For models with LMSYS Chatbot Arena ratings (6M+ human preference votes), we blend 70% benchmarks + 30% Arena Elo. When benchmark coverage is poor, Arena weight increases up to 50% to rescue models with strong real-world quality but incomplete benchmark data.
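
A sketch of the blend; the exact ramp from 30% toward 50% Arena weight as coverage drops is an illustrative assumption, and the Arena rating is assumed to be pre-normalized onto the same 0-100 quality scale:

  // Blend benchmark quality with Arena Elo; arenaWeight grows from 0.30 toward 0.50
  // as benchmark coverage (fraction of benchmarks present) drops.
  function blendedQuality(benchmarkScore: number, arenaScore: number, benchmarkCoverage: number): number {
    const arenaWeight = Math.min(0.5, 0.3 + (1 - benchmarkCoverage) * 0.2);
    return benchmarkScore * (1 - arenaWeight) + arenaScore * arenaWeight;
  }

  console.log(blendedQuality(60, 70, 1.0).toFixed(0)); // 63 (70/30 blend)
  console.log(blendedQuality(60, 70, 0.5).toFixed(0)); // 64 (Arena weight rises to 0.40)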

3. Speed Viability

Speed is a non-linear multiplier (0 → 1.0), not a competing score. Once speed is comfortable, more speed barely matters:

Speed       Interactive   Batch   Feel
3 tok/s     0.45          0.60    Painful
8 tok/s     0.65          0.76    Slow but usable
20 tok/s    0.85          0.94    Comfortable
40 tok/s    0.95          0.98    Fast
60+ tok/s   0.99          1.00    Instant

Interactive = chat, creative, roleplay. Batch = coding, reasoning, agentic, embedding.
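
A sketch of one way to turn the anchor points above into a continuous multiplier; the linear interpolation between anchors is an assumption:

  // Interactive speed-viability curve, interpolated between the published anchors.
  const interactiveAnchors: Array<[number, number]> = [
    [3, 0.45], [8, 0.65], [20, 0.85], [40, 0.95], [60, 0.99],
  ];

  function speedViability(tokensPerSec: number, anchors = interactiveAnchors): number {
    if (tokensPerSec <= anchors[0][0]) return anchors[0][1];
    for (let i = 1; i < anchors.length; i++) {
      const [x1, y1] = anchors[i - 1];
      const [x2, y2] = anchors[i];
      if (tokensPerSec <= x2) return y1 + ((tokensPerSec - x1) / (x2 - x1)) * (y2 - y1);
    }
    return anchors[anchors.length - 1][1]; // 60+ tok/s
  }

  console.log(speedViability(20).toFixed(2)); // 0.85
  console.log(speedViability(5).toFixed(2));  // 0.53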

4. Quantization & Coverage

Quantization is applied as a natural quality multiplier (Q4 = 0.94×, Q8 = 0.995×, FP16 = 1.0×), not a separate competing component.

Coverage penalty: Models covering fewer benchmarks are penalized (e.g., a model with 2 of 4 benchmarks is scored at 57.5% without Arena data, 70% with it). Arena Elo reduces the penalty because it independently validates quality.
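
A sketch of the quantization multipliers as a lookup (keys shown only for the levels listed in this document):

  // Quantization quality multipliers applied inside the score.
  const quantAdjust: Record<string, number> = {
    Q4_K_M: 0.94,
    Q8_0: 0.995,
    FP16: 1.0,
  };

  console.log(quantAdjust["Q4_K_M"]); // 0.94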

Data Sources

Data Validation

All automated data updates pass through validation guardrails:

  • GPU prices must be within $80–$100K range
  • Price swings >50% from previous value are rejected
  • Model benchmark scores must be 0–100
  • Duplicate model IDs are detected and deduplicated
  • Model count guardrail prevents accidental data wipe
  • All validation failures block the update — no bad data is committed
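
As an example of the shape these guardrails take, a sketch of the GPU price check (function name and structure are illustrative):

  // Reject a GPU price update outside the sane range or swinging >50% from the stored value.
  function validateGpuPrice(newPrice: number, previousPrice?: number): boolean {
    if (newPrice < 80 || newPrice > 100_000) return false;
    if (previousPrice !== undefined && Math.abs(newPrice - previousPrice) / previousPrice > 0.5) return false;
    return true;
  }

  console.log(validateGpuPrice(1999, 1599)); // true
  console.log(validateGpuPrice(50, 1599));   // false (below the $80 floor)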

Limitations

  • KV cache formula is simplified; actual size depends on architecture (GQA, MQA, MLA)
  • Speed estimates assume single-user inference; batched serving behaves differently
  • MoE models: we load all parameters but only measure active-parameter speed
  • Real performance varies ±20% by inference engine (Ollama vs vLLM vs llama.cpp)
  • Prefill utilization (4%) is calibrated against RTX 4090; may differ for other hardware classes
  • Multi-GPU efficiency differs between Find Models (PCIe model) and Build for Model / Enterprise (tensor parallelism model)
  • API prices are updated bi-monthly and may become stale between updates