Methodology

How we calculate performance estimates, costs, and recommendations.

Model Size Estimation

The memory footprint of a model depends on its parameter count and quantization level:

VRAM_MB = parameters_B × bits_per_weight / 8 × 1024 + overhead

Quantization levels:

| Level  | Bits/Weight | Quality | 7B VRAM  |
|--------|-------------|---------|----------|
| Q4_K_M | 4.5         | 94%     | ~4.5 GB  |
| Q6_K   | 6.5         | 97%     | ~6.3 GB  |
| Q8_0   | 8.0         | 99.5%   | ~7.7 GB  |
| FP16   | 16.0        | 100%    | ~14.8 GB |
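The formula and table above can be combined into a small estimator. The bits-per-weight values come from the table; the fixed ~500 MB overhead is an illustrative assumption, since the overhead term is not specified here.

```python
# Bits per weight for each quantization level, from the table above.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q6_K": 6.5, "Q8_0": 8.0, "FP16": 16.0}

def vram_mb(params_b: float, level: str, overhead_mb: float = 500.0) -> float:
    """Estimated VRAM in MB for a model with params_b billion parameters.

    overhead_mb is a placeholder value; the real overhead is not given here.
    """
    return params_b * BITS_PER_WEIGHT[level] / 8 * 1024 + overhead_mb

print(round(vram_mb(7, "Q4_K_M") / 1024, 1))  # ~4.4 GB, in line with the table
```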

Token Generation Speed

Token generation (decode) is memory-bandwidth bound:

tokens_per_sec = memory_bandwidth_GBps / model_size_GB

Examples:

| GPU       | Bandwidth  | 70B Q4 (35 GB) | 7B Q4 (3.5 GB) |
|-----------|------------|----------------|----------------|
| RTX 4090  | 1,008 GB/s | ~29 tok/s      | ~288 tok/s     |
| RTX 5090  | 1,792 GB/s | ~51 tok/s      | ~512 tok/s     |
| H100 80GB | 3,350 GB/s | ~96 tok/s      | ~957 tok/s     |
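The decode formula is a one-liner; a minimal sketch reproducing the table values:

```python
def decode_tok_per_sec(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Memory-bandwidth-bound decode: each token streams the whole model once."""
    return bandwidth_gbps / model_size_gb

print(round(decode_tok_per_sec(1008, 35)))  # ~29 tok/s: 70B Q4 on an RTX 4090
```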

Prefill Speed (Prompt Processing)

Processing the input prompt is compute-bound:

prefill_tok/s = (FP16_TFLOPS × 10^12 × utilization) / (params_B × 2 × 10^9)

Real-world utilization is much lower than theoretical due to attention O(n²) complexity, memory bandwidth contention, and kernel overhead:

  • With tensor cores (consumer GPU): ~4%
  • Without tensor cores: ~1.5%
  • Datacenter (>500 TFLOPS): ~6%

The result is capped at realistic maximums: 500 tok/s (consumer), 1,000 tok/s (>100 TFLOPS), 3,000 tok/s (datacenter). Prefill speed determines Time to First Token (TTFT): how long before generation starts.
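The prefill formula plus cap can be sketched as below; the 100 TFLOPS figure in the example is a hypothetical consumer GPU, not a spec from this page.

```python
def prefill_tok_per_sec(fp16_tflops: float, params_b: float,
                        utilization: float, cap: float) -> float:
    """Compute-bound prefill: available FLOPs divided by ~2 FLOPs per parameter
    per token, then capped at a realistic maximum for the hardware class."""
    raw = (fp16_tflops * 1e12 * utilization) / (params_b * 2 * 1e9)
    return min(raw, cap)

# Hypothetical 100 FP16 TFLOPS consumer GPU with tensor cores (~4% utilization),
# consumer cap of 500 tok/s, running a 7B model:
print(round(prefill_tok_per_sec(100, 7, 0.04, 500)))  # ~286 tok/s
```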

KV Cache

The KV cache stores attention states and grows with context length and concurrent users:

kv_cache_per_user = 0.5 × √(params_B / 7) × extra_context_K

For enterprise deployments with multiple concurrent users, KV cache is multiplied by the number of users per replica. This is often the limiting factor for how many users a GPU can serve.
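A minimal sketch of the KV-cache formula and the per-replica multiplication; the formula's output units are not stated above, so treating the result as GB is an assumption.

```python
import math

def kv_cache_per_user(params_b: float, extra_context_k: float) -> float:
    """Simplified KV cache per user (result assumed to be in GB)."""
    return 0.5 * math.sqrt(params_b / 7) * extra_context_k

def kv_cache_per_replica(params_b: float, extra_context_k: float,
                         users_per_replica: int) -> float:
    """Total KV cache a replica must hold for its concurrent users."""
    return kv_cache_per_user(params_b, extra_context_k) * users_per_replica

# 7B model, 8K of extra context, 10 concurrent users per replica:
print(kv_cache_per_replica(7, 8, 10))  # 40.0
```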

Enterprise TCO Calculation

The Enterprise deployment planner compares three options: pay-per-use APIs, cloud GPU rental, and on-premise hardware.

On-Premise Running Costs

electricity = power_kW × hours/month × $0.12/kWh
cooling = electricity × 0.4 (PUE 1.4)
maintenance = hardware_cost × 8% / 12
network = $20 base + $50/replica
total_monthly = electricity + cooling + maintenance + network
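The running-cost formulas above translate directly into code; the example inputs (2 kW draw, 730 hours, a $30K build, one replica) are illustrative numbers, not recommendations.

```python
def onprem_monthly(power_kw: float, hours_per_month: float,
                   hardware_cost: float, replicas: int,
                   kwh_rate: float = 0.12) -> float:
    """Monthly on-premise running cost, term by term from the formulas above."""
    electricity = power_kw * hours_per_month * kwh_rate
    cooling = electricity * 0.4               # PUE 1.4
    maintenance = hardware_cost * 0.08 / 12   # 8% of hardware cost per year
    network = 20 + 50 * replicas              # $20 base + $50 per replica
    return electricity + cooling + maintenance + network

print(round(onprem_monthly(2, 730, 30_000, 1), 2))  # 515.28
```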

Break-Even Calculation

break_even_months = hardware_cost / (cloud_monthly - onprem_monthly)

The interactive break-even chart lets you drag a time slider to visualize when self-hosting becomes cheaper than API or cloud GPU rental.
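A sketch of the break-even formula, with a guard for the case where on-prem never becomes cheaper (a detail not spelled out above, added here as an assumption):

```python
def break_even_months(hardware_cost: float, cloud_monthly: float,
                      onprem_monthly_cost: float) -> float:
    """Months until cumulative cloud spend exceeds hardware plus on-prem costs."""
    savings = cloud_monthly - onprem_monthly_cost
    if savings <= 0:
        return float("inf")  # on-prem never pays for itself
    return hardware_cost / savings

print(break_even_months(30_000, 3_000, 500))  # 12.0 months
```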

Multi-GPU & Tensor Parallelism

With NVLink

  • Effective VRAM: 95%
  • Bandwidth scaling: 90%
  • Best for: datacenter GPUs (H100, A100)

Without NVLink (PCIe)

  • Effective VRAM: 85%
  • Bandwidth bonus: ~30% per GPU
  • Best for: consumer GPUs (RTX 3090, 4090)
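The effective-VRAM factors above can be sketched as follows. Only the 95%/85% factors and the NVLink 90% bandwidth scaling come from the text; this function shape is an illustration, not the site's actual implementation.

```python
def effective_vram_gb(vram_per_gpu: float, n_gpus: int, nvlink: bool) -> float:
    """Usable pooled VRAM across GPUs under tensor parallelism."""
    factor = 0.95 if nvlink else 0.85  # NVLink vs PCIe efficiency, per the text
    return vram_per_gpu * n_gpus * factor

print(effective_vram_gb(24, 2, nvlink=False))  # 40.8 GB from 2x 24 GB cards
```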

Scoring System

Each model score combines three factors: benchmark quality, speed, and quantization fidelity — weighted by use case:

score = benchmark_score × wQuality + speed_score × wSpeed + quant_score × wQuant
| Use Case  | Quality | Speed | Quant |
|-----------|---------|-------|-------|
| Chat      | 45%     | 30%   | 25%   |
| Coding    | 55%     | 25%   | 20%   |
| Reasoning | 60%     | 15%   | 25%   |
| Creative  | 40%     | 35%   | 25%   |
| Vision    | 50%     | 25%   | 25%   |
| Roleplay  | 35%     | 35%   | 30%   |
| Embedding | 70%     | 10%   | 20%   |

Quantization score is remapped from quality [0.88–1.0] to [0–100]: Q4 = 50, Q6 = 75, Q8 = 96, FP16 = 100.

Coverage penalty: Models with fewer benchmarks get a confidence penalty (1 of 4 benchmarks = 77.5% of potential score, 4 of 4 = 100%).
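The weighted score and the quantization remap can be sketched together; the weights are from the table, and the remap reproduces the stated Q4/Q6/Q8/FP16 values. (The coverage penalty is omitted here, since only its endpoints are given above.)

```python
# (quality, speed, quant) weights per use case, from the table above.
WEIGHTS = {
    "chat": (0.45, 0.30, 0.25),
    "coding": (0.55, 0.25, 0.20),
    "reasoning": (0.60, 0.15, 0.25),
}

def quant_score(quality: float) -> float:
    """Linearly remap quantization quality from [0.88, 1.0] to [0, 100]."""
    return (quality - 0.88) / 0.12 * 100

def model_score(benchmark: float, speed: float,
                quant_quality: float, use_case: str) -> float:
    wq, ws, wz = WEIGHTS[use_case]
    return benchmark * wq + speed * ws + quant_score(quant_quality) * wz

print(round(quant_score(0.94)))  # 50, matching the stated Q4 value
```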

Data Sources

Data Validation

All automated data updates pass through validation guardrails:

  • GPU prices must fall within the $80–$100K range
  • Price swings of more than 50% from the previous value are rejected
  • Model benchmark scores must be between 0 and 100
  • Duplicate model IDs are detected and deduplicated
  • A model-count guardrail prevents an accidental data wipe
  • Any validation failure blocks the update; no bad data is committed

Limitations

  • KV cache formula is simplified; actual size depends on architecture (GQA, MQA, MLA)
  • Speed estimates assume single-user inference; batched serving behaves differently
  • MoE models: VRAM is sized for all parameters, but speed is estimated from the active parameters only
  • Real performance varies ±20% by inference engine (Ollama vs vLLM vs llama.cpp)
  • Prefill utilization (~4%) is calibrated against the RTX 4090 and may differ for other hardware classes
  • Multi-GPU efficiency differs between Find Models (PCIe model) and Build for Model / Enterprise (tensor-parallelism model)
  • API prices are updated bi-monthly and may become stale between updates