Why most LLM VRAM calculators are wrong on modern models (and an open-source MIT fix)
DEV Community Grade 9 9d ago


🔗 Try it (free, no signup): fitllm.run
⭐ Open source (MIT, one file): github.com/click6067-ship-it/fitllm-engine

Most "can I run this LLM?" calculators estimate the KV cache with the textbook formula:

KV ≈ 2 × layers × kv_heads × head_dim × context × bytes

The formula assumes every layer keeps a full-context KV cache with a single head shape. That holds for Llama-1/2 but is wrong for most 2025–2026 models:

- Gemma 4 is a 5:1 sliding-window:global interleave: most layers hold only the last 1024 tokens, and global layers use a different head shape.
- [...] token-proportional KV.
- MoE keeps every expert resident even if only a few activate per token.

So the naive number overcounts the KV-cache term (~4× on Qwen 3.6, ~11× on Gemma 4 31B at long context), enough that a model the naive formula marks as "won't fit" actually fits. A second common slip is applying the GGUF weight quant to the KV cache: llama.cpp keeps KV at f16 by default, so weight bits ≠ KV bits.

FitLLM reads each model's official config.json live and models sliding-window / linear / global / MoE layers separately. It reproduces Gemma 4 31B's published 20.78 GiB full-context KV. It covers Apple Silicon and NVIDIA RTX, and you can paste any Hugging Face model id.

It's an estimator, not ground truth (tok/s especially is bandwidth-bound). The whole calculation engine is one readable MIT file, so you can audit the math, fork it, or PR a correction:

👉 https://github.com/click6067-ship-it/fitllm-engine

Try it: https://fitllm.run
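To make the overcount concrete, here is a minimal Python sketch comparing the textbook formula with a per-layer-group sum. It is not FitLLM's engine: the `LayerGroup` schema, the function names, and every number in `HYPOTHETICAL_GROUPS` (layer counts, head shapes, context length) are invented for illustration and are not taken from any real config.json. Only the 5:1 interleave and the 1024-token window come from the post. `bytes_per_elem` is the KV-cache element size (2 for f16), chosen independently of the weight quantization. Linear-attention layers and MoE weights are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass(frozen=True)
class LayerGroup:
    """Layers sharing one attention type and KV head shape (hypothetical schema)."""
    count: int              # number of layers in this group
    kv_heads: int           # KV heads per layer
    head_dim: int           # dimension per head
    window: Optional[int]   # sliding-window size in tokens; None = full-context layer


def naive_kv_bytes(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    """Textbook formula: every layer caches K and V for the whole context."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem


def layered_kv_bytes(groups, context, bytes_per_elem=2):
    """Sum KV per group; a sliding-window layer caches at most `window` tokens."""
    total = 0
    for g in groups:
        cached_tokens = context if g.window is None else min(context, g.window)
        total += 2 * g.count * g.kv_heads * g.head_dim * cached_tokens * bytes_per_elem
    return total


# Made-up model: 60 layers in a 5:1 sliding:global interleave.
# Head shapes are illustrative assumptions, not a real model's values.
HYPOTHETICAL_GROUPS = [
    LayerGroup(count=50, kv_heads=8, head_dim=128, window=1024),  # sliding-window layers
    LayerGroup(count=10, kv_heads=4, head_dim=256, window=None),  # global layers
]

if __name__ == "__main__":
    context = 131072  # assumed context length for the example
    naive = naive_kv_bytes(60, 8, 128, context)
    layered = layered_kv_bytes(HYPOTHETICAL_GROUPS, context)
    print(f"naive:   {naive / GIB:.2f} GiB")
    print(f"layered: {layered / GIB:.2f} GiB")
    print(f"overcount: {naive / layered:.2f}x")
```

With these made-up numbers the naive formula gives 30 GiB and the layer-aware sum about 5.2 GiB, an overcount of roughly 5.8×. The two agree at short contexts (below the window size) and diverge as context grows, which is why the error matters most at long context. The ratio for a real model depends on its actual layer mix and head shapes.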
