vLLM vs llama.cpp vs Ollama: What Happens When Your Model Doesn't Fit in 24GB VRAM
TL;DR
Benchmarked llama.cpp, Ollama, and vLLM across 5 models (1B to 116.8B params) on one RTX 3090 (24GB) + 128GB RAM home-lab box, priced through HomeLab Monitor.
Inside 24GB, vLLM's continuous batching scales aggregate throughput 3.9x-5.4x from concurrency 1 to 8 (llama.cpp only manages 1.2x-1.9x, even with -np 8 explicitly set to match). Past 24GB - two models deliberately chosen to force RAM-spill - llama.cpp and Ollama both degrade to single-digit tok/s and keep generating. vLLM OOMs outright on both, at the same ~22.1-22.2GB-used / <700MB-free ceiling, regardless of quantization scheme.
Sub-plot: llama.cpp's manually-tuned layer offload beats Ollama's automatic split by 37x on time-to-first-token during RAM-spill, while landing on nearly identical steady-state decode speed.
The Roster
| Model | Vendor | Type | Fits in 24GB? |
|---|---|---|---|
| Gemma 3 1B | Google | dense | yes |
| Qwen3-Coder 30B-A3B | Alibaba | MoE (~3.3B active) | yes |
| Gemma 4 26B-A4B | Google | MoE (~4B active) | yes |
| GLM-4.5-Air 106B-A12B | Zhipu | MoE (~12B active) | no, deliberately |
| GPT-OSS 120B-A5.1B | OpenAI | MoE (~5.1B active) | no, deliberately |
(Gemma 4 is real - Google's newest release as of this writing, not a Gemma 3 typo.)
3 prompt tiers (short/medium/long), concurrency 1 and 8, 2 reps per cell, 15 backend×model pairs total.
Caveat stated up front: the first three models ran against my production Ollama (OLLAMA_NUM_PARALLEL=1, serialized by default - real daily-use config); GLM and GPT-OSS ran against a separate isolated instance (OLLAMA_NUM_PARALLEL=4) since they needed a clean volume anyway. Ollama's concurrency=8 numbers for the first three models are not its concurrency ceiling - they're its actual default production behavior.
Concurrency, Inside 24GB
Aggregate decode tok/s, concurrency 1 → concurrency 8:
| Model | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Gemma 3 1B | 125.6 → 71.4 | 294.1 → 400.6 | 235.5 → 1172.1 |
| Qwen3-Coder 30B-A3B | 129.3 → 108.4 | 157.2 → 183.9 | 172.0 → 677.9 |
| Gemma 4 26B-A4B | 84.5 → 78.5 | 118.8 → 220.6 | 133.8 → 723.4 |
vLLM's own c1→c8 scaling: 3.9x-5.4x (paged attention, requests slot into idle cycles). llama.cpp's, even with -np 8 matched to the concurrency level: 1.2x-1.9x - it pre-declares a fixed KV-cache reservation per parallel slot before the server starts, so concurrency is a config decision, not a runtime one.
Head-to-head at c8: vLLM beats llama.cpp by 2.9x-3.7x, beats Ollama's serialized default by 6.3x-16.4x (caveat above applies).
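The ratios quoted above can be re-derived from the table with a few lines of Python. This is only an arithmetic check on the published aggregate figures; the dictionary layout and function names are my own, not part of any benchmark harness.

```python
# Aggregate decode tok/s as (concurrency 1, concurrency 8), copied from the table above.
RESULTS = {
    "Gemma 3 1B": {
        "ollama": (125.6, 71.4), "llama.cpp": (294.1, 400.6), "vllm": (235.5, 1172.1),
    },
    "Qwen3-Coder 30B-A3B": {
        "ollama": (129.3, 108.4), "llama.cpp": (157.2, 183.9), "vllm": (172.0, 677.9),
    },
    "Gemma 4 26B-A4B": {
        "ollama": (84.5, 78.5), "llama.cpp": (118.8, 220.6), "vllm": (133.8, 723.4),
    },
}


def scaling(backend):
    """c8 / c1 aggregate throughput for one backend, per model."""
    return {model: row[backend][1] / row[backend][0] for model, row in RESULTS.items()}


def head_to_head(a, b):
    """Backend a's c8 throughput divided by backend b's c8 throughput, per model."""
    return {model: row[a][1] / row[b][1] for model, row in RESULTS.items()}


for label, ratios in [
    ("vLLM c1->c8", scaling("vllm")),
    ("llama.cpp c1->c8", scaling("llama.cpp")),
    ("vLLM vs llama.cpp @ c8", head_to_head("vllm", "llama.cpp")),
    ("vLLM vs Ollama @ c8", head_to_head("vllm", "ollama")),
]:
    print(f"{label}: {min(ratios.values()):.1f}x - {max(ratios.values()):.1f}x")
```

Ollama's c1→c8 ratio comes out below 1.0 for all three models, which is what a serialized OLLAMA_NUM_PARALLEL=1 setup would be expected to produce.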
The Cliff, and vLLM's Wall
GLM-4.5-Air (~52% of layers spilled to system RAM under llama.cpp's tuning) and GPT-OSS-120B (~67% spilled) were picked specifically to not fit. llama.cpp and Ollama both ran them - slow, single-digit tok/s, but real generation, no crash.
vLLM failed outright on both:
# GPT-OSS-120B, native MXFP4, --cpu-offload-gb 45
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 533.69 MiB is free.
Process ... has 22.21 GiB memory in use.
RuntimeError: Engine core initialization failed.
# GLM-4.5-Air, pre-quantized AWQ, --cpu-offload-gb 36
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB.
GPU 0 has a total capacity of 23.56 GiB of which 685.69 MiB is free.
Process ... has 22.12 GiB memory in use.
Same shape, different model, different quantization path. I retried GLM at --gpu-memory-utilization 0.78 (down from 0.90, to force more declared headroom) - got the byte-for-byte identical error: 22.12 GiB used, 685.69 MiB free, 1.16 GiB requested. That rules out the utilization knob as the fix; the base weight + offload footprint is already pinned at the ceiling before profiling starts.
Two models, two quant schemes, same ~22GB wall - reads as a real limit of vLLM's CPU-offload path for >100B-param MoE on one 24GB card on this stack, not a per-model quirk.
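To make "same wall" concrete, here is a rough sketch that pulls the requested and free figures out of the two OOM messages above and reports how far short each allocation fell. The regexes assume the message wording shown in these logs; other PyTorch versions may phrase it differently, and the helper name is mine.

```python
import re

# The two OOM messages from the logs above, joined onto one line each.
OOM_MESSAGES = {
    "GPT-OSS-120B": (
        "OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB. "
        "GPU 0 has a total capacity of 23.56 GiB of which 533.69 MiB is free."
    ),
    "GLM-4.5-Air": (
        "OutOfMemoryError: CUDA out of memory. Tried to allocate 1.16 GiB. "
        "GPU 0 has a total capacity of 23.56 GiB of which 685.69 MiB is free."
    ),
}

TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def shortfall_mib(message):
    """Requested allocation minus free memory, in MiB (positive = cannot fit)."""
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    free = re.search(r"of which ([\d.]+) (MiB|GiB) is free", message)
    if requested is None or free is None:
        raise ValueError("not a recognizable CUDA OOM message")
    requested_mib = float(requested.group(1)) * TO_MIB[requested.group(2)]
    free_mib = float(free.group(1)) * TO_MIB[free.group(2)]
    return requested_mib - free_mib


for model, message in OOM_MESSAGES.items():
    print(f"{model}: short by {shortfall_mib(message):.2f} MiB")
```

Both allocations miss by roughly half a GiB, so neither failure is a near miss that a small tweak to headroom would obviously rescue.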
TTFT: The 37x Gap That Steady-State Doesn't Show
On the two spilled models, which ran on both llama.cpp and Ollama, steady-state decode is nearly a tie once warmed up - GPT-OSS-120B's longest tier: 7.65 tok/s (llama.cpp) vs 7.6 tok/s (Ollama). GLM: 4.58 vs 4.59.
Time-to-first-token is a different story:
| Model | Ollama TTFT | llama.cpp TTFT | Gap |
|---|---|---|---|
| GLM-4.5-Air | 13.6s | 8.1s | 1.7x |
| GPT-OSS-120B | 274.0s | 7.3s | 37x |
llama.cpp's -ngl is a number I computed myself from the model's real config.json (layer count, per-layer size) - -ngl 12 for GPT-OSS, deliberately placing ~21GB on the GPU. Ollama figures the split out automatically at load time, and on a freshly-pulled, partially-RAM-resident 65GB model, that automatic path is expensive. Same destination, very different route to get there.
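A minimal sketch of that kind of -ngl estimate, under loud assumptions: every layer is treated as the same size, the 36-layer count for GPT-OSS-120B is my assumption (read num_hidden_layers from the model's config.json to confirm), and the 2GB reserve for KV cache and compute buffers is an illustrative guess. It is not the exact calculation used for the runs above.

```python
import math


def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb):
    """Estimate how many layers fit on the GPU, assuming uniform layer size."""
    per_layer_gb = model_gb / n_layers
    usable_gb = vram_gb - reserve_gb  # leave room for KV cache and buffers
    if usable_gb <= 0:
        return 0
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative inputs: 65GB model, 24GB card.
# ASSUMPTION: 36 layers and a 2GB reserve; neither figure comes from the post.
N_LAYERS = 36
ngl = layers_that_fit(model_gb=65.0, n_layers=N_LAYERS, vram_gb=24.0, reserve_gb=2.0)
on_gpu_gb = ngl * 65.0 / N_LAYERS
spilled_fraction = 1 - ngl / N_LAYERS

print(f"-ngl {ngl}: ~{on_gpu_gb:.1f}GB on GPU, {spilled_fraction:.0%} of layers in system RAM")
```

In practice embeddings, output layers, and context length shift the real number, so an estimate like this is a starting point to verify against actual VRAM usage after load.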
What It Costs (BGN per 1M Output Tokens, Real GPU Energy)
| Model | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Gemma 3 1B | 0.19 | 0.05 | ~0* |
| Gemma 4 26B-A4B | 0.25 | 0.14 | 0.04 |
| Qwen3-Coder 30B-A3B | 0.16 | 0.13 | 0.04 |
| GLM-4.5-Air | 2.61 | 1.95 | OOM |
| GPT-OSS-120B | 10.00 | 1.43 | OOM |
*vLLM's Gemma 3 1B run finished in 6s - too fast for the power sampler to catch a reading, recorded near-zero. A sampling limitation on short bursts, not a genuine free result.
GPT-OSS-120B on Ollama costs ~7x more real electricity per million tokens than llama.cpp for the identical model - the TTFT convenience tax from above, showing up again in currency.
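For readers who want to reproduce a cost-per-million figure on their own hardware, here is a hedged sketch of the arithmetic: average sampled power times run duration gives energy, which is priced per kWh and normalized to 1M output tokens. The function name, the sample inputs, and the electricity price are hypothetical, and this is not a description of how HomeLab Monitor computes its numbers.

```python
def cost_per_million_tokens(power_samples_w, duration_s, output_tokens, price_per_kwh):
    """Energy cost per 1M output tokens, or None if there is nothing to price.

    Returning None for an empty sample list is one way to keep a too-short
    run from being reported as nearly free.
    """
    if not power_samples_w or output_tokens <= 0:
        return None
    avg_watts = sum(power_samples_w) / len(power_samples_w)
    energy_kwh = avg_watts * duration_s / 3_600_000  # W * s = J; 3.6e6 J per kWh
    return energy_kwh * price_per_kwh / output_tokens * 1_000_000


# Hypothetical run: 300W average, 1000s, 10,000 output tokens, 0.25 per kWh.
example_cost = cost_per_million_tokens([300.0, 300.0], 1000.0, 10_000, 0.25)
too_short = cost_per_million_tokens([], 6.0, 5_000, 0.25)

# Ratio quoted in the text, from the table: Ollama vs llama.cpp on GPT-OSS-120B.
gpt_oss_ratio = 10.00 / 1.43

print(example_cost, too_short, round(gpt_oss_ratio, 1))
```

Because cost scales with wall time per token, a long load phase with few tokens produced inflates the per-token figure even when steady-state decode speed is the same.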
Three Disclosed vLLM Checkpoint Swaps
The original plan was on-the-fly bitsandbytes 4-bit quant for every vLLM leg. It failed for every MoE model, for three distinct, verified reasons - not the same error copy-pasted three times:
Qwen3-Coder-30B:
ValueError: BitsAndBytes quantization with padded hidden_size ... Parameter shape (786432, 1) != checkpoint shape (2048, 768)
bnb can't dequantize this MoE's padded expert layout. Fix: pre-quantized AWQ checkpoint. Ran clean after (677.9 tok/s aggregate @ c8).
Gemma 4 26B-A4B:
AttributeError: MoE Model Gemma4ForConditionalGeneration does not support BitsAndBytes quantization yet.
A new architecture, bnb path not wired up yet. Fix: a different pre-quantized checkpoint - which then hit a pydantic error because its config.json says compressed-tensors, not AWQ, despite the repo name. Fixed by dropping the explicit --quantization flag entirely and letting vLLM auto-detect.
GLM-4.5-Air:
Not a failure - a practicality call. Skipped a 212GB native bf16 download to test a bnb+MoE+CPU-offload combo the vLLM community already flagged as shaky, went straight to a ~63GB pre-quantized AWQ checkpoint that tests the exact same question.
Every root cause above came from the actual container logs, not from assuming precedent carried over from the previous model's failure.
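The Gemma 4 mismatch (repo name says one thing, config.json says another) is cheap to catch before launching a server. A small sketch, assuming the checkpoint follows the common Hugging Face convention of a quantization_config block with a quant_method field; the example config below is made up for illustration, not copied from any real repo.

```python
import json


def declared_quant_method(config_text):
    """Return the quant_method a checkpoint's config.json declares, or None."""
    config = json.loads(config_text)
    quant_config = config.get("quantization_config") or {}
    return quant_config.get("quant_method")


# Hypothetical config.json contents for a checkpoint whose repo name says "AWQ".
example_config = json.dumps({
    "architectures": ["Gemma4ForConditionalGeneration"],
    "quantization_config": {"quant_method": "compressed-tensors"},
})

method = declared_quant_method(example_config)
if method is not None and method.lower() != "awq":
    print(f"config.json declares {method!r}; an explicit AWQ flag would not match")
```

If the declared method disagrees with the flag you planned to pass, dropping the flag and letting the loader auto-detect is the lower-risk option, as it was here.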
What Wasn't Tested
- Only two --gpu-memory-utilization values before accepting the OOM as final, not a full --cpu-offload-gb sweep.
- No multi-GPU / tensor-parallel vLLM path - a different question from "does single-card CPU offload work."
- Ollama's c8 numbers for the first three models are its production default, not its concurrency ceiling.
- One raw llama.cpp per-request timing (Gemma 4, medium tier, c8) self-reported an impossible 250,024 tok/s from a near-zero-duration completion - the aggregate figures used throughout are total-tokens-over-wall-time, which isn't corrupted by that, but it's a known rough edge in the raw per-request logs.
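The last point is easy to see in a toy example. The request data below is hypothetical, chosen only to show why a per-request rate explodes on a near-zero self-reported duration while total-tokens-over-wall-time does not.

```python
def per_request_rates(requests):
    """tok/s per request from self-reported durations; tiny durations inflate it."""
    return [tokens / duration_s for tokens, duration_s in requests]


def aggregate_rate(requests, wall_time_s):
    """Total tokens over measured wall time; ignores self-reported durations."""
    return sum(tokens for tokens, _ in requests) / wall_time_s


# Hypothetical batch: (output tokens, self-reported duration in seconds).
# The third entry mimics a completion that reported a near-zero duration.
requests = [(256, 2.1), (256, 2.0), (250, 0.001)]
wall_time_s = 2.2  # measured externally, from first request sent to last response

rates = per_request_rates(requests)
aggregate = aggregate_rate(requests, wall_time_s)

print([round(r) for r in rates])
print(round(aggregate, 1))
```

The aggregate depends only on token counts and an external clock, so one bad self-reported timing cannot distort it.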
Full narrative version, with the RAM-spill mechanics and the redacted dashboard screenshot: on Medium.
Every number above was priced through HomeLab Monitor - open source, MIT licensed - against the RTX 3090's real power draw.
If you're already running one of these three backends: has yours ever tried to load something that just didn't fit - and did it fail loud or fail quiet?