DEV Community
Grade 9
9d ago
Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching
The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt. At Amalgafy Labs , we built the Micro-Expert-Router (MER) to challenge this assumption. We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines. Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing pure host CPU execution , the engine delivered a sustained 21.38 Tokens Per Second (TPS) over a massive 5,000-token context window. The full source code is now open-source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER . The Evaluation Substrate The benchmark was executed inside an isolated virtual machine environment under strict compute constraints: Compute Engine: Pure Host CPU execution using native AVX-512 vector extensions . Zero active GPU VRAM or Tensor Cores were utilized for the Feed-Forward Network (FFN) layers. Memory Footprint: Standard cloud instance profile allocated with 128 GB of System RAM . Storage Substrate: Attached Local NVMe SSD bypassing standard OS file system overhead via kernel-level io_uring and O_DIRECT asynchronous queues. Target Model: Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step). Precision Format: 4-bit quantization layout ( dtype=q4_0 ). Test Profile: Continuous execution over 5,000 tokens (Seed: 12648430 ). Raw Telemetry Logs text 2026-06-04T15:10:41.520446ZINFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001 2026-06-04T15:10:41.520511ZINFO ===================== run summary ===================== 2026-06-04T15:10:41.520519ZINFO experts: 256 (top-2), cache=256 slots, pool=258 slots 2026-06-04T15:10:41.520522ZINFO ffn shape:d_model=4096d_ff=14336bytes/expert=99090432 (dtype=q4_0) 2026-06-04T15:10:41.520534ZINFO lookups: hits=9746misses=254hit_rate=97.46% 2026-06-04T15:10:41.520540ZINFO prefetches:completed=2predictor_observations=19996 2026-06-04T15:10:41.520546ZINFO i/o: reads=254bytes=24193.00 MiB 2026-06-04T15:10:41.520557ZINFO i/o latency: p50=116543usp95=233599usp99=360191us 2026-06-04T15:10:41.520563ZINFO compute: p50=40255usp95=41631usp99=60735us(SwiGLU FFN per token) 2026-06-04T15:10:41.520569ZINFO cycle latency: p50=40287usp95=42047usp99=286975usmax=431615us 2026-06-04T15:10:41.520576ZINFO per-token avg: io_wait=5772.7uscompute=40850.5us(over 5000 tokens) 2026-06-04T15:10:41.520582ZINFO I/O share: 12.37% of token cycle time spent waiting on SSD reads 2026-06-04T15:10:41.520588ZINFO energy knobs:dtype=q4_0partial_load_fraction=1.00pinned=0alias_redirects=0 2026-06-04T15:10:41.520595ZINFO =======================================================
The current consensus in AI infrastructure is unyielding: if you want to run frontier Mixture of Experts (MoE) models at usable human-reading inference speeds, you must pay the VRAM premium. The entire model footprint is traditionally pinned into high-bandwidth GPU memory arrays to prevent execution pipelines from grinding to a halt. At Amalgafy Labs, we built the Micro-Expert-Router (MER) to challenge this assumption. We wanted to prove that with low-level systems engineering, an intelligent software abstraction layer can turn cheap, abundant, commodity CPU-heavy cloud shapes into high-throughput inference engines. Yesterday, we took the engine out of the "proven on paper" phase and validated it on live cloud silicon. Running Mixtral 8x7B (47B parameters, q4_0 quantization) on a standard virtual machine utilizing pure host CPU execution, the engine delivered a sustained 21.38 Tokens Per Second (TPS) over a massive 5,000-token context window. The full source code is now open-source on GitHub: randyap8-wq/Micro-Expert-Router-SSD-Streamed-MoE-MER. The Evaluation Substrate The benchmark was executed inside an isolated virtual machine environment under strict compute constraints: - Compute Engine: Pure Host CPU execution using native AVX-512 vector extensions. Zero active GPU VRAM or Tensor Cores were utilized for the Feed-Forward Network (FFN) layers. - Memory Footprint: Standard cloud instance profile allocated with 128 GB of System RAM. - Storage Substrate: Attached Local NVMe SSD bypassing standard OS file system overhead via kernel-level io_uring andO_DIRECT asynchronous queues. - Target Model: Mixtral 8x7B (MoE architecture, 46.7B total parameters, Top-2 expert routing per token step). - Precision Format: 4-bit quantization layout ( dtype=q4_0 ). - Test Profile: Continuous execution over 5,000 tokens (Seed: 12648430 ). Raw Telemetry Logs text 2026-06-04T15:10:41.520446Z INFO stream complete wall_s=233.828879846 sustained_tps=21.383158501605987 avg_throughput_mibps=103.46455072587074 hit_rate_pct=97.46000000000001 2026-06-04T15:10:41.520511Z INFO ===================== run summary ===================== 2026-06-04T15:10:41.520519Z INFO experts: 256 (top-2), cache=256 slots, pool=258 slots 2026-06-04T15:10:41.520522Z INFO ffn shape: d_model=4096 d_ff=14336 bytes/expert=99090432 (dtype=q4_0) 2026-06-04T15:10:41.520534Z INFO lookups: hits=9746 misses=254 hit_rate=97.46% 2026-06-04T15:10:41.520540Z INFO prefetches: completed=2 predictor_observations=19996 2026-06-04T15:10:41.520546Z INFO i/o: reads=254 bytes=24193.00 MiB 2026-06-04T15:10:41.520557Z INFO i/o latency: p50=116543us p95=233599us p99=360191us 2026-06-04T15:10:41.520563Z INFO compute: p50=40255us p95=41631us p99=60735us (SwiGLU FFN per token) 2026-06-04T15:10:41.520569Z INFO cycle latency: p50=40287us p95=42047us p99=286975us max=431615us 2026-06-04T15:10:41.520576Z INFO per-token avg: io_wait=5772.7us compute=40850.5us (over 5000 tokens) 2026-06-04T15:10:41.520582Z INFO I/O share: 12.37% of token cycle time spent waiting on SSD reads 2026-06-04T15:10:41.520588Z INFO energy knobs: dtype=q4_0 partial_load_fraction=1.00 pinned=0 alias_redirects=0 2026-06-04T15:10:41.520595Z INFO ======================================================= Top comments (0)
Comments
No comments yet. Start the discussion.