Building a Native 1-Bit LLM Engine in Pure Rust: Achieving 150+ TPS and 350MB Memory Footprint on Edge CPUs. [P]
There's been a ton of academic hype recently around 1-bit quantization, BitNet (1.58b), and pushing LLMs to the absolute edge. I've spent the last few months quietly trying to take this from a theoretical whitepaper into an actual, production-ready reality. I decided to completely bypass PyTorch, `llama.cpp`, BLAS, and CUDA. Instead, I wrote a custom, zero-dependency inference engine entirely from scratch in pure Rust that runs native 1-bit and ternary packed models directly on standard edge CPUs.

To prove I'm not just blowing smoke, I recorded an unedited video demo running a heavily distilled `.leviathan2` model (TinyLlama converted to my custom 1-bit format). You can watch it hitting 150+ TPS natively on the CPU while using less than 350MB of RAM.

The video showcases three proof models mapping my journey to the Holy Grail of LLM compression:

- TinyLlama (`tinyllama.leviathan2`): A raw 1-bit speed test. It uses standard uniform quantization without compensation, which loses coherence, but perfectly proves the engine hits 150+ TPS natively on the CPU.
- Qwen D-QLoRA (`qwen_fluent.leviathan2`): A fluent model compiled at **2-bit (Ternary)** precision. It proves the engine natively mounts INT8 dynamic adapters to retain 100% intelligence, though it runs slightly slower.
- [THE BREAKTHROUGH] TinyLlama Hybrid (`tinyllama_hybrid.leviathan3`): The absolute Holy Grail. I cracked the math to combine extreme compression with intelligence recovery. This model uses a proprietary algorithm that pushes 1-bit quantization loss forward across the matrix and injects residual error catchers dynamically. The result? **16x compression (1-bit), blazing fast 150+ TPS, AND 100% fluent English intelligence retention.**

(All three models and their real-time performance metrics are showcased in the video link.)

What did I actually build?

The system has two main parts: a compression pipeline and the native inference engine.
The Compression Pipeline (`.leviathan2`): Standard model weights (like `.safetensors`) are massive and require heavy frameworks just to parse. I wrote a custom compression script that distills standard LLM architectures down into a 1-bit or ternary state. It packs these weights at the sub-byte level and embeds the architectural definitions (GQA, RoPE base, etc.) directly into a contiguous memory-mapped byte buffer.

The Rust Inference Engine: The engine loads this `.leviathan2` binary directly into memory without any external frameworks. It performs multiplier-free matrix operations (because the weights are just -1, 0, 1) to generate tokens.

Currently supported architectures:

- LLaMA-based models (TinyLlama, Llama 2, Llama 3)
- Qwen-based models

*(The engine dynamically parses GQA, RoPE Base, and dynamic dimensions at runtime.)*

How I got it so fast (and bulletproof for Edge AI)

I'm keeping the core engine code closed-source for now, but I want to share exactly *why* it's so fast, stable, and ready for technical due diligence:

Multiplier-Free SIMD (AVX2 & NEON): I threw out standard matrix multiplication. The engine uses custom, unrolled `AVX2` pipelines for x86 to perform pure addition and subtraction. For ARM/Macs, I implemented `NEON` SIMD. Because NEON doesn't natively have a "multiply unsigned by signed" like AVX2, I used a hardware maneuver: loading the unpacked bits as signed 8-bit integers via `vld1q_s8`, which lets the engine use the powerhouse `vdotq_s32` operation to accumulate the dot product of 4 pairs of 8-bit integers into each 32-bit lane (16 pairs per instruction).

Prefill-GEMM Batching vs Generation: Most engines bottleneck during prompt ingestion (prefilling 4,000 words). I wrote a custom SIMD batching kernel that accepts an aggregated block of tokens, unpacks the 1-bit weights *only once* into SIMD registers, and then projects them across multiple tokens simultaneously.
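To make the sub-byte packing, the multiplier-free arithmetic, and the "unpack once, project across many tokens" idea concrete, here is a minimal scalar sketch in Rust. It is not the engine's code and not the `.leviathan2` layout: the 2-bit encoding, the names `pack_ternary` and `ternary_matmul_batched`, and the `i32` activations are all assumptions made for illustration, and a real kernel would do the same work with SIMD rather than scalar loops.

```rust
// Hypothetical 2-bit codes for one ternary weight (illustrative only,
// not the actual .leviathan2 encoding).
const ZERO: u8 = 0b00;
const PLUS: u8 = 0b01;
const MINUS: u8 = 0b10;

/// Pack ternary weights (-1, 0, +1) four per byte, lowest bits first.
pub fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    let mut packed = vec![0u8; (weights.len() + 3) / 4];
    for (i, &w) in weights.iter().enumerate() {
        let code = match w {
            1 => PLUS,
            -1 => MINUS,
            _ => ZERO,
        };
        packed[i / 4] |= code << ((i % 4) * 2);
    }
    packed
}

/// Read the 2-bit code of weight number `i` from the packed buffer.
fn code_at(packed: &[u8], i: usize) -> u8 {
    (packed[i / 4] >> ((i % 4) * 2)) & 0b11
}

/// Batched projection for prefill: out[t][r] = sum over c of W[r][c] * x[t][c].
/// Each weight is decoded once and reused for every token in the batch,
/// and the "multiply" is only an add, a subtract, or a skip.
pub fn ternary_matmul_batched(
    packed: &[u8],       // rows * cols weights, row-major, packed 4 per byte
    rows: usize,
    cols: usize,
    tokens: &[Vec<i32>], // one activation vector of length `cols` per token
) -> Vec<Vec<i32>> {
    let mut out = vec![vec![0i32; rows]; tokens.len()];
    for r in 0..rows {
        for c in 0..cols {
            let code = code_at(packed, r * cols + c);
            if code == ZERO {
                continue; // zero weights contribute nothing
            }
            for (t, x) in tokens.iter().enumerate() {
                if code == PLUS {
                    out[t][r] += x[c];
                } else {
                    out[t][r] -= x[c];
                }
            }
        }
    }
    out
}
```

Calling `ternary_matmul_batched` with a single token gives the generation-time case, where every weight is decoded for just one activation vector. Passing a whole prompt's worth of tokens reuses each decoded weight across the batch, which is the property the prefill kernel relies on.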
This batching shifts prompt ingestion from a memory-bandwidth-bound operation to a highly efficient compute-bound GEMM operation.

SpQR Cache-Miss Elimination: To retain high outlier percentages without destroying the CPU L1/L2 cache during sparse matrix operations, I introduced an ascending sort on the outlier indices during export. This forces the Rust sparse layer to traverse the target output buffer strictly linearly, which works with the CPU's hardware prefetchers to avoid random-access cache-miss penalties.

Native Multi-Threading: The matrix operations are chunked and split across all physical CPU cores using Rayon parallel iterators. It fully saturates edge processors.

Zero Heap Fragmentation: The context window doesn't dynamically allocate memory. The engine initializes a statically allocated ring buffer on startup. Once the context window fills up, it overwrites the oldest entries using modular indexing. Your RAM stays perfectly flat forever. This also means the engine supports **infinite generation**: once you hit the 2048 token limit, it seamlessly overwrites the oldest context without crashing. It just keeps talking.

Zer
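The ring-buffer context window described above can be sketched in a few lines of Rust. This is an illustration under assumptions, not the engine's implementation: the `RingCache` type, its `width` field (values stored per token), and the method names are hypothetical, and the storage here is a `Vec` allocated once at construction rather than a truly static buffer.

```rust
/// Fixed-capacity ring buffer: all storage is allocated once in `new`,
/// and `push` never allocates afterwards.
pub struct RingCache {
    slots: Vec<f32>, // capacity * width values, allocated up front
    width: usize,    // values stored per token (hypothetical, e.g. one K or V row)
    capacity: usize, // context length in tokens
    written: usize,  // total number of tokens ever pushed
}

impl RingCache {
    pub fn new(capacity: usize, width: usize) -> Self {
        assert!(capacity > 0 && width > 0);
        Self { slots: vec![0.0; capacity * width], width, capacity, written: 0 }
    }

    /// Store one token's row, overwriting the oldest row once the buffer is full.
    pub fn push(&mut self, row: &[f32]) {
        assert_eq!(row.len(), self.width);
        let slot = self.written % self.capacity; // modular indexing
        let start = slot * self.width;
        self.slots[start..start + self.width].copy_from_slice(row);
        self.written += 1;
    }

    /// Number of tokens currently held (never exceeds `capacity`).
    pub fn len(&self) -> usize {
        self.written.min(self.capacity)
    }

    /// Row of the `i`-th oldest token still in the buffer (0 = oldest).
    pub fn get(&self, i: usize) -> &[f32] {
        assert!(i < self.len());
        let oldest = self.written - self.len();
        let slot = (oldest + i) % self.capacity;
        &self.slots[slot * self.width..(slot + 1) * self.width]
    }

    /// Total number of values allocated; constant for the cache's lifetime.
    pub fn allocated_values(&self) -> usize {
        self.slots.len()
    }
}
```

With a capacity of 2048 tokens, the 2049th `push` lands in slot 0 and replaces the oldest token, so memory use stays constant no matter how long generation runs.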