Hacker News 2h ago

Integer Quantization: Deep Dive


A lot has happened in transformer quantization over the past few years: we have gone from barely being able to quantize a 7B model to INT8 without destroying accuracy to routinely fitting a 70B model in 4 bits on a single GPU. But existing guides on the topic are fragmented, focusing either on a specific technique or on how to use a library. I’ve been working on integer quantization for fixed-point hardware for a while now, and my goal with this series is to bridge that gap: building the core ideas carefully and tracing how the field has evolved, with each technique motivated by the problems of what came before. This first post covers the foundations: what quantization is, why it’s hard, and the math behind it.

What is Quantization & why should you care?

Quantization is the process of representing high-precision values using fewer bits. In practice, this means storing weights and (optionally) activations in lower precision (e.g., int8 instead of fp16), at the cost of a small approximation error.

The most immediate and easy-to-realize benefit of quantization is memory reduction. As a rule of thumb, a model with N billion parameters requires roughly 2 × N GB of memory when stored in 16-bit precision. Quantizing to 8-bit or 4-bit reduces this footprint by 2× and 4×, respectively.

There is also a hardware advantage. In 2014, Mark Horowitz of Stanford University published a paper, Computing’s Energy Problem, which compared the energy cost of floating-point and integer operations:

Energy costs for various operations on a 45nm CMOS node. Source: Computing’s Energy Problem

So integer arithmetic consumes less energy: specifically, an int8 add consumes 30x less energy than an fp32 add, and an int8 multiply consumes 18x less energy than an fp32 multiply. Lower-precision hardware is also faster and takes up less silicon area than floating-point hardware.

How do these benefits translate to real-world gains?
It depends on the bottleneck:

- Compute-bound workloads (e.g., CNNs, LLM prefill): Quantization can improve throughput, since lower-precision arithmetic is faster and consumes less energy.
- Memory-bandwidth-bound workloads (e.g., LLM decoding): Quantization reduces the amount of data moved, improving performance by lowering memory bandwidth pressure.

By this point, the motivation should be clear: quantization reduces memory, lowers energy consumption, and can improve performance. Next, we will look at the hardware unit that executes fixed-point arithmetic.

Multiply Accumulate Unit

The dominant operation in neural networks is matrix multiplication. Modern hardware accelerators optimize this using specialized units called Multiply–Accumulate (MAC) units:

Matrix–vector computation in neural network accelerator hardware. Source: A White Paper on Neural Network Quantization

The diagram represents a typical matrix–vector multiply unit in neural network accelerators. This is the building block for matrix multiplications and convolutions. The two fundamental components are the processing elements \(C_{n,m}\) and the accumulators \(A_n\). The computation proceeds as follows:

- The accumulators are first initialized with the bias value \(b_n\)
- In the next cycle, weights \(W_{n,m}\) and input values \(x_m\) are loaded
- Their product is computed at each processing element: $$C_{n,m} = W_{n,m} \cdot x_m$$
- The results are then accumulated: $$A_n = b_n + \sum_{m} C_{n,m}$$

How is quantization done?

Starting from a real-valued vector \(x\), we map it to an integer grid \(\{x_{\text{int}}^{\min}, \ldots, x_{\text{int}}^{\max}\}\):

$$x_{\text{int}} = \operatorname{clamp}\left( \left\lfloor \frac{x}{s} \right\rceil + z;\; x_{\text{int}}^{\min},\; x_{\text{int}}^{\max} \right)$$

Here:

- \(s\) is the scale
- \(z\) is the zero-point (offset)
- \(\lfloor \cdot \rceil\) denotes rounding to the nearest integer

The clamp operation ensures the result lies within the valid integer range:

$$\operatorname{clamp}(x;\, a,\, c) = \begin{cases} a, & x < a \\ x, & a \le x \le c \\ c, & x > c \end{cases}$$

So, the idea is to scale and shift the floating-point value, then clamp it to fit within the integer grid.
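The scale, shift, round, and clamp steps can be sketched in a few lines of plain Python. This is only an illustration: the function name `quantize`, the 4-bit unsigned grid, and the sample values are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def quantize(values, scale, zero_point, int_min, int_max):
    """Map real values onto the integer grid [int_min, int_max]."""
    result = []
    for v in values:
        # Scale, round to the nearest integer, then shift by the zero-point.
        # Note: Python's round() sends exact ties to the nearest even integer.
        q = round(v / scale) + zero_point
        # Clamp so the result stays on the integer grid.
        result.append(max(int_min, min(int_max, q)))
    return result


# Hypothetical parameters: unsigned 4-bit grid (0..15), s = 0.25, z = 4.
x = [-2.0, -1.0, -0.3, 0.0, 0.3, 1.0, 5.0]
x_int = quantize(x, scale=0.25, zero_point=4, int_min=0, int_max=15)
print(x_int)  # [0, 0, 3, 4, 5, 8, 15]
```

The first and last inputs fall outside the representable range, so they are clamped to 0 and 15; everything else is only rounded.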
Quantization Simulation (Fake Quantization)

Instead of running quantized models directly on target hardware, we often simulate quantization on general-purpose hardware using high-level frameworks like PyTorch. This is commonly referred to as fake quantization.

The key idea is simple: we mimic the effects of quantization while still executing operations in floating point. This allows us to study accuracy and run experiments such as Quantization Aware Training (QAT) without requiring specialized hardware. To do this, we:

- Quantize the input to an integer grid
- Dequantize it back to floating point
- Perform all computations in floating point on standard hardware (e.g., GPUs)

The dequantization step maps integers back to real values:

$$\hat{x} = s \left( x_{\text{int}} - z \right)$$

Combining quantization and dequantization, we get:

$$\hat{x} = s \left( \operatorname{clamp}\left( \left\lfloor \frac{x}{s} \right\rceil + z;\; x_{\text{int}}^{\min},\; x_{\text{int}}^{\max} \right) - z \right)$$

In practice, frameworks insert these quant–dequant (Q/DQ) pairs around operations in the model graph. While the computation is still carried out in floating point, the values are constrained to a discrete set, effectively simulating quantization effects during inference.

What are \(q_{\min}\) & \(q_{\max}\)?

At this point, it’ll be helpful to build some intuition around what the quantization formula is actually doing. Think about the minimum value that can come out of the quantization operation. This corresponds to \(x_{\text{int}}^{\min}\). Plugging this into the dequantization formula:

$$q_{\min} = s \left( x_{\text{int}}^{\min} - z \right)$$

Similarly, the maximum value corresponds to \(x_{\text{int}}^{\max}\):

$$q_{\max} = s \left( x_{\text{int}}^{\max} - z \right)$$

The key thing to note here is that we no longer operate over a continuous floating-point range, but over a discrete set of \(2^b\) values, each separated by the scale \(s\).

Quantization Error

There are two “bad guys” in our quantization formula: (1) the rounding operator, which introduces rounding error, and (2) the clamp operator, which introduces clipping error. Together, these are why a quantized network deviates from the original network; the deviation is known as quantization error (or quantization noise).
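Both error sources can be seen numerically with a small fake-quantization sketch (quantize, then immediately dequantize). The helper name `fake_quantize`, the 4-bit unsigned grid, and the sample inputs are assumptions made for illustration; real frameworks provide their own Q/DQ operators.

```python
def fake_quantize(v, scale, zero_point, int_min, int_max):
    """Quantize a float to the integer grid, then map it back to a float."""
    q = round(v / scale) + zero_point        # scale, round, shift
    q = max(int_min, min(int_max, q))        # clamp to the integer grid
    return scale * (q - zero_point)          # dequantize


# Hypothetical parameters: unsigned 4-bit grid (0..15), s = 0.25, z = 4.
scale, zero_point, int_min, int_max = 0.25, 4, 0, 15
q_min = scale * (int_min - zero_point)       # -1.0
q_max = scale * (int_max - zero_point)       # 2.75

for v in [-2.0, 0.3, 1.1, 5.0]:
    v_hat = fake_quantize(v, scale, zero_point, int_min, int_max)
    # Inside [q_min, q_max] only rounding contributes; outside it, clipping dominates.
    kind = "rounding" if q_min <= v <= q_max else "clipping"
    print(f"x={v:+.2f}  x_hat={v_hat:+.2f}  |error|={abs(v - v_hat):.2f}  ({kind})")
```

For the in-range inputs the error stays at or below half the scale, while the two out-of-range inputs are pulled back to \(q_{\min}\) and \(q_{\max}\) and pick up a much larger error.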
As we’ll see, improving one often worsens the other, so quantization is really about balancing the two.

- Rounding error: for a value, the difference between the original floating-point value and the value it gets mapped to on the quantized grid.
- Clipping error: occurs when values fall outside the representable range and get clipped to the minimum or maximum value.

Let’s think about the extremes again. When would the rounding error be at its maximum? The worst case occurs when the fp value lies exactly halfway between two grid points. In this case:

$$\left| x - \hat{x} \right| = \frac{s}{2}$$

So, you might think: “why not make \(s\) very small to minimize this for each value?” Unfortunately, there’s no free lunch when it comes to quantization: reducing \(s\) shrinks the representable range \([q_{\min}, q_{\max}]\), where \(q_{\min} = s \left( x_{\text{int}}^{\min} - z \right)\) and \(q_{\max} = s \left( x_{\text{int}}^{\max} - z \right)\), which increases your clipping error. So how do you choose quantization parameters optimally?

How are quantization params (scale and offset) calculated?

The simplest approach is min–max quantization, where we use an unsigned integer grid (\(0\) to \(2^b - 1\)) and set the scale so that the entire floating-point range fits within it, avoiding clipping. That is, you want \(q_{\max}\) to map to \(fp_{\max}\) and \(q_{\min}\) to map to \(fp_{\min}\). So you solve:

$$s \left( 0 - z \right) = fp_{\min}, \qquad s \left( 2^b - 1 - z \right) = fp_{\max}$$

Solving these gives:

$$s = \frac{fp_{\max} - fp_{\min}}{2^b - 1}, \qquad z = -\frac{fp_{\min}}{s}$$

with \(z\) rounded to the nearest integer so that it lies on the grid.

For a signed grid (\(-2^{b-1}\) to \(2^{b-1} - 1\)), the corresponding method is abs-max quantization, where we scale using the maximum absolute value so that the range is symmetric around zero, avoiding clipping. So you solve:

$$z = 0, \qquad s \left( 2^{b-1} - 1 \right) = \max\left( |fp_{\min}|, |fp_{\max}| \right)$$

Solving gives:

$$s = \frac{\max\left( |fp_{\min}|, |fp_{\max}| \right)}{2^{b-1} - 1}, \qquad z = 0$$

These are the most common formulas for scale and offset on the internet; now you know where they come from!

An example of AbsMax quantization. Author: Maarten Grootendorst

Notice how some of the values in the integer grid are wasted with AbsMax quantization.

But what if your tensor contains outliers, i.e., a very small fraction of values have large magnitudes while most are relatively small?
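To make the outlier question concrete, here is a minimal sketch of min–max and abs-max parameter selection. It assumes the standard forms \(s = (fp_{\max} - fp_{\min}) / (2^b - 1)\), \(z = \lfloor -fp_{\min}/s \rceil\) for the unsigned grid and \(s = \max|x| / (2^{b-1} - 1)\), \(z = 0\) for the signed grid; the function names and the sample tensor are hypothetical.

```python
def min_max_params(values, bits):
    """Asymmetric parameters for an unsigned grid 0 .. 2^bits - 1."""
    fp_min, fp_max = min(values), max(values)
    scale = (fp_max - fp_min) / (2 ** bits - 1)
    zero_point = round(-fp_min / scale)      # integer that represents real 0
    return scale, zero_point


def abs_max_params(values, bits):
    """Symmetric parameters (z = 0) for a signed grid."""
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1)
    return scale, 0


# Hypothetical 4-bit example: most values are small.
values = [-1.0, -0.3, 0.0, 0.3, 2.75]
print(min_max_params(values, bits=4))            # (0.25, 4)

# A single large value stretches the range, so the step size grows 4x.
print(min_max_params(values + [14.0], bits=4))   # (1.0, 1)

print(abs_max_params(values, bits=4))            # scale = 2.75 / 7, zero-point 0
```

With the outlier present, the small values that used to be spread over several grid points now share only a couple of them, which is the coarser grid described next.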
If you use min–max or abs-max quantization, these outliers stretch the range of the grid. As a result, most values get mapped to a coarser grid, leading to higher rounding error. Fortunately, there are alternatives:

- Loss-aware quantization: Choose the quantization parameters (e.g., range, scale) to optimize a metric comparing the floating-point and quantized outputs, such as MSE, cross-entropy, or SQNR.
- Range clipping: Instead of covering the full range, clip extreme values (e.g., using percentiles) to reduce the impact of outliers and improve effective resolution.

Even these do not guarantee that quantization will work well, but thankfully we have a few more axes to play with. That’s exactly what we’ll explore next.

Quantization Categories

Quantization schemes can be divided along several axes, such as the mapping used and the granularity at which it is applied.

Quantization Mapping: Affine vs Symmetric

Affine (asymmetric) quantization uses a non-zero zero-point \(z\), allowing the quantized range to better align with the floating-point distribution. It is commonly used with unsigned integer grids and provides flexibility by shifting where real zero is represented.

In contrast, symmetric quantization enforces \(z = 0\), using a range centered around zero. It is typically paired with signed integer grids and works well when the data distribution is approximately zero-centered.

Symmetric vs Asymmetric Quantization. Source: ResearchGate

Let’s see how this choice affects computation. Recall:

$$A_n = b_n + \sum_{m} W_{n,m} \, x_m$$

When \(W_{n,m}\) and \(x_m\) are quantized:

$$\hat{\mathbf{W}} = s_w \left( \mathbf{W}_{\text{int}} - z_w \right), \qquad \hat{\mathbf{x}} = s_x \left( \mathbf{x}_{\text{int}} - z_x \right)$$

Substituting, we get:

$$\hat{\mathbf{W}} \hat{\mathbf{x}} = s_w s_x \mathbf{W}_{\text{int}} \mathbf{x}_{\text{int}} - s_w z_w s_x \mathbf{x}_{\text{int}} - s_w s_x z_x \mathbf{W}_{\text{int}} + s_w z_w s_x z_x$$

Source: A White Paper On Neural Network Quantization

The first term matches the symmetric case. The third and fourth terms depend only on weights, scales, and offsets, so they can be precomputed and folded into the bias at negligible cost. The second term, however, depends on the input \(x\), meaning it must be computed at inference time, adding latency and power overhead. Hence, a common strategy is asymmetric activations + symmetric weights: setting \(z_w = 0\) makes the second term vanish and avoids this data-dependent cost.
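The four-term split can be checked with plain integer arithmetic for a single output row. The sketch below is illustrative only: `affine_dot_terms` and the sample integers are hypothetical, and the common factor \(s_w s_x\) and the bias are left out since they do not affect the decomposition.

```python
def affine_dot_terms(w_int, x_int, z_w, z_x):
    """Split sum_m (w_int[m] - z_w) * (x_int[m] - z_x) into its four terms."""
    t1 = sum(w * x for w, x in zip(w_int, x_int))  # integer MACs, same as the symmetric case
    t2 = -z_w * sum(x_int)                         # depends on the input: computed at inference
    t3 = -z_x * sum(w_int)                         # weights and offsets only: precomputable
    t4 = len(w_int) * z_w * z_x                    # offsets only: precomputable
    return t1, t2, t3, t4


# Hypothetical 4-bit unsigned values for one row of weights and one input vector.
w_int = [3, 7, 1, 12]
x_int = [5, 0, 9, 15]

direct = sum((w - 4) * (x - 6) for w, x in zip(w_int, x_int))
terms = affine_dot_terms(w_int, x_int, z_w=4, z_x=6)
print(terms, sum(terms), direct)  # (204, -116, -138, 96) 46 46

# With symmetric weights (z_w = 0) the input-dependent second term disappears.
print(affine_dot_terms(w_int, x_int, z_w=0, z_x=6))  # (204, 0, -138, 0)
```

Only the first term needs the multiply-accumulate array in the symmetric-weight case; the remaining non-zero term can be added to the bias ahead of time.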
How Quantization Is Applied (Quantization Granularity)

Per Tensor

One entire tensor shares a single scale \(s\) and zero-point \(z\).

Effective bits per value: Assuming the scale and zero-point are stored in 16 bits each, the effective number of bits per value
