Zigzag Decoding with AVX-512
Zigzag decoding with AVX-512
17 Jun 2026
I’ve been working on speeding up AVX-512 vertex decoding in meshoptimizer recently; in the process I stumbled upon two optimizations that I did not end up using but I thought they might be fun to write about! The optimizations that actually made it in require some higher level background / explanations that will have to wait until another day :)
Both optimizations discussed here touch upon two new (for me) features of AVX-512 that I haven’t had a chance to experiment with until now, and both apply to the same problem: how to decode zigzag encoded integers.
Zigzag encoding
Suppose that you have a lot of signed integers to encode; and these integers come as a result of delta encoding, representing differences between input values. If the input values are reasonably close to each other, the resulting deltas are small in magnitude, but the sign of the deltas is fairly unpredictable.
We could encode the resulting deltas as is, in their two’s complement format - but that may be inconvenient if we want to encode the deltas using few bytes or bits. Ideally we’d like small deltas to be encoded with a number of bits that scales with their magnitude.
A zigzag encoding gives us that property, and it transforms values such that positive values v are stored as 2*v, and negative values are stored as 2*(~v)+1. So 0 encodes as 0, -1 encodes as 1, 1 encodes as 2, etc. - positive values become even values and negative values become odd values. The sign bit is effectively shifted to the least significant bit.
To decode this, we can directly decode either a positive or a negative value:
(v & 1) == 0 ? (v >> 1) : ~(v >> 1)
… or, use a common branchless decoding function:
(v >> 1) ^ -(v & 1)
The way branchless decoding works is that -(v & 1) is 0 when v is even, and a mask of all 1s when v is odd. Xor-ing with that mask keeps the magnitude (v >> 1) as is if the mask is 0, and inverts it if the mask is all 1s - which is equivalent to the ternary expression above.
The net effect is that the resulting value is a small positive value if the input was a small negative or positive value; we can then encode the result using any variable-width scheme of our choosing: common options include VByte (aka LEB128, which encodes unsigned integer values as 1-5 bytes) and other forms of bit-wise encoding.
SIMD decoding
The branchless decoding function maps in a very straightforward way to any SIMD instruction set of your choosing: assuming you somehow already produced unsigned value v in each lane of a SIMD vector, you can easily write a (width-dependent) function that decodes the original signed value.
For example, using SSE2, and assuming 32-bit integer inputs, we get:
__m128i xs = _mm_and_si128(v, _mm_set1_epi32(1));
__m128i xl = _mm_sub_epi32(_mm_setzero_si128(), xs);
__m128i xr = _mm_srli_epi32(v, 1);
return _mm_xor_si128(xl, xr);
This is a direct translation of the C code above, except that SSE2 doesn’t have an obvious first class way to negate integers, so you need to subtract from 0 instead.
This compiles exactly as you would expect (and also supports any integer width and can use wider AVX2 or AVX-512 vectors in an obvious way):
vpsrld xmm1, xmm0, 1 ; shift by 1
vpandq xmm0, xmm0, xmm2 ; and with 1 (preloaded into xmm2)
vpsubd xmm0, xmm3, xmm0 ; sub from 0 (preloaded into xmm3)
vpxorq xmm0, xmm0, xmm1 ; xor
Every instruction used here has 1 cycle of latency on Zen 4; note that vpandq does not depend on the results of vpsrld, so the cumulative latency of the 4 instructions is 3 cycles.
Masks
The decoding transformation here is quite simple: our branchless formulation is expressed using arithmetic, but really it’s just moving bits around and occasionally flipping them. This feels like something that a sufficiently advanced bit permutation engine in hardware can do easily, and certainly dedicated circuitry here would be much simpler than an adder.
But we make do with what we are given, and what we are given is AVX-512 - which mostly consists of unsurprising instructions (arithmetic, bitwise ops, etc.), and the obvious “surprising” instructions like vpternlogd (which allows creating custom bitwise functions that combine three bits into one using a truth table) do not seem to help here either.
There is, however, a different way we can go which is a little unusual if you’re used to earlier SIMD instruction sets: instead of using a branchless formulation and relying on various instructions that combine the results using bitwise ops, we can try to compile the branching variant directly.
Consider, again, our branch-based ternary expression:
(v & 1) == 0 ? (v >> 1) : ~(v >> 1)
To compute it, we need to produce the “magnitude” v >> 1, then check (v & 1) == 0 condition, and select between the magnitude or the inverted magnitude. In scalar code, this can be expressed as a branch; but because this computation is vectorized, we can’t emit a branch as the condition may vary per SIMD lane.
What we can use instead, though, is AVX-512 predication support, or execution masks. In pseudo-code, this could look like this:
sign = v & 1
mask = sign != 0
res = v >> 1 (predicated by mask)
res = ~res
Naively, this looks like the same 4 instructions: we compute the sign, compare it with 0, shift the original value by 1 to compute the result (for positive values), and invert the result conditionally only for lanes that computed sign as 1.
AVX-512 does not have a bitwise NOT operation, but it has a XOR operation that can be conditional, so we can XOR with -1 instead. This can be directly expressed in AVX-512 like so:
__m128i xs = _mm_and_si128(v, _mm_set1_epi32(1));
__mmask8 m = _mm_cmp_epi32_mask(xs, _mm_setzero_si128(), _MM_CMPINT_NE);
__m128i x = _mm_srli_epi32(v, 1);
return _mm_mask_xor_epi32(x, m, x, _mm_set1_epi32(-1));
Again, by itself this does not seem useful; however, given a sufficiently smart compiler, or a sufficiently deep knowledge of AVX-512 instructions, this can be optimized further.
Specifically, we are doing an and operation and using the result to compute the mask; but, AVX-512 includes a dedicated instruction for this, similar to a scalar test instruction from x86, that performs and, compares the result to 0 and sets the mask when it’s not 0: exactly what we need!
The code above, then, is equivalent to:
__mmask8 m = _mm_test_epi32_mask(v, _mm_set1_epi32(1));
__m128i x = _mm_srli_epi32(v, 1);
return _mm_mask_xor_epi32(x, m, x, _mm_set1_epi32(-1));
… and compiles to just three instructions (k1 here is a special AVX-512 mask register, equivalent to __mmaskN type):
vptestmd k1, xmm1, xmm3
vpsrld xmm0, xmm1, 1
vpxord xmm0{k1}, xmm0, xmm2
Neat! Instead of a register filled with 0, this requires a register filled with -1 in each component, but if this runs in a loop you can usually compute that once and use it for free.
One caveat is that this only works with 32-bit integers, because AVX-512 does not support masked 8 or 16-bit XOR! Fortunately, we need to conditionally invert the value, and instead of computing ~x we can compute -1 - x, which is a conditional subtract that AVX-512 does have for all integer widths:
__mmask8 m = _mm_test_epi16_mask(v, _mm_set1_epi16(1));
__m128i x = _mm_srli_epi16(v, 1);
return _mm_mask_sub_epi16(x, m, _mm_set1_epi16(-1), x);
Are masks faster?
We heroically reduced the instruction count by 1, but not all instructions are created equal. Is the result faster? The answer is, as all good answers are in cases like this, “it depends”.
First, if we look at Zen 4 instruction latency tables: we will note that while masked execution seems to be “free” (vpxor/vpsub with a mask has the same latency and throughput as the unmasked variants), vptestmd latency according to uops.info is 3 cycles; shift does not depend on the result, but the conditional xor/sub does, so our total latency here is 4 cycles - one cycle longer than the original version.
So in cases where your code is latency-bound, this is probably a bad idea. That said, it should be rare that the entire zigzag decoding is on a latency-critical path; even in typical code that takes the decoded value and accumulates it, the latency-critical path is the addition “spine”, and as long as you can read the inputs independently, an extra cycle of latency might not hurt.
In code that decodes many integers in parallel and is throughput bound, this might be faster. If we look at which ports these instructions get dispatched to on Zen 4, we will see something like this for XMM registers:
| RThroughput | FP0 | FP1 | FP2 | FP3 | |
|---|---|---|---|---|---|
vptestm* |
2 | x | x | ||
vpsrl* |
2 | x | x | ||
vpxor* |
4 | x | x | x | x |
vpand* |
4 | x | x | x | x |
vpsub* |
4 | x | x | x | x |
Each instruction here can be dispatched once per cycle per port it can get executed on, but “simpler” instructions can execute on any of the 4 ports, and the two more complex instructions can execute on any of the 2 - critically, vptestm and vpsrl execute on two different sets of ports!
Therefore if we do not have any latency constraints, it might be reasonable to expect that our old combination of vpsrld/vpand/vpsub/vpxor takes around 1 cycle for 4 instructions, whereas our new combination perfectly fills the gaps and might average ~0.75 cycles for 3 instructions.
Of course, this is purely theoretical and also depends on whether we have any other instructions; if the code running alongside zigzag decoding really needs FP0/FP1 this might increase the port pressure and make it slower.
Experimentally, for meshoptimizer:
In vertex codec, this is not very useful because the loop where this can be applied is bottlenecked by a combination of a latency “spine” that accumulates deltas and vector stores that overwhelm the store unit. The rest of the code has enough execution resources: even though the new code requires fewer of them, it does not change the performance.
In meshlet codec, part of the decoding process is an actual throughput-bottlenecked loop (yay!); and in that loop, using this decoding method does indeed improve decoding performance measurably (~3%). However, meshlet codec currently does not have an AVX-512 (or even an AVX2) path; and making the decoding loop wider to match the AVX-512 register width might have a more significant impact.
In addition, this code works fine when compiled with GCC, but Clang decides it is a smart compiler and removes our mask trick entirely! Which ends up producing the exact same code we started from:
vpsrld xmm1, xmm0, 1
vpandd xmm0, xmm0, xmm3
vpsubd xmm0, xmm2, xmm0
vpxor xmm0, xmm1, xmm0
This behavior is simultaneously entirely unhelpful and also typical of Clang: LLVM loves to canonicalize vector intrinsics into forms that are ostensibly easier for the compiler to reason about, and then applies generic optimizations to these that are often either ignoring or misunderstanding the actual hardware bottlenecks. This is just fine for code that starts as scalar code and is auto-vectorized, but is actively harmful in this and many other cases when explicit SIMD intrinsics are used; there are sometimes ways to work around this, e.g. in the meshlet codec the actual expression we need is unzigzag32(v) + 1, which can be restructured a little bit to replace conditional xor with a more complicated conditional sub that LLVM leaves alone.
For all of these reasons this optimization didn’t quite make the cut, although I might revisit this at some point. But there is another interesting optimization we can consider!
GF(2) affine transformations
I’ve already mentioned it above briefly but it bears repeating: unzigzag is a very boring transformation of the underlying bits. If the low bit is 0, we simply shift the entire value right by 1. If the low bit is 1, then the resulting highest bit is 1, and all other bits are just inverted bits of the original value shifted by 1. There must be some circuit that just does it for us!
Enter GFNI, or “Galois Field New Instructions”. This is another extended instruction set; it’s technically not even part of AVX-512: it has a separate feature bit in the CPUID mask and must be checked separately; because of that, it’s also not part of AVX10.2 instruction set. However, all AVX10.2 CPUs support it, all AMD CPUs that support AVX-512 support it, and even though some earlier Intel CPUs (like Cascade Lake) support AVX-512 without supporting GFNI, at this point the existence of consumer Intel CPUs that have AVX-512 is questionable so we might as well assume the best.
Definitely make sure to still check the CPUID bit, but my point is that relying on GFNI in addition to AVX-512 should not restrict our supported systems much in practice.
With that preamble out of the way, what do these instructions actually do? The most important instruction for our use case has a beautiful name vgf2p8affineqb and performs an 8x8 matrix multiplication in GF(2). This will start making sense in a moment.
Let’s take a byte, and represent it as an 8x1 vector x, with every vector component containing a single bit. And let’s take an 8x8 matrix, where each matrix element is also 0 or 1. We can then multiply the matrix A by the vector x using the usual arithmetic rules; then if we take the resulting integer vector and just take the least significant bit of the result, we will produce a new 8x1 vector, y (which can then be represented as a byte).
There are a couple different ways to think about this transformation. On one hand, this is just matrix multiplication, performed using modular arithmetic modulo 2 (by properties of modular arithmetic, it does not matter if you perform intermediate computations modulo 2 or just take the remainder of the final result). This is otherwise known as doing the matrix multiplication over GF(2), where GF(2) is a field that only defines two numbers (0 and 1), and defines arithmetic operations on them. This means the matrix multiplication behaves much as other matrix multiplications do.
Comments
No comments yet. Start the discussion.