Reddit - r/programming 1h ago

How 4 bytes of padding make array clearing 49% faster

I wrote about an interesting amd64-specific quirk. If a large array sits at an address that is 4-byte aligned but not 8-byte aligned, padding it onto an 8-byte boundary can make clearing it ~49% faster (at least on my Intel machine).

The Alignment Issue

On x86-64 processors, memory alignment can noticeably affect performance. When a large array is cleared with memset or a similar routine built on REP STOSQ, which stores 8 bytes per iteration, the instruction works most efficiently when the destination address is 8-byte aligned.

Consider this scenario:

An array starts at an address that is 4-byte aligned but not 8-byte aligned
Adding just 4 bytes of padding shifts the array to 8-byte alignment
The result: array clearing becomes nearly 50% faster
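The padding arithmetic in this scenario can be sketched in a few lines of C. This is a minimal illustration, not code from the post; padding_for and is_aligned are hypothetical helper names, and the addresses in the usage note are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of padding needed to round addr up to the next multiple of align.
   align must be a power of two. Hypothetical helper for illustration. */
static size_t padding_for(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* -addr mod align, computed with a mask because align is a power of two */
    return (size_t)(-addr & (uintptr_t)(align - 1));
}

/* Returns 1 when addr is a multiple of align, 0 otherwise. */
static int is_aligned(uintptr_t addr, size_t align)
{
    return padding_for(addr, align) == 0;
}
```

For an address such as 0x1004, which is 4-byte aligned but not 8-byte aligned, padding_for(0x1004, 8) yields the 4 bytes from the title, and 0x1004 + 4 = 0x1008 is 8-byte aligned.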

Performance Measurements

On my Intel machine, the difference was dramatic:

Array aligned to 4 bytes but not 8: baseline performance
8-byte aligned array: ~49% faster clearing time
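A comparison like this could be measured with a small harness along the following lines. It is a rough sketch, not the benchmark behind the numbers above: time_clear is a hypothetical name, it times the C library's memset rather than a raw REP STOSQ loop, and it assumes malloc returns memory that is at least 8-byte aligned (true on typical x86-64 platforms). Many memset implementations align the destination internally, so the gap may not reproduce with this code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Clears len bytes starting offset bytes into a fresh buffer, reps times
   (reps >= 1), and returns the elapsed CPU time in seconds.
   Returns -1.0 on allocation failure or if the buffer was not cleared. */
static double time_clear(size_t len, size_t offset, int reps)
{
    /* Calling through a volatile pointer keeps the compiler from
       dropping or merging the repeated memset calls. */
    void *(*volatile clear)(void *, int, size_t) = memset;

    unsigned char *base = malloc(len + offset);
    if (base == NULL)
        return -1.0;
    assert((uintptr_t)base % 8 == 0);   /* assumption stated above */

    unsigned char *dst = base + offset;
    memset(dst, 0xFF, len);             /* touch pages so faults are not timed */

    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        clear(dst, 0, len);
    clock_t end = clock();

    /* Verify the clear actually happened before trusting the timing. */
    int ok = 1;
    for (size_t i = 0; i < len; i++)
        if (dst[i] != 0)
            ok = 0;

    free(base);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_clear(len, 4, reps) and time_clear(len, 0, reps) with the same large len gives the misaligned and aligned cases. Results will vary with the CPU, the C library, and the buffer size.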

Technical Details

The performance gain comes from how Intel's REP STOSQ implementation handles alignment:

ERMS (Enhanced REP MOVSB/STOSB): Modern Intel processors include this feature for optimized string operations
Alignment penalties: A store that is not aligned to its own size can force the CPU to perform extra memory operations
Cache line boundaries: With 8-byte alignment, no 8-byte store straddles a 64-byte cache line boundary; at a 4-byte offset, one store in eight does
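The cache line point can be checked with simple arithmetic. The sketch below counts how many consecutive 8-byte stores would straddle a line boundary for a given start address. It is a simplified model that assumes 64-byte cache lines and one 8-byte store per step; it says nothing about how the CPU's microcode actually issues stores, and straddling_stores is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed line size, typical for x86-64 */

/* Counts how many of `count` consecutive 8-byte stores, beginning at
   `start`, touch two different cache lines. */
static size_t straddling_stores(uintptr_t start, size_t count)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        uintptr_t first = start + 8 * i;  /* first byte written */
        uintptr_t last = first + 7;       /* last byte written */
        if (first / CACHE_LINE != last / CACHE_LINE)
            n++;
    }
    return n;
}
```

Starting at offset 0 or 8, none of 64 stores crosses a line. Starting at offset 4, the stores at offsets 60, 124, 188, and so on each cross one, which is 8 of the 64.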

Additional Optimizations for Array Clearing

Beyond alignment, other techniques can improve array clearing performance:

Clear with the widest stores the architecture supports; optimized memset implementations typically do this already
Leverage compiler intrinsics like _mm256_setzero_si256() for SIMD operations
Consider calloc instead of malloc + memset for zero-initialized allocations
Use posix_memalign for guaranteed alignment when needed
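As a rough illustration of the last suggestion, the sketch below allocates a zeroed buffer with a guaranteed alignment using posix_memalign followed by memset. The wrapper name alloc_zeroed_aligned is hypothetical. posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void *).

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns len zeroed bytes aligned to `align`, or NULL on failure.
   Release the buffer with free(). Hypothetical wrapper for illustration. */
static void *alloc_zeroed_aligned(size_t align, size_t len)
{
    void *p = NULL;
    if (posix_memalign(&p, align, len) != 0)
        return NULL;                    /* invalid alignment or out of memory */
    assert((uintptr_t)p % align == 0);  /* the guarantee posix_memalign provides */
    memset(p, 0, len);
    return p;
}
```

When no particular alignment is needed, calloc(count, size) returns zeroed memory in a single call and lets the allocator skip the explicit memset where it can.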

The key takeaway: a small 4-byte padding adjustment can yield substantial performance improvements for memory-intensive operations on modern x86-64 processors.
