Trust your compiler: Modern C++
Fast inverse square root
The Q_rsqrt from Quake III, reproduced here:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) );
return y;
}
There are numerous articles explaining how this works, which we will skip here. In short, late 90s CPUs floating point units were nothing like today's. During the development of Quake 3 Arena, iD's developers discovered that a lot of time was being spent determining vertex normals used for their new lighting model. Much of this time was being spent in one operation: the inverse square root. Their approach was to forgo accuracy, and take a "estimate and improve" approach for this one operation. It could beat the naive 1.0f / sqrtf(x) by a wide margin: FSQRT and FDIV could take over 500 cycles in the 8087 FPU, while iD's method used cheap integer operations plus one Newton iteration.
Modern CPUs
Intel introduced rsqrtss and rsqrtps to the x86 family of architectures with SSE; AVX later added the vrsqrtss and vrsqrtps forms for wider SIMD widths. These instructions compute an approximate reciprocal square root directly, with a documented error bound and substantially lower latency than x87 fsqrt. ARMv8/AArch64 has comparable reciprocal-square-root estimate instructions:
| ISA | Scalar | Packed | Width |
|---|---|---|---|
| x86-64 SSE | rsqrtss |
rsqrtps |
4 x f32 |
| x86-64 AVX | vrsqrtss |
vrsqrtps |
8 x f32 |
| ARMv8 NEON | frsqrte (scalar) |
frsqrte |
4 x f32 |
Why this matters?
Writing this code in modern C++ might look something like:
constexpr float Q_rsqrt(float number) noexcept
{
static_assert(sizeof(float) == sizeof(std::uint32_t));
auto i = std::bit_cast<std::uint32_t>(number);
auto magic = 0x5f3759dfu - (i >> 1);
auto y = std::bit_cast<float>(magic);
return y * (1.5f - (number * 0.5f * y * y));
}
Compare that with the naive version:
constexpr float naive_rsqrt(float x) noexcept
{
return 1.0f / std::sqrt(x);
}
Now compare the relevant assembly produced with -std=c++23 -O3 -ffast-math -march=znver4 on Compiler Explorer using Clang 21.1.0. The benchmark run used -march=native; the assembly excerpt uses znver4 so the target is explicit. The -ffast-math part matters: it lets the compiler make aggressive, potentially lossy floating-point assumptions, so this example assumes positive, finite inputs and does not preserve every strict IEEE floating-point edge case.
Q_rsqrt(float):
movd eax, xmm0
sar eax
mov ecx, 1597463007
sub ecx, eax
mulss xmm0, dword ptr [rip + .LCPI0_0]
movd xmm1, ecx
movdqa xmm2, xmm1
mulss xmm2, xmm1
mulss xmm0, xmm2
addss xmm0, dword ptr [rip + .LCPI0_1]
mulss xmm0, xmm1
ret
naive_rsqrt(float):
vrsqrtss xmm1, xmm0, xmm0
vmulss xmm0, xmm0, xmm1
vfmadd213ss xmm0, xmm1, dword ptr [rip + .LCPI1_0]
vmulss xmm1, xmm1, dword ptr [rip + .LCPI1_1]
vmulss xmm0, xmm1, xmm0
ret
Benchmarking
This makes sense in theory, but is it actually true now?
| Benchmark | Time (scalar) | Time (n=1024) | Time (n=65536) |
|---|---|---|---|
| Q_sqrt | 380 ns | 24.5 ns | 1865 ns |
| naive_rsqrt | 373 ns | 25.0 ns | 2161 ns |
The scalar case is effectively a draw, with the obvious version slightly ahead. In the array kernels, the Quake version is somewhat faster in this particular run, but not in a way that justifies the trick as a default: the source is less clear, has a narrower useful domain, and is still only competitive because the compiler and CPU are already doing the hard work. Additionally, the naive way provides you with explicit bounds on error.
Take-away
Let the compiler do the work. You get comparable performance, better-defined code, and an implementation that says what it means.
Popcount and bit-twiddling
C++20 gave us std::popcount, std::countl_zero, std::countr_zero, std::bit_width, std::has_single_bit, and friends. On x86, when the relevant target features are enabled, many of these map to a single instruction:
| Function | x86-64 (BMI/POPCNT) | ARMv8 |
|---|---|---|
std::popcount |
popcnt |
cnt + addv |
std::countl_zero |
lzcnt |
clz |
std::countr_zero |
tzcnt |
rbit + clz |
Compare:
constexpr int modern(std::uint64_t x) noexcept
{
return std::popcount(x);
}
constexpr int kernighan(std::uint64_t x) noexcept
{
int c = 0;
while (x != 0U) {
x &= x - 1U;
++c;
}
return c;
}
constexpr int swar(std::uint64_t x) noexcept
{
x = x - ((x >> 1) & 0x5555'5555'5555'5555ULL);
x = (x & 0x3333'3333'3333'3333ULL) + ((x >> 2) & 0x3333'3333'3333'3333ULL);
x = (x + (x >> 4)) & 0x0f0f'0f0f'0f0f'0f0fULL;
return static_cast<int>((x * 0x0101'0101'0101'0101ULL) >> 56);
}
With -march=native and popcnt available, pop_modern can be one instruction. More interestingly, modern compilers may recognise the old tricks as popcount too: in this benchmark configuration, the Kernighan loop and SWAR version also collapse to popcnt. Without the target instruction, the distinction comes back: Kernighan is a data-dependent loop, while SWAR is a short sequence of shifts and masks. The standard spelling makes the intent explicit either way.
Before C++20 you could use __builtin_popcountll with GCC/Clang or __popcnt64 with MSVC. std::popcount is the portable spelling, and on targets without a hardware popcnt, the implementation can still provide an efficient software fallback.
Benchmarks
| Benchmark | Time |
|---|---|
| BM_popcount_modern | 94.5 ns |
| BM_popcount_kernighan | 94.5 ns |
| BM_popcount_swar | 94.5 ns |
Take-away
Use <bit>, unless you're targeting a freestanding embedded environment that doesn't ship it.
Numerical recipes and row pointers
Part of the inherited wisdom is that row-pointer access to a matrix is faster than index calculation. That was easier to believe when multiplication was more expensive, address-generation hardware was less capable, and compilers had less room to simplify indexing.
Numerical Recipes in C - the well-known red book by Press et al. - also helped popularise this style. Its dmatrix/nrutil helpers use row indirection so algorithms can be written with [i][j] addressing, preserving mathematical notation and making transcriptions from Fortran more direct. That can be a reasonable interface choice; it is not, by itself, a performance win.
For the interested reader, the benchmark code contains three implementations of a matrix class:
flat_matrix(access is done via multiplication, contiguous data)nr_matrix(access is done via row-pointer dereferencing, contiguous data)scattered_matrix(the same asnr_matrix, except the data is not stored contiguously: some junk is allocated between the rows)
nr_matrix and scattered_matrix define operator[] returning a float*. Both of them store an array of pointers - pointing to the row of a matrix, the operator[] then returns the indexed row pointer.
Benchmarking
We run two kernels over these matrices: a row-major sum and a column-major sum. The main result is not "multiplication is free" or "row pointers are always bad"; it is that layout and traversal order dominate. The contiguous flat and Numerical Recipes-style layouts are close to one another. The deliberately scattered layout falls apart, especially once the working set is large.
| Benchmark | Time (400 x 400) | Time (4000 x 4000) |
|---|---|---|
| flat_matrix row-major kernel | 6062 ns | 987614 ns |
| nr_matrix row-major kernel | 6065 ns | 987632 ns |
| scattered_matrix row-major kernel | 6171 ns | 2323394 ns |
| flat_matrix column-major kernel | 33148 ns | 11258036 ns |
| nr_matrix column-major kernel | 36418 ns | 11450827 ns |
| scattered_matrix column-major kernel | 39054 ns | 14800609 ns |
Take-away
Avoid indirection just because it looks clever or historically familiar. Prefer contiguous storage and traversal patterns that match it. If possible, avoid writing your own linear algebra kernels at all: Eigen, BLAS/LAPACK, GLM, or a vendor-tuned library will usually be a better starting point.
const& everywhere vs forwarding intent
Another habit many of us inherited is: "always pass by const&; copies are expensive". That is still a good rule when the function merely observes its argument:
double norm(std::vector<double> const& xs);
But it is the wrong abstraction for wrapper code. If a function's job is to pass arguments on to something else, const& destroys information. An rvalue becomes a const lvalue, move construction is disabled, overload resolution changes, and the caller's intent is lost.
Consider the old-school version of an emplace-style wrapper:
template <typename T>
class bag {
public:
template <typename... Args>
T& emplace(Args const&... args)
{
return xs_.emplace_back(args...);
}
private:
std::vector<T> xs_;
};
This looks harmless, but it is subtly pessimising. Even if the caller writes:
b.emplace(std::string(1024, 'x'));
inside emplace_bad, the argument is now a std::string const&. The vector cannot move from it. It has to copy.
The modern version preserves the caller's value category:
template <typename T>
class bag {
public:
template <typename... Args>
T& emplace(Args&&... args)
{
return xs_.emplace_back(std::forward<Args>(args)...);
}
private:
std::vector<T> xs_;
};
Now lvalues stay lvalues, rvalues stay rvalues, and overload resolution sees what the caller actually wrote.
The same distinction shows up throughout modern C++ APIs:
// observes: const& is fine
void draw(widget const& w);
// consumes/stores: pass by value can be excellent
void set_name(std::string name) { name_ = std::move(name); }
// forwards construction: forwarding references are the right tool
template <typename... Args>
auto make_widget(Args&&... args)
{
return widget(std::forward<Args>(args)...);
}
Benchmarks
| Benchmark | Time (n=64) | Time (n=1024) | Time (n=4096) |
|---|---|---|---|
Using const& |
12.2 ns | 18.6 ns | 72.1 ns |
| Using perfect forwarding | 6.55 ns | 8.74 ns | 31.0 ns |
We can see that perfect forwarding beats const ref arguments in this example. This is because we can elide a copy, when emplacing the string on the internal vector.
Take-away
This is not to say "never use const&". It is more precise:
- use
T const&when you observe an existing object; - use
Tby value when you need your own copy and move from it internally; - use
T&&for explicit sinks; - use forwarding references for generic forwarding wrappers.
Old C++ had fewer ways to say what you meant, so const& became a universal reflex. Modern C++ gives us better vocabulary. Use the type system to preserve intent.
Part 2: The library
The modern C++ standard library has improved substantially. Most of us have met engineers who work somewhere the STL is discouraged, wrapped, or forbidden outright. It may be time to reconsider: standard algorithms and containers are more readable, better tested, and increasingly better optimised.
Ranges and algorithms
Consider a simple task: Given raw voltage samples, calibrate to millivolts, compute residual from a target, square it, apply a weight, and sum the cost.
inline double raw_loop(std::span<double> xs) noexcept
{
double sum = 0.0;
for (double volts : xs) {
auto mv = calibrated_mv(volts);
auto err = residual(mv);
sum += weighted_square(err);
}
return sum;
}
inline double algorithm_call(std::span<double> xs) noexcept
{
return std::accumulate(
xs.begin(), xs.end(), 0.0,
[](double acc, double volts) {
auto mv = calibrated_mv(volts);
auto err = residual(mv);
return weighted_square(err) + acc;
});
}
inline double ranges_pipeline(std::span<double> xs) noexcept
{
auto costs = xs
| std::views::transform(calibrated_mv)
| std::views::transform(residual)
| std::views::transform(weighted_square);
return std::ranges::fold_left(costs, 0.0, std::plus<>{});
}
This is deliberately small. It is also deliberately a good case for the optimiser: the transforms are simple, visible, and inlinable; std::ranges::fold_left is a left fold over the view pipeline - equivalent to f(f(f(f(init, x1), x2), ...), xn). In that setting, the abstraction largely disappears.
Benchmarks
| Benchmark | Time (n=1024) | Time (n=65536) |
|---|---|---|
| Raw loop | 36.0 ns | 2400 ns |
Using std::accumulate |
36.0 ns | 2395 ns |
Using std::ranges::fold_left |
37.9 ns | 2417 ns |
The benchmarks tell the same story: all three versions are essentially equivalent. The ranges pipeline is a little behind in the small case and within noise at the larger size. That is the point: the more expressive version is not paying a dramatic abstraction tax here.
Take-away
Use modern library algorithms. They may not be slower, and they make your intent clearer.
Comments
No comments yet. Start the discussion.