Hacker News

Trust your compiler: Modern C++

Fast inverse square root

The Q_rsqrt from Quake III, reproduced here:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );

    return y;
}

There are numerous articles explaining how this works, which we will skip here. In short, late 90s CPUs floating point units were nothing like today's. During the development of Quake 3 Arena, iD's developers discovered that a lot of time was being spent determining vertex normals used for their new lighting model. Much of this time was being spent in one operation: the inverse square root. Their approach was to forgo accuracy, and take a "estimate and improve" approach for this one operation. It could beat the naive 1.0f / sqrtf(x) by a wide margin: FSQRT and FDIV could take over 500 cycles in the 8087 FPU, while iD's method used cheap integer operations plus one Newton iteration.

Modern CPUs

Intel introduced rsqrtss and rsqrtps to the x86 family of architectures with SSE; AVX later added the vrsqrtss and vrsqrtps forms for wider SIMD widths. These instructions compute an approximate reciprocal square root directly, with a documented error bound and substantially lower latency than x87 fsqrt. ARMv8/AArch64 has comparable reciprocal-square-root estimate instructions:

ISA Scalar Packed Width
x86-64 SSE rsqrtss rsqrtps 4 x f32
x86-64 AVX vrsqrtss vrsqrtps 8 x f32
ARMv8 NEON frsqrte (scalar) frsqrte 4 x f32

Why this matters?

Writing this code in modern C++ might look something like:

constexpr float Q_rsqrt(float number) noexcept
{
    static_assert(sizeof(float) == sizeof(std::uint32_t));
    auto i     = std::bit_cast<std::uint32_t>(number);
    auto magic = 0x5f3759dfu - (i >> 1);
    auto y     = std::bit_cast<float>(magic);
    return y * (1.5f - (number * 0.5f * y * y));
}

Compare that with the naive version:

constexpr float naive_rsqrt(float x) noexcept
{
    return 1.0f / std::sqrt(x);
}

Now compare the relevant assembly produced with -std=c++23 -O3 -ffast-math -march=znver4 on Compiler Explorer using Clang 21.1.0. The benchmark run used -march=native; the assembly excerpt uses znver4 so the target is explicit. The -ffast-math part matters: it lets the compiler make aggressive, potentially lossy floating-point assumptions, so this example assumes positive, finite inputs and does not preserve every strict IEEE floating-point edge case.

Q_rsqrt(float):
        movd    eax, xmm0
        sar     eax
        mov     ecx, 1597463007
        sub     ecx, eax
        mulss   xmm0, dword ptr [rip + .LCPI0_0]
        movd    xmm1, ecx
        movdqa  xmm2, xmm1
        mulss   xmm2, xmm1
        mulss   xmm0, xmm2
        addss   xmm0, dword ptr [rip + .LCPI0_1]
        mulss   xmm0, xmm1
        ret

naive_rsqrt(float):
        vrsqrtss        xmm1, xmm0, xmm0
        vmulss  xmm0, xmm0, xmm1
        vfmadd213ss     xmm0, xmm1, dword ptr [rip + .LCPI1_0]
        vmulss  xmm1, xmm1, dword ptr [rip + .LCPI1_1]
        vmulss  xmm0, xmm1, xmm0
        ret

Benchmarking

This makes sense in theory, but is it actually true now?

Benchmark Time (scalar) Time (n=1024) Time (n=65536)
Q_sqrt 380 ns 24.5 ns 1865 ns
naive_rsqrt 373 ns 25.0 ns 2161 ns

The scalar case is effectively a draw, with the obvious version slightly ahead. In the array kernels, the Quake version is somewhat faster in this particular run, but not in a way that justifies the trick as a default: the source is less clear, has a narrower useful domain, and is still only competitive because the compiler and CPU are already doing the hard work. Additionally, the naive way provides you with explicit bounds on error.

Take-away

Let the compiler do the work. You get comparable performance, better-defined code, and an implementation that says what it means.

Popcount and bit-twiddling

C++20 gave us std::popcount, std::countl_zero, std::countr_zero, std::bit_width, std::has_single_bit, and friends. On x86, when the relevant target features are enabled, many of these map to a single instruction:

Function x86-64 (BMI/POPCNT) ARMv8
std::popcount popcnt cnt + addv
std::countl_zero lzcnt clz
std::countr_zero tzcnt rbit + clz

Compare:

constexpr int modern(std::uint64_t x) noexcept
{
    return std::popcount(x);
}

constexpr int kernighan(std::uint64_t x) noexcept
{
    int c = 0;
    while (x != 0U) {
        x &= x - 1U;
        ++c;
    }
    return c;
}

constexpr int swar(std::uint64_t x) noexcept
{
    x = x - ((x >> 1) & 0x5555'5555'5555'5555ULL);
    x = (x & 0x3333'3333'3333'3333ULL) + ((x >> 2) & 0x3333'3333'3333'3333ULL);
    x = (x + (x >> 4)) & 0x0f0f'0f0f'0f0f'0f0fULL;
    return static_cast<int>((x * 0x0101'0101'0101'0101ULL) >> 56);
}

With -march=native and popcnt available, pop_modern can be one instruction. More interestingly, modern compilers may recognise the old tricks as popcount too: in this benchmark configuration, the Kernighan loop and SWAR version also collapse to popcnt. Without the target instruction, the distinction comes back: Kernighan is a data-dependent loop, while SWAR is a short sequence of shifts and masks. The standard spelling makes the intent explicit either way.

Before C++20 you could use __builtin_popcountll with GCC/Clang or __popcnt64 with MSVC. std::popcount is the portable spelling, and on targets without a hardware popcnt, the implementation can still provide an efficient software fallback.

Benchmarks

Benchmark Time
BM_popcount_modern 94.5 ns
BM_popcount_kernighan 94.5 ns
BM_popcount_swar 94.5 ns

Take-away

Use <bit>, unless you're targeting a freestanding embedded environment that doesn't ship it.

Numerical recipes and row pointers

Part of the inherited wisdom is that row-pointer access to a matrix is faster than index calculation. That was easier to believe when multiplication was more expensive, address-generation hardware was less capable, and compilers had less room to simplify indexing.

Numerical Recipes in C - the well-known red book by Press et al. - also helped popularise this style. Its dmatrix/nrutil helpers use row indirection so algorithms can be written with [i][j] addressing, preserving mathematical notation and making transcriptions from Fortran more direct. That can be a reasonable interface choice; it is not, by itself, a performance win.

For the interested reader, the benchmark code contains three implementations of a matrix class:

  • flat_matrix (access is done via multiplication, contiguous data)
  • nr_matrix (access is done via row-pointer dereferencing, contiguous data)
  • scattered_matrix (the same as nr_matrix, except the data is not stored contiguously: some junk is allocated between the rows)

nr_matrix and scattered_matrix define operator[] returning a float*. Both of them store an array of pointers - pointing to the row of a matrix, the operator[] then returns the indexed row pointer.

Benchmarking

We run two kernels over these matrices: a row-major sum and a column-major sum. The main result is not "multiplication is free" or "row pointers are always bad"; it is that layout and traversal order dominate. The contiguous flat and Numerical Recipes-style layouts are close to one another. The deliberately scattered layout falls apart, especially once the working set is large.

Benchmark Time (400 x 400) Time (4000 x 4000)
flat_matrix row-major kernel 6062 ns 987614 ns
nr_matrix row-major kernel 6065 ns 987632 ns
scattered_matrix row-major kernel 6171 ns 2323394 ns
flat_matrix column-major kernel 33148 ns 11258036 ns
nr_matrix column-major kernel 36418 ns 11450827 ns
scattered_matrix column-major kernel 39054 ns 14800609 ns

Take-away

Avoid indirection just because it looks clever or historically familiar. Prefer contiguous storage and traversal patterns that match it. If possible, avoid writing your own linear algebra kernels at all: Eigen, BLAS/LAPACK, GLM, or a vendor-tuned library will usually be a better starting point.

const& everywhere vs forwarding intent

Another habit many of us inherited is: "always pass by const&; copies are expensive". That is still a good rule when the function merely observes its argument:

double norm(std::vector<double> const& xs);

But it is the wrong abstraction for wrapper code. If a function's job is to pass arguments on to something else, const& destroys information. An rvalue becomes a const lvalue, move construction is disabled, overload resolution changes, and the caller's intent is lost.

Consider the old-school version of an emplace-style wrapper:

template <typename T>
class bag {
public:
    template <typename... Args>
    T& emplace(Args const&... args)
    {
        return xs_.emplace_back(args...);
    }

private:
    std::vector<T> xs_;
};

This looks harmless, but it is subtly pessimising. Even if the caller writes:

b.emplace(std::string(1024, 'x'));

inside emplace_bad, the argument is now a std::string const&. The vector cannot move from it. It has to copy.

The modern version preserves the caller's value category:

template <typename T>
class bag {
public:
    template <typename... Args>
    T& emplace(Args&&... args)
    {
        return xs_.emplace_back(std::forward<Args>(args)...);
    }

private:
    std::vector<T> xs_;
};

Now lvalues stay lvalues, rvalues stay rvalues, and overload resolution sees what the caller actually wrote.

The same distinction shows up throughout modern C++ APIs:

// observes: const& is fine
void draw(widget const& w);

// consumes/stores: pass by value can be excellent
void set_name(std::string name) { name_ = std::move(name); }

// forwards construction: forwarding references are the right tool
template <typename... Args>
auto make_widget(Args&&... args)
{
    return widget(std::forward<Args>(args)...);
}

Benchmarks

Benchmark Time (n=64) Time (n=1024) Time (n=4096)
Using const& 12.2 ns 18.6 ns 72.1 ns
Using perfect forwarding 6.55 ns 8.74 ns 31.0 ns

We can see that perfect forwarding beats const ref arguments in this example. This is because we can elide a copy, when emplacing the string on the internal vector.

Take-away

This is not to say "never use const&". It is more precise:

  • use T const& when you observe an existing object;
  • use T by value when you need your own copy and move from it internally;
  • use T&& for explicit sinks;
  • use forwarding references for generic forwarding wrappers.

Old C++ had fewer ways to say what you meant, so const& became a universal reflex. Modern C++ gives us better vocabulary. Use the type system to preserve intent.

Part 2: The library

The modern C++ standard library has improved substantially. Most of us have met engineers who work somewhere the STL is discouraged, wrapped, or forbidden outright. It may be time to reconsider: standard algorithms and containers are more readable, better tested, and increasingly better optimised.

Ranges and algorithms

Consider a simple task: Given raw voltage samples, calibrate to millivolts, compute residual from a target, square it, apply a weight, and sum the cost.

inline double raw_loop(std::span<double> xs) noexcept
{
    double sum = 0.0;
    for (double volts : xs) {
        auto mv  = calibrated_mv(volts);
        auto err = residual(mv);
        sum += weighted_square(err);
    }
    return sum;
}

inline double algorithm_call(std::span<double> xs) noexcept
{
    return std::accumulate(
        xs.begin(), xs.end(), 0.0,
        [](double acc, double volts) {
            auto mv  = calibrated_mv(volts);
            auto err = residual(mv);
            return weighted_square(err) + acc;
        });
}

inline double ranges_pipeline(std::span<double> xs) noexcept
{
    auto costs = xs
        | std::views::transform(calibrated_mv)
        | std::views::transform(residual)
        | std::views::transform(weighted_square);
    return std::ranges::fold_left(costs, 0.0, std::plus<>{});
}

This is deliberately small. It is also deliberately a good case for the optimiser: the transforms are simple, visible, and inlinable; std::ranges::fold_left is a left fold over the view pipeline - equivalent to f(f(f(f(init, x1), x2), ...), xn). In that setting, the abstraction largely disappears.

Benchmarks

Benchmark Time (n=1024) Time (n=65536)
Raw loop 36.0 ns 2400 ns
Using std::accumulate 36.0 ns 2395 ns
Using std::ranges::fold_left 37.9 ns 2417 ns

The benchmarks tell the same story: all three versions are essentially equivalent. The ranges pipeline is a little behind in the small case and within noise at the larger size. That is the point: the more expressive version is not paying a dramatic abstraction tax here.

Take-away

Use modern library algorithms. They may not be slower, and they make your intent clearer.

Comments

No comments yet. Start the discussion.