DEV Community 4h ago

C++ Crash Pattern S5 - Race‑Condition Crashes: How to Diagnose and Fix Them

What Is a Race‑Condition Crash?

A race‑condition crash occurs when:

multiple threads access shared state
at least one access is a write
the accesses are not synchronized
the outcome depends on timing

The crash is not tied to a specific line of code. It is tied to interleavings - the order in which threads execute. S5 crashes are timing failures: the code is correct in isolation, but incorrect when interleavings change.

What Race‑Condition Crashes Look Like

Race‑condition crashes have a distinctive signature:

Crash location moves - The crash may appear in different functions on different runs.
Crash disappears under debugging - Breakpoints, logging, or sanitizers change timing and hide the bug.
Crash frequency depends on load - More threads → more failures. Single‑threaded mode → no failures.
Backtrace may look valid or corrupted - Sometimes clean, sometimes garbage - depends on the interleaving.
Reproduction is difficult - We may need stress tests, loops, or special timing to trigger it.
The crashing line is rarely the root cause - The defect is almost always upstream, in a missing lock or incorrect ownership rule.

Why S5 Nondeterminism Is Different from S2/S3

Race‑condition crashes are nondeterministic because the failure depends on timing, not on corrupted memory. This is different from S2 and S3: heap and stack corruption also appear nondeterministic, but for a different reason - the program state is already broken, so the crash location moves as corrupted data flows through the system. In S5, the program is correct in isolation, but incorrect when two threads interleave in the wrong order - timing matters. The nondeterminism comes from the scheduler, not from memory corruption.

Likely Patterns - Root Causes

Race‑condition crashes come from a small set of mechanisms:

Unsynchronized read/write access - One thread writes while another reads.
Double‑delete or premature delete - One thread destroys an object while another still uses it.
Incorrect use of atomics - Atomics fix visibility, not invariants. We can still race on multi‑field state.
Missing or incorrect locking - Lock not taken, taken too late, or taken in the wrong order.
Data structures not designed for concurrency - Vectors, maps, lists, and custom objects are not thread‑safe by default.
Races inside callbacks - Callbacks fire on different threads and access shared state.
Races in lifetime management - Weak pointer promoted too late, shared pointer destroyed too early.
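
The point about atomics and multi‑field state can be sketched as follows. This is a minimal illustration, not code from the examples in this article: the `PairedState` class, its "a equals b" invariant, and the helper function are all hypothetical. If `a_` and `b_` were two separate `std::atomic<int>` fields, each load and store would be race‑free, yet a reader could still load `a_` after an update and `b_` before it. One mutex over both fields keeps the pair consistent.

```cpp
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical two-field state with the invariant a_ == b_.
class PairedState {
public:
    // Update both fields under one lock so no reader sees a half-updated pair.
    void set(int value) {
        std::lock_guard<std::mutex> lock(m_);
        a_ = value;
        b_ = value;
    }

    // Return a consistent snapshot of both fields.
    std::pair<int, int> get() const {
        std::lock_guard<std::mutex> lock(m_);
        return {a_, b_};
    }

private:
    mutable std::mutex m_;
    int a_ = 0;
    int b_ = 0;
};

// Run a writer and a reader against the same object. Returns true if every
// snapshot the reader observed satisfied the invariant.
bool invariant_holds_under_contention(int iterations) {
    PairedState state;
    bool ok = true;  // written only by the reader thread, read after join()
    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) state.set(i);
    });
    std::thread reader([&] {
        for (int i = 0; i < iterations; ++i) {
            auto [a, b] = state.get();
            if (a != b) ok = false;
        }
    });
    writer.join();
    reader.join();
    return ok;
}
```

The design choice here is that the lock protects the invariant, not the individual variables: every function that reads or writes either field takes the same mutex.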

Diagnostic Techniques

Debugging S5 means debugging timing, not memory.

Reproduce under stress - Increase thread count, reduce delays, run loops, or use stress harnesses.
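Use TSAN (ThreadSanitizer) - TSAN is the single most effective tool for detecting data races, but it only reports races on code paths that actually run, so pair it with the stress reproduction above.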
Add temporary logging - But be aware: logging changes timing and may hide the bug.
Look for shared state - Any object accessed by multiple threads is suspicious.
Check lifetime boundaries - Who owns the object? Who destroys it? Is destruction synchronized?
Examine invariants - Multi‑field invariants require locks, not atomics.
Reproduce with forced scheduling - Pin threads, add artificial delays, or use deterministic schedulers.
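
A stress harness for the "reproduce under stress" step could look roughly like this. It is a sketch under stated assumptions: `run_stress` and its parameters are hypothetical names, and the start flag is just one simple way to make all threads begin contending at the same moment.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: run `body` on `thread_count` threads, all released at
// the same moment, `iterations` times per thread.
void run_stress(int thread_count, int iterations,
                const std::function<void(int)>& body) {
    std::atomic<bool> go{false};
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&, t] {
            // Wait until every thread has been created, so they all start
            // contending at once instead of running one after another.
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            for (int i = 0; i < iterations; ++i) body(t);
        });
    }
    go.store(true, std::memory_order_release);
    for (auto& th : threads) th.join();
}
```

We would pass the suspicious operation as `body`, for example `run_stress(8, 100000, [&](int) { shared_object.update(); })` where `shared_object` stands for whatever state is under suspicion, and run the resulting binary under TSAN.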

Remediation Steps

Fixing S5 means strengthening synchronization and ownership rules.

Add locks around shared state - Mutexes, shared_mutex, or custom guards.
Use message‑passing instead of shared state - Push work to the owning thread.
Strengthen lifetime management - Use shared_ptr/weak_ptr carefully. Destroy objects only when no thread can access them.
Avoid “lock‑free” unless we truly need it - Lock‑free code is extremely hard to get right.
Use atomics only for simple state - Atomics do not protect invariants across multiple fields.
Make thread ownership explicit - Document which thread owns which object.
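
The message‑passing option can be sketched like this. The `Accumulator` class and its members are hypothetical names chosen for illustration. Other threads never touch the running total directly; they enqueue a request, and only the owning thread applies it. The queue itself is still shared and still needs a lock, but the state we care about has exactly one owner.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical owner thread: only worker_ ever modifies total_.
class Accumulator {
public:
    Accumulator() : worker_([this] { run(); }) {}

    ~Accumulator() { stop(); }

    // Called from any thread: enqueue a request instead of mutating state.
    void add(int value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            inbox_.push(value);
        }
        cv_.notify_one();
    }

    // Let the owner drain pending messages, stop it, and return the total.
    int stop() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        if (worker_.joinable()) worker_.join();
        return total_;  // safe to read: the owner thread has exited
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !inbox_.empty(); });
            while (!inbox_.empty()) {
                total_ += inbox_.front();  // applied only by the owner
                inbox_.pop();
            }
            if (done_) return;
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> inbox_;
    bool done_ = false;
    int total_ = 0;
    std::thread worker_;  // declared last so the members above exist first
};
```

Declaring `worker_` last matters: members are initialized in declaration order, so the mutex and queue are ready before the owner thread starts running.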

Example 1 - Unsynchronized Access to Shared State

A classic race: two threads modify a shared vector.

```cpp
std::vector<int> data;

void writer() {
    for (int i = 0; i < 1000; ++i) {
        data.push_back(i);
    }
}

void reader() {
    for (int i = 0; i < 1000; ++i) {
        int x = data[i]; // sometimes valid, sometimes crash
    }
}
```

Symptom: Sometimes works. Sometimes SIGSEGV. Sometimes an out‑of‑range read, because the reader can run ahead of the writer and operator[] does no bounds checking. Sometimes corrupted data.

Diagnostic Path:

Reproduce under stress → inconsistent results → timing issue.
Identify shared state → data accessed by two threads.
Check for synchronization → none; vector is not thread‑safe.
Confirm with TSAN (optional) → write/read race on data.

This leads directly to the root cause: unsynchronized access to a non‑thread‑safe container.

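Root Cause: std::vector is not thread‑safe. A concurrent push_back can reallocate the buffer and free the old storage while the reader is still indexing into it, and the reader can also index past the current size.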

Fix:

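```cpp
#include <mutex>
#include <vector>

std::vector<int> data;
std::mutex m;  // guards every access to data

void writer() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        data.push_back(i);
    }
}

void reader() {
    for (std::size_t i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        if (i >= data.size()) break;  // writer has not produced this element yet
        int x = data[i];
    }
}
```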
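
When reads greatly outnumber writes, the same idea can be expressed with a reader/writer lock. The following is a sketch, assuming C++17; `values`, `values_mutex`, `append`, and `read_at` are hypothetical names rather than part of the example above. Writers take the lock exclusively, readers share it, and the bounds check happens under the same lock as the read.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <vector>

std::vector<int> values;
std::shared_mutex values_mutex;  // guards every access to values

// Writers take the lock exclusively: no reader can run during push_back.
void append(int value) {
    std::unique_lock<std::shared_mutex> lock(values_mutex);
    values.push_back(value);
}

// Readers share the lock with each other but exclude writers.
// Returns std::nullopt if the element does not exist yet.
std::optional<int> read_at(std::size_t index) {
    std::shared_lock<std::shared_mutex> lock(values_mutex);
    if (index >= values.size()) return std::nullopt;
    return values[index];
}
```

Returning a copy of the element, not a reference or iterator, is deliberate: a reference would outlive the lock and could dangle after the next reallocation.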

Example 2 - Lifetime Race (Use‑After‑Free)

A worker thread uses an object after another thread destroys it.

```cpp
struct Job {
    void run() { /* ... */ }
};

Job* job = new Job();

void worker() {
    job->run(); // sometimes valid, sometimes UAF
}

void cleanup() {
    delete job; // races with worker
}
```

Symptom: Crash location moves. Sometimes SIGSEGV. Sometimes SIGABRT. Sometimes no crash.

Diagnostic Path:

Observe nondeterminism → suggests lifetime race.
Check ownership → job shared by worker + cleanup.
Check destruction timing → delete job may run while worker is active.
Force scheduling → adding sleeps changes behavior → confirms timing race.
TSAN (optional) → reports race between delete and run.

Root Cause: Lifetime is not synchronized. job is destroyed while worker still uses it.

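Fix: Use shared_ptr so that each thread holds its own reference, and synchronize access to the shared pointer variable itself. Only the reference count is thread‑safe: copying job in one thread while another thread calls reset() on the same shared_ptr object is still a data race.

```cpp
std::mutex job_mutex;  // guards the shared job variable itself
std::shared_ptr<Job> job = std::make_shared<Job>();

void worker() {
    std::shared_ptr<Job> j;
    {
        std::lock_guard<std::mutex> lock(job_mutex);
        j = job;  // take a local reference under the lock
    }
    if (j) j->run();  // the local copy keeps the Job alive
}

void cleanup() {
    std::lock_guard<std::mutex> lock(job_mutex);
    job.reset();  // Job is destroyed once the last copy goes away
}
```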
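
For threads that should observe an object without keeping it alive, weak_ptr promotion is the usual tool. The sketch below is illustrative only: `run_if_alive` is a hypothetical helper, and `Job::run` returns an int here purely so the result can be checked. The key step is that `lock()` either yields a live shared_ptr or an empty one, so there is no window between "check" and "use".

```cpp
#include <memory>

struct Job {
    int run() { return 42; }  // hypothetical return value, for illustration
};

// Hypothetical observer: a weak_ptr never extends the Job's lifetime on its
// own, but lets us safely find out whether the Job still exists.
int run_if_alive(const std::weak_ptr<Job>& observer) {
    // lock() atomically returns either a live shared_ptr or an empty one.
    if (std::shared_ptr<Job> j = observer.lock()) {
        return j->run();  // j keeps the Job alive for the whole call
    }
    return -1;  // the Job was already destroyed
}
```

Each observing thread should hold its own weak_ptr copy, made while the owner is known to be alive. Promoting first and using the promoted pointer for the entire operation avoids the "promoted too late" variant of this race.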

Example 3 - Using TSAN to Diagnose a Race Condition

This example shows a real race condition that does not crash reliably, but TSAN catches immediately. It demonstrates how to use the tool and how to interpret its output.

```cpp
#include <thread>
#include <iostream>

int counter = 0;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter++; // unsynchronized write
    }
}

int main() {
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    std::cout << "counter = " << counter << "\n";
}
```

This program usually prints something close to 200000, but:

sometimes prints a smaller number (lost updates)
sometimes works perfectly
in principle may print a corrupted value or crash, because a data race is undefined behavior in C++

This is classic S5 behavior.

Symptom: Nondeterministic output. Crash appears only under load. Crash disappears when adding logging. Crash location moves. Debugger hides the bug. This is the signature of a race‑condition crash.

Diagnostic Path:

Reproduce under stress - Running the program in a loop:
```
for i in {1..1000}; do ./a.out; done
```
It produces inconsistent results.

Try TSAN - Compile with Thread Sanitizer:

```
clang++ -fsanitize=thread -g -O1 main.cpp -o tsan_test
```

Run it:

```
./tsan_test
```

TSAN immediately reports the race - TSAN output (simplified):

```
WARNING: ThreadSanitizer: data race
Write of size 4 at counter by thread T1
  #0 worker main.cpp:8
Previous write of size 4 at counter by thread T2
  #0 worker main.cpp:8
Location is global 'counter' at main.cpp:4
```

TSAN tells us:

what is racing (counter)
where the race happens (line 8, the counter++ statement)
which threads are involved (T1 and T2)
what type of access (write/write)

This is the fastest way to diagnose S5.

Root Cause: counter is shared mutable state accessed by multiple threads without synchronization. Even though int is small, incrementing it is not atomic: load → add → store. Two threads interleave these steps unpredictably.
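
The load → add → store decomposition can be replayed by hand. The sketch below is not multithreaded: it executes one unlucky interleaving of two increments step by step on a single thread, so the outcome is deterministic. The function and variable names are hypothetical.

```cpp
// Replays one unlucky interleaving of two `counter++` operations by hand.
int lost_update_demo() {
    int counter = 0;

    int t1_tmp = counter;   // thread 1: load  (reads 0)
    int t2_tmp = counter;   // thread 2: load  (also reads 0)
    t1_tmp = t1_tmp + 1;    // thread 1: add
    t2_tmp = t2_tmp + 1;    // thread 2: add
    counter = t1_tmp;       // thread 1: store (counter becomes 1)
    counter = t2_tmp;       // thread 2: store (counter stays 1, one update lost)

    return counter;         // 1, not 2
}
```

Two increments ran, but the counter advanced by one. In the real program the scheduler picks the interleaving, which is why the final value varies from run to run.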

Fix: Use a mutex:

```cpp
std::mutex m;
int counter = 0;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        counter++;
    }
}
```

Or use an atomic:

```cpp
std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}
```

After either fix, TSAN reports no races, and the program prints counter = 200000 on every run.

When It’s Not This Pattern

S5 is not the correct pattern when:

Crash is deterministic → S1 or S4
Backtrace is corrupted → S3
Crash location is stable → S1
Crash disappears only under sanitizers → S2
Crash happens on wrong thread → S4

S5 is specifically about timing‑dependent failures.

Summary

Race‑condition crashes happen when multiple threads access shared state without proper synchronization. The failure depends on timing, not code correctness. The crash location moves, disappears under debugging, and reappears under load.

The signature is consistent:

nondeterministic
timing‑dependent
moving crash location
sometimes clean, sometimes corrupted backtrace
disappears under debugging

The only thing wrong is the interleaving.

Takeaways

S5 is timing‑dependent - the crash depends on interleavings.
Crash location moves - the crashing line is rarely the bug.
Debugging changes timing - hiding the failure.
TSAN is our best friend - use it early.
Locks or message‑passing fix most races.
Lifetime must be synchronized - destruction races are common.
