C++ Crash Pattern S5 - Race‑Condition Crashes: How to Diagnose and Fix Them
What Is a Race‑Condition Crash?
A race‑condition crash occurs when:
- multiple threads access shared state
- at least one access is a write
- the accesses are not synchronized
- the outcome depends on timing
The crash is not tied to a specific line of code. It is tied to interleavings - the order in which threads execute. S5 crashes are timing failures: the code is correct in isolation, but incorrect when interleavings change.
What Race‑Condition Crashes Look Like
Race‑condition crashes have a distinctive signature:
- Crash location moves - The crash may appear in different functions on different runs.
- Crash disappears under debugging - Breakpoints, logging, or sanitizer instrumentation change timing and can hide the crash (TSAN can still report the race even when nothing crashes).
- Crash frequency depends on load - More threads → more failures. Single‑threaded mode → no failures.
- Backtrace may look valid or corrupted - Sometimes clean, sometimes garbage - depends on the interleaving.
- Reproduction is difficult - We may need stress tests, loops, or special timing to trigger it.
- The crashing line is rarely the root cause - The defect is almost always upstream, in a missing lock or incorrect ownership rule.
Why S5 Nondeterminism Is Different from S2/S3
Race‑condition crashes are nondeterministic because the failure depends on timing, not on corrupted memory. This is different from S2 and S3: heap and stack corruption also appear nondeterministic, but for a different reason - the program state is already broken, so the crash location moves as corrupted data flows through the system. In S5, the program is correct in isolation, but incorrect when two threads interleave in the wrong order - timing matters. The nondeterminism comes from the scheduler, not from memory corruption.
Likely Patterns - Root Causes
Race‑condition crashes come from a small set of mechanisms:
- Unsynchronized read/write access - One thread writes while another reads.
- Double‑delete or premature delete - One thread destroys an object while another still uses it.
- Incorrect use of atomics - Atomics make operations on a single variable indivisible and visible to other threads; they do not protect invariants. We can still race on multi‑field state.
- Missing or incorrect locking - Lock not taken, taken too late, or taken in the wrong order.
- Data structures not designed for concurrency - Vectors, maps, lists, and custom objects are not thread‑safe by default.
- Races inside callbacks - Callbacks fire on different threads and access shared state.
- Races in lifetime management - Weak pointer promoted too late, shared pointer destroyed too early.
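The point about atomics in the list above can be sketched as follows. This is a minimal illustration, not code from a real project: `RacyRange`, `LockedRange`, and `locked_range_stays_consistent` are hypothetical names, and the invariant (hi is always lo + 10) is invented for the example. Each field of `RacyRange` is atomic, yet a reader can still observe a new `lo` paired with an old `hi`; the locked version updates and reads both fields as one unit.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical type with a two-field invariant. Each store is atomic,
// but the pair is not: a reader running between the two stores sees
// the new `lo` together with the old `hi`.
struct RacyRange {
    std::atomic<int> lo{0};
    std::atomic<int> hi{0};

    void set(int new_lo, int new_hi) {
        lo.store(new_lo);   // a reader may run here and see a mixed pair
        hi.store(new_hi);
    }
};

// Same data, with one mutex guarding both fields, so every reader
// sees a pair written by a single call to set().
class LockedRange {
public:
    void set(int new_lo, int new_hi) {
        std::lock_guard<std::mutex> lock(m_);
        lo_ = new_lo;
        hi_ = new_hi;
    }

    std::pair<int, int> get() const {
        std::lock_guard<std::mutex> lock(m_);
        return {lo_, hi_};
    }

private:
    mutable std::mutex m_;
    int lo_ = 0;
    int hi_ = 0;
};

// Hypothetical check: a writer keeps hi == lo + 10 while this thread reads.
// Returns true if every observed pair satisfied the invariant.
bool locked_range_stays_consistent(int iterations) {
    assert(iterations > 0);
    LockedRange range;
    range.set(0, 10);   // establish the invariant before the writer starts
    std::thread writer([&range, iterations] {
        for (int i = 0; i < iterations; ++i) range.set(i, i + 10);
    });
    bool ok = true;
    for (int i = 0; i < iterations; ++i) {
        std::pair<int, int> p = range.get();
        if (p.second - p.first != 10) ok = false;
    }
    writer.join();
    return ok;
}
```

Running the same check against `RacyRange` (reading `lo` and `hi` separately) is the kind of experiment that can expose mixed pairs, although how often it does so depends on the machine and the scheduler.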
Diagnostic Techniques
Debugging S5 means debugging timing, not memory.
- Reproduce under stress - Increase thread count, reduce delays, run loops, or use stress harnesses.
- Use TSAN (Thread Sanitizer) - TSAN is the single most effective tool for detecting data races.
- Add temporary logging - But be aware: logging changes timing and may hide the bug.
- Look for shared state - Any object accessed by multiple threads is suspicious.
- Check lifetime boundaries - Who owns the object? Who destroys it? Is destruction synchronized?
- Examine invariants - Multi‑field invariants require locks, not atomics.
- Reproduce with forced scheduling - Pin threads, add artificial delays, or use deterministic schedulers.
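The "reproduce under stress" step above can be sketched as a small harness. This is an assumption-laden sketch rather than a standard tool: `run_stress` and `stress_locked_counter` are hypothetical helpers. The idea is to release all threads from a common start flag so they hit the shared state at nearly the same moment, which tends to widen the window for bad interleavings.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: runs `body` on `thread_count` threads, `iterations`
// times each. Threads wait on a start flag so they enter the hot loop
// at nearly the same moment.
void run_stress(int thread_count, int iterations,
                const std::function<void()>& body) {
    assert(thread_count > 0 && iterations > 0);
    std::atomic<bool> go{false};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(thread_count));
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&go, &body, iterations] {
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();   // wait for the common start
            }
            for (int i = 0; i < iterations; ++i) body();
        });
    }
    go.store(true, std::memory_order_release);   // release all threads
    for (auto& th : threads) th.join();
}

// Example use with a correctly locked counter: the total is exact.
// Removing the lock is the kind of change that makes the result
// vary from run to run.
long stress_locked_counter(int thread_count, int iterations) {
    std::mutex m;
    long counter = 0;
    run_stress(thread_count, iterations, [&] {
        std::lock_guard<std::mutex> lock(m);
        ++counter;
    });
    return counter;
}
```

A harness like this does not prove the absence of races; it only raises the odds of seeing one, so it works best combined with TSAN.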
Remediation Steps
Fixing S5 means strengthening synchronization and ownership rules.
- Add locks around shared state - Mutexes, shared_mutex, or custom guards.
- Use message‑passing instead of shared state - Push work to the owning thread.
- Strengthen lifetime management - Use shared_ptr/weak_ptr carefully. Destroy objects only when no thread can access them.
- Avoid “lock‑free” unless we truly need it - Lock‑free code is extremely hard to get right.
- Use atomics only for simple state - Atomics do not protect invariants across multiple fields.
- Make thread ownership explicit - Document which thread owns which object.
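The message‑passing item in the list above can be sketched as a single owner thread with a task queue. `OwnerThread` and `sum_via_owner_thread` are hypothetical names for illustration, and this is one possible shape rather than a definitive design: the only shared object is the queue, which is always accessed under its mutex, while the actual state is touched by the owner thread alone.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-owner worker: tasks run one at a time on the
// worker thread, so state touched only by tasks needs no lock of its own.
class OwnerThread {
public:
    OwnerThread() : thread_([this] { loop(); }) {}

    ~OwnerThread() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_one();
        thread_.join();   // the loop drains remaining tasks, then exits
    }

    // Safe to call from any thread: only the queue is shared, under m_.
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) return;   // stopping and nothing left
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();   // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stopping_ = false;
    std::thread thread_;   // declared last so the other members exist first
};

// Example: two producers post increments; only the owner thread mutates `total`.
int sum_via_owner_thread(int per_producer) {
    int total = 0;   // touched only by tasks on the owner thread
    {
        OwnerThread owner;
        auto produce = [&] {
            for (int i = 0; i < per_producer; ++i) {
                owner.post([&total] { ++total; });
            }
        };
        std::thread a(produce);
        std::thread b(produce);
        a.join();
        b.join();
    }   // destructor drains the queue and joins the owner thread
    return total;   // read after the join, so there is no race
}
```

The trade-off is latency: callers no longer get an immediate result, but ownership becomes explicit and the invariant "only one thread touches this state" is enforced by structure rather than by discipline.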
Example 1 - Unsynchronized Access to Shared State
A classic race: one thread appends to a shared vector while another thread reads from it.
std::vector<int> data;
void writer() {
    for (int i = 0; i < 1000; ++i) {
        data.push_back(i);
    }
}
void reader() {
    for (int i = 0; i < 1000; ++i) {
        int x = data[i]; // sometimes valid, sometimes crash
    }
}
Symptom: Sometimes works. Sometimes crashes. Sometimes SIGSEGV. Sometimes an out‑of‑bounds read. Sometimes corrupted data.
Diagnostic Path:
- Reproduce under stress → inconsistent results → timing issue.
- Identify shared state → data accessed by two threads.
- Check for synchronization → none; vector is not thread‑safe.
- Confirm with TSAN (optional) → write/read race on data.
This leads directly to the root cause: unsynchronized access to a non‑thread‑safe container.
Root Cause: std::vector is not thread‑safe. A concurrent push_back can reallocate the buffer and invalidate the storage the reader is indexing, and the reader can also index past the current size.
Fix:
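std::mutex m; // guards every access to data
void writer() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        data.push_back(i);
    }
}
void reader() {
    for (std::size_t i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        if (i < data.size()) { // bounds check under the same lock
            int x = data[i];
            (void)x; // placeholder for real use of x
        }
    }
}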
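For read‑mostly data, the shared_mutex option mentioned under Remediation Steps is a possible variation on the fix above. The sketch below assumes C++17, and `SharedInts` and `reader_sees_consistent_values` are hypothetical names. Keeping the vector private means every access has to go through a method that takes the lock, which makes "forgot the lock" much harder to write.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: the vector is private, so callers cannot
// touch it without going through a locking method.
class SharedInts {
public:
    void push_back(int value) {
        std::unique_lock<std::shared_mutex> lock(m_);   // exclusive: writers
        data_.push_back(value);
    }

    // Returns the element, or std::nullopt if the index is not populated yet.
    std::optional<int> at(std::size_t index) const {
        std::shared_lock<std::shared_mutex> lock(m_);   // shared: readers
        if (index >= data_.size()) return std::nullopt;
        return data_[index];
    }

    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(m_);
        return data_.size();
    }

private:
    mutable std::shared_mutex m_;
    std::vector<int> data_;
};

// A writer appends 0..count-1 while this thread polls. Every value that
// is present must equal its index, and nothing may be lost at the end.
bool reader_sees_consistent_values(int count) {
    assert(count >= 0);
    SharedInts ints;
    std::thread writer([&] {
        for (int i = 0; i < count; ++i) ints.push_back(i);
    });
    bool ok = true;
    for (int i = 0; i < count; ++i) {
        std::optional<int> v = ints.at(static_cast<std::size_t>(i));
        if (v && *v != i) ok = false;
    }
    writer.join();
    return ok && ints.size() == static_cast<std::size_t>(count);
}
```

Whether a shared_mutex actually outperforms a plain mutex depends on the read/write ratio and the platform, so it is worth measuring before committing to it.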
Example 2 - Lifetime Race (Use‑After‑Free)
A worker thread uses an object after another thread destroys it.
struct Job {
    void run() { /* ... */ }
};
Job* job = new Job();
void worker() {
    job->run(); // sometimes valid, sometimes UAF
}
void cleanup() {
    delete job; // races with worker
}
Symptom: Crash location moves. Sometimes SIGSEGV. Sometimes SIGABRT. Sometimes no crash.
Diagnostic Path:
- Observe nondeterminism → suggests lifetime race.
- Check ownership → job shared by worker + cleanup.
- Check destruction timing → delete job may run while worker is active.
- Force scheduling → adding sleeps changes behavior → confirms timing race.
- TSAN (optional) → reports race between delete and run.
Root Cause: Lifetime is not synchronized. job is destroyed while worker still uses it.
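Fix: Hold the Job through a shared_ptr, and guard the global pointer itself with a mutex. Copying and resetting the same shared_ptr variable from two threads is itself a data race, so the worker takes a local copy under the lock:
std::shared_ptr<Job> job = std::make_shared<Job>();
std::mutex job_mutex; // guards the global job pointer, not the Job
void worker() {
    std::shared_ptr<Job> j;
    {
        std::lock_guard<std::mutex> lock(job_mutex);
        j = job; // local reference taken under the lock
    }
    if (j) j->run(); // the local reference keeps the Job alive
}
void cleanup() {
    std::lock_guard<std::mutex> lock(job_mutex);
    job.reset(); // Job is destroyed when the last reference goes away
}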
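The "weak pointer promoted too late" case from the root‑cause list can be handled with a similar shape. In the sketch below, `Task`, `try_run`, and `run_until_released` are hypothetical names (`Task` stands in for `Job`). The worker holds its own weak_ptr copy and promotes it with lock() immediately before use: either it gets a shared_ptr that keeps the object alive for the call, or it gets null and skips the work.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical stand-in for Job. Only the worker thread calls run().
struct Task {
    int runs = 0;
    void run() { ++runs; }
};

// Promote right before use. The returned shared_ptr keeps the Task alive
// for the duration of the call; null means the owner already released it.
bool try_run(const std::weak_ptr<Task>& weak) {
    if (std::shared_ptr<Task> strong = weak.lock()) {
        strong->run();
        return true;
    }
    return false;   // object already gone: skip instead of crashing
}

// The owner releases the Task while the worker is still looping.
// Returns the number of runs that completed before the worker noticed.
int run_until_released() {
    std::shared_ptr<Task> owner = std::make_shared<Task>();
    std::weak_ptr<Task> weak = owner;
    std::atomic<int> completed{0};
    // The lambda captures its own weak_ptr copy, so the two threads never
    // touch the same smart-pointer variable.
    std::thread worker([weak, &completed] {
        while (try_run(weak)) completed.fetch_add(1);
    });
    while (completed.load() < 100) std::this_thread::yield();
    owner.reset();    // after this, the worker's next lock() returns null
    worker.join();
    return completed.load();
}
```

The key detail is that the promoted shared_ptr is a local variable: the object cannot be destroyed between the check and the call, which is exactly the window a raw pointer leaves open.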
Example 3 - Using TSAN to Diagnose a Race Condition
This example shows a real race condition that does not crash reliably, but TSAN catches immediately. It demonstrates how to use the tool and how to interpret its output.
#include <thread>
#include <iostream>
int counter = 0;
void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter++; // unsynchronized write
    }
}
int main() {
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    std::cout << "counter = " << counter << "\n";
}
This program usually prints something close to 200000, but:
- sometimes prints a smaller number
- sometimes prints a corrupted value
- sometimes crashes
- sometimes works perfectly
This is classic S5 behavior.
Symptom: Nondeterministic output. Crash appears only under load. Crash disappears when adding logging. Crash location moves. Debugger hides the bug. This is the signature of a race‑condition crash.
Diagnostic Path:
Reproduce under stress - Running the program in a loop:
for i in {1..1000}; do ./a.out; done
It produces inconsistent results.
Try TSAN - Compile with Thread Sanitizer:
clang++ -fsanitize=thread -g -O1 main.cpp -o tsan_test
Run it:
./tsan_test
TSAN immediately reports the race - TSAN output (simplified):
WARNING: ThreadSanitizer: data race
  Write of size 4 at counter by thread T1
    #0 worker main.cpp:7
  Previous write of size 4 at counter by thread T2
    #0 worker main.cpp:7
  Location is global 'counter' at main.cpp:3
TSAN tells us:
- what is racing (counter)
- where the race happens (line 7)
- which threads are involved (T1 and T2)
- what type of access (write/write)
This is the fastest way to diagnose S5.
Root Cause: counter is shared mutable state accessed by multiple threads without synchronization. Even though int is small, incrementing it is not atomic: load → add → store. Two threads interleave these steps unpredictably.
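The load → add → store sequence can be made concrete with a single‑threaded simulation of one bad interleaving. This is only an illustration: `simulate_lost_update`, `reg_a`, and `reg_b` are hypothetical names, and the two "threads" are simulated by ordering the steps by hand.

```cpp
#include <cassert>

// Single-threaded simulation of one bad interleaving of two `counter++`
// operations. `reg_a` and `reg_b` stand in for each thread's register.
int simulate_lost_update() {
    int counter = 0;
    int reg_a = counter;   // thread A: load  (reads 0)
    int reg_b = counter;   // thread B: load  (reads 0, before A stores)
    reg_a = reg_a + 1;     // thread A: add
    reg_b = reg_b + 1;     // thread B: add
    counter = reg_a;       // thread A: store (counter == 1)
    counter = reg_b;       // thread B: store (counter == 1, A's update is lost)
    return counter;        // 1, not 2
}
```

Two increments ran, but the counter only advanced by one. In the real program the scheduler decides how often this ordering occurs, which is why the printed total varies between runs.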
Fix: Use a mutex:
std::mutex m;
int counter = 0;
void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        counter++;
    }
}
Or use an atomic:
std::atomic<int> counter{0};
void worker() {
    for (int i = 0; i < 100000; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
}
After either fix, TSAN reports no races, and the program prints counter = 200000 on every run.
When It’s Not This Pattern
S5 is not the correct pattern when:
- Crash is deterministic → S1 or S4
- Backtrace is corrupted → S3
- Crash location is stable → S1
- Crash disappears only under sanitizers → S2
- Crash happens on wrong thread → S4
S5 is specifically about timing‑dependent failures.
Summary
Race‑condition crashes happen when multiple threads access shared state without proper synchronization. The failure depends on timing, not code correctness. The crash location moves, disappears under debugging, and reappears under load.
The signature is consistent:
- nondeterministic
- timing‑dependent
- moving crash location
- sometimes clean, sometimes corrupted backtrace
- disappears under debugging
The only thing wrong is the interleaving.
Takeaways
- S5 is timing‑dependent - the crash depends on interleavings.
- Crash location moves - the crashing line is rarely the bug.
- Debugging changes timing - hiding the failure.
- TSAN is our best friend - use it early.
- Locks or message‑passing fix most races.
- Lifetime must be synchronized - destruction races are common.