
Firmware Black Box: diagnosing embedded resets in the field

A device that resets in the field is not always the hardest problem. The harder problem is a device that resets, comes back online, and leaves no evidence about what happened before the reboot. That is where a firmware black box becomes useful. This is the DEV.to edition of a Silicon LogiX technical article. The canonical English source is linked at the end.

What a firmware black box is

A firmware black box is a small diagnostic subsystem inside the firmware. Its job is to preserve enough information to support post-mortem analysis after a reset, watchdog event, HardFault, panic or unexpected reboot. It does not need to record everything. It needs to record the data that helps answer the first diagnostic questions:

Why did the device reset?
How long had it been running?
Which firmware build was installed?
What state was the application in?
Which task was active?
Did the watchdog fire?
Did memory, stack or heap margins collapse?
Did the network, modem, BLE, Wi-Fi or OTA flow fail just before the reboot?

Without that data, every field reset deletes most of the evidence.

Why sporadic resets are expensive

Rare embedded bugs are often more expensive than obvious failures. A crash that happens every time in the same function can usually be analyzed with a debugger, logs and a repeatable test. A reset that appears once every ten days on a customer device is different. The cause may depend on a combination of:

Temperature
Unstable power
Brown-out
Cable length
Enclosure heating
Network drops
Modem state
Memory fragmentation
Stack exhaustion
Long uptime
Race conditions
A peripheral that stops responding
An OTA edge case

In the lab, the product may look clean. In the field, the environment changes. The customer report often becomes: "it rebooted", "it stopped communicating", or "we had to power-cycle it". That is not enough for firmware diagnosis.

What to capture

A good first version does not need to be large. Start with a compact structure that survives the next boot:

Reset reason
Uptime before reset
Firmware version and build ID
Hardware variant
Application state
Last meaningful error
Watchdog counters
Recent application events
Active RTOS task
Stack high-water marks
Minimum heap
OTA state
Network state
Fault registers or core dump when available

For FreeRTOS systems, the value returned by uxTaskGetStackHighWaterMark can be extremely useful when stack margins are suspected: it reports the smallest amount of free stack a task has had since it started. For Cortex-M systems, fault status registers, stack pointer information and the program counter can turn a blind HardFault into a useful post-mortem report. For ESP32 products, ESP-IDF already provides useful mechanisms such as panic handling and core dumps. The best design often combines those built-in tools with application-level event history and an upload after reboot.
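As an illustration of how compact a first version can be, here is a minimal sketch of such a record in C. Everything in it is an assumption made for the example: the field names, the widths, the BBOX_MAGIC value and the bbox_* function names are hypothetical. On a real target the record would usually be placed in a memory region that the startup code does not clear, which is toolchain-specific and not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BBOX_MAGIC 0xB10CB0C5u /* hypothetical marker for an initialized record */

/* Hypothetical layout: field names and widths are illustrative only. */
typedef struct {
    uint32_t magic;         /* BBOX_MAGIC once the record has been sealed */
    uint32_t reset_reason;  /* platform-specific code captured at boot */
    uint32_t uptime_s;      /* uptime in seconds at the last update */
    uint32_t build_id;      /* for example a truncated commit hash */
    uint16_t hw_variant;
    uint16_t app_state;
    uint16_t last_error;
    uint16_t wdt_resets;    /* number of watchdog resets seen so far */
    uint32_t min_free_heap; /* lowest free heap observed, in bytes */
    uint32_t crc;           /* CRC-32 over every field above */
} bbox_record_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), small rather than fast. */
uint32_t bbox_crc32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Stamp the magic and checksum after the fields have been updated. */
void bbox_seal(bbox_record_t *rec)
{
    assert(rec != NULL);
    rec->magic = BBOX_MAGIC;
    rec->crc = bbox_crc32(rec, offsetof(bbox_record_t, crc));
}

/* At boot: trust the record only if both magic and checksum match. */
int bbox_is_valid(const bbox_record_t *rec)
{
    assert(rec != NULL);
    return rec->magic == BBOX_MAGIC &&
           rec->crc == bbox_crc32(rec, offsetof(bbox_record_t, crc));
}
```

The intended flow is to call bbox_seal after every meaningful update and bbox_is_valid first thing at boot. A record that fails the check is treated as garbage from a cold power-up, not as evidence.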

Watchdog is not enough

The watchdog is essential. It can recover service when firmware stops responding. But if the watchdog resets the system and nothing is stored, the device is operational again while the root cause has disappeared. A good watchdog restarts the device. A good firmware black box explains why the restart was necessary. Useful questions include:

Which task failed to report health?
Was the device reconnecting to the cloud?
Was the modem stuck?
Was heap decreasing over time?
Was an OTA update in progress?
Were there repeated DNS, TLS, MQTT or BLE errors?

This is the difference between recovery and diagnosis.
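The first question, which task failed to report health, can be answered with a simple check-in scheme. The sketch below is one possible approach, not a reference design: the task IDs and the health_* function names are hypothetical, and the variables are plain globals so the example stays self-contained. In an RTOS the check-in would need to be atomic, and last_missing would live in retained memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task list for the example. */
enum { TASK_SENSOR, TASK_NETWORK, TASK_OTA, TASK_COUNT };

static uint32_t checkins;     /* bit n set: task n reported since the last check */
static uint32_t last_missing; /* bitmask of tasks that missed the last deadline */

/* Each task calls this from its main loop when it is making progress. */
void health_checkin(unsigned task_id)
{
    assert(task_id < TASK_COUNT);
    checkins |= 1u << task_id;
}

/*
 * Called periodically by a supervisor. Returns true when every task has
 * checked in, meaning the hardware watchdog may be fed. Otherwise it records
 * which tasks were silent and returns false, so the watchdog is left to expire.
 */
bool health_supervise(void)
{
    const uint32_t expected = (1u << TASK_COUNT) - 1u;
    const uint32_t missing = expected & ~checkins;

    checkins = 0;
    if (missing != 0u) {
        last_missing = missing; /* store the evidence before the reset */
        return false;
    }
    return true;
}

/* Read back at the next boot to name the silent tasks. */
uint32_t health_last_missing(void)
{
    return last_missing;
}
```

The design choice here is that the watchdog is fed only by the supervisor, never by individual tasks. A single stuck task then causes a reset, and the stored bitmask says which one it was.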

Logging and diagnostics are different

Many firmware projects have logs. That does not mean they are diagnosable. UART logs are useful during development, but they often disappear in production. If nobody had a terminal connected when the customer device reset, those logs are gone. Diagnostics should be designed for field conditions:

No debugger attached
No serial cable connected
No engineer present at the crash
Limited memory
Possible power loss
Possible network failure
Support team needs a readable report

A compact persistent ring buffer with the latest important events may be more valuable than thousands of UART lines that nobody will ever see.
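A ring buffer of this kind can be very small. The following is a minimal sketch under stated assumptions: the capacity, the entry layout and the evlog_* names are hypothetical, and persistence (retained RAM, flash or FRAM) is left out so the logic can be read on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EVLOG_CAPACITY 8u /* hypothetical size; a power of two */

typedef struct {
    uint32_t uptime_s; /* when the event happened */
    uint16_t code;     /* application-defined event code */
    uint16_t arg;      /* small event-specific detail */
} evlog_entry_t;

typedef struct {
    uint32_t total; /* events ever written; total % capacity is the next slot */
    evlog_entry_t entries[EVLOG_CAPACITY];
} evlog_t;

/* Append an event, overwriting the oldest one once the buffer is full. */
void evlog_push(evlog_t *log, uint32_t uptime_s, uint16_t code, uint16_t arg)
{
    assert(log != NULL);
    evlog_entry_t *e = &log->entries[log->total % EVLOG_CAPACITY];
    e->uptime_s = uptime_s;
    e->code = code;
    e->arg = arg;
    log->total++;
}

/* Number of events currently retained. */
size_t evlog_count(const evlog_t *log)
{
    assert(log != NULL);
    return log->total < EVLOG_CAPACITY ? log->total : EVLOG_CAPACITY;
}

/* Index 0 is the oldest retained event. Returns NULL when out of range. */
const evlog_entry_t *evlog_get(const evlog_t *log, size_t index)
{
    const size_t n = evlog_count(log);
    if (index >= n) {
        return NULL;
    }
    const uint32_t first = log->total - (uint32_t)n;
    return &log->entries[(first + index) % EVLOG_CAPACITY];
}
```

Pushing ten events into this eight-entry buffer keeps events 2 through 9. Fixed-size numeric entries are a deliberate choice: they cost far less memory than text lines and can be decoded into a readable report after upload.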

Storage options

The right storage depends on the product:

Retention RAM for frequently updated data that must survive software resets
Internal flash for small critical reports
NVS or a dedicated partition on ESP32
EEPROM or FRAM for frequent diagnostic counters
Local filesystem on embedded Linux gateways
Cloud upload after reboot for connected devices

Do not make the cloud the only source of truth. The network may be part of the problem. Also be careful with flash wear, power loss during writes, atomic updates and sensitive data. Diagnostic reports should not contain passwords, private keys, tokens or unnecessary personal data.
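One common way to survive power loss during a write is to keep two copies and always overwrite the older one. The sketch below simulates that in RAM. It is an assumption-laden example: the slot layout and the store_* names are hypothetical, an FNV-1a hash stands in for a real CRC, and the flash erase and program steps are reduced to a struct assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_PAYLOAD_SIZE 16u /* hypothetical report size */

typedef struct {
    uint32_t seq;                        /* grows with every committed write */
    uint8_t payload[SLOT_PAYLOAD_SIZE];  /* the diagnostic report itself */
    uint32_t check;                      /* hash over seq and payload */
} slot_t;

/* FNV-1a, used here only as a compact stand-in for a CRC. */
static uint32_t slot_hash(const slot_t *s)
{
    const uint8_t *p = (const uint8_t *)s;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < offsetof(slot_t, check); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

static int slot_valid(const slot_t *s)
{
    return s->check == slot_hash(s);
}

/* Returns the newest slot that passes its check, or NULL if neither does. */
const slot_t *store_latest(const slot_t slots[2])
{
    const int v0 = slot_valid(&slots[0]);
    const int v1 = slot_valid(&slots[1]);
    if (v0 && v1) {
        /* Signed difference keeps the comparison correct across wrap-around. */
        return (int32_t)(slots[1].seq - slots[0].seq) > 0 ? &slots[1] : &slots[0];
    }
    if (v0) return &slots[0];
    if (v1) return &slots[1];
    return NULL;
}

/* Writes the payload into the slot that does NOT hold the latest good copy. */
void store_commit(slot_t slots[2], const uint8_t payload[SLOT_PAYLOAD_SIZE])
{
    const slot_t *latest = store_latest(slots);
    slot_t *target = (latest == &slots[0]) ? &slots[1] : &slots[0];
    slot_t next;

    memset(&next, 0, sizeof next);
    next.seq = latest ? latest->seq + 1u : 1u;
    memcpy(next.payload, payload, SLOT_PAYLOAD_SIZE);
    next.check = slot_hash(&next);
    *target = next; /* on real flash: erase the target, then program it */
}
```

If power fails in the middle of a commit, the half-written slot fails its check at the next boot and store_latest falls back to the previous report. Alternating between two slots also spreads the writes, although it does not replace proper wear planning.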

A practical checklist

Before shipping a connected embedded product, ask:

[ ] Is reset reason collected at boot?
[ ] Are watchdog, brown-out, software reset and manual reset distinguished?
[ ] Is firmware version tied to a build ID or commit?
[ ] Is hardware variant recorded?
[ ] Is there a useful HardFault or panic strategy?
[ ] Are RTOS task health and stack margins observable?
[ ] Is minimum heap recorded?
[ ] Is there a persistent event ring buffer?
[ ] Are network and OTA events included?
[ ] Can support export a report without JTAG?
[ ] Are sensitive data and credentials excluded?
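The first two checklist items usually come down to reading the reset flags once at boot and mapping them to a single cause. The sketch below shows the idea with made-up flag bits. The RST_FLAG_* positions and the reset_* names are hypothetical; real values come from the MCU reference manual or, on ESP32, from the value returned by esp_reset_reason(). The priority order is also an assumption and should be checked against the specific part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; real positions are defined by the MCU vendor. */
#define RST_FLAG_POWER_ON  (1u << 0)
#define RST_FLAG_BROWN_OUT (1u << 1)
#define RST_FLAG_PIN       (1u << 2)
#define RST_FLAG_SOFTWARE  (1u << 3)
#define RST_FLAG_WATCHDOG  (1u << 4)

typedef enum {
    RESET_UNKNOWN,
    RESET_POWER_ON,
    RESET_BROWN_OUT,
    RESET_WATCHDOG,
    RESET_SOFTWARE,
    RESET_PIN
} reset_cause_t;

/*
 * Several flags can be set at once, so the order of the checks matters.
 * Power-related causes are tested first. The pin flag is tested last because
 * on some parts it is raised alongside other reset sources.
 */
reset_cause_t reset_classify(uint32_t flags)
{
    if (flags & RST_FLAG_BROWN_OUT) return RESET_BROWN_OUT;
    if (flags & RST_FLAG_POWER_ON)  return RESET_POWER_ON;
    if (flags & RST_FLAG_WATCHDOG)  return RESET_WATCHDOG;
    if (flags & RST_FLAG_SOFTWARE)  return RESET_SOFTWARE;
    if (flags & RST_FLAG_PIN)       return RESET_PIN;
    return RESET_UNKNOWN;
}

/* Short name for reports that support staff can read. */
const char *reset_cause_name(reset_cause_t cause)
{
    static const char *const names[] = {
        "unknown", "power-on", "brown-out", "watchdog", "software", "pin"
    };
    assert((unsigned)cause < sizeof names / sizeof names[0]);
    return names[cause];
}
```

On most hardware the reset flags stay set until firmware clears them. Reading them early, storing the result and then clearing them keeps the next boot from reporting a stale cause.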

Typical field scenario

Imagine an STM32 or ESP32 controller. It passes lab tests, talks to sensors, sends data to the cloud and supports OTA. The watchdog is enabled, so the team feels safe. After release, some customers report random reboots. Without a black box, the team guesses: maybe the modem, maybe a task stack, maybe power, maybe memory, maybe I2C, maybe Wi-Fi.

With a black box, the next episode can produce a report:

Watchdog reset
Uptime: 91 hours
Application state: MQTT reconnect
Network task stack almost exhausted
Several DNS timeouts in the recent event buffer
Firmware build ID identified

That is not the final fix, but it is a direction. In embedded debugging, a direction can save days or weeks.
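A report like the one above can be produced as a single line of key=value pairs that support staff can paste into a ticket. This is only a sketch of one possible format: the bbox_report_t fields, the key names and the bbox_format_report function are hypothetical, and the strings are assumed to be decoded from stored numeric codes elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, already-decoded view of the stored black box data. */
typedef struct {
    const char *reset_cause;        /* for example "watchdog" */
    uint32_t uptime_s;              /* uptime before the reset, in seconds */
    const char *app_state;          /* for example "mqtt_reconnect" */
    uint32_t net_stack_free_words;  /* stack high-water mark of the network task */
    uint32_t dns_timeouts;          /* DNS timeouts found in the event buffer */
    const char *build_id;           /* firmware build identifier */
} bbox_report_t;

/*
 * Formats the report as one line of key=value pairs.
 * Returns the number of characters written, or -1 if the buffer is too small.
 */
int bbox_format_report(const bbox_report_t *r, char *out, size_t out_size)
{
    assert(r != NULL && out != NULL);
    const int n = snprintf(out, out_size,
                           "reset=%s uptime_h=%lu state=%s "
                           "net_stack_free=%lu dns_timeouts=%lu build=%s",
                           r->reset_cause,
                           (unsigned long)(r->uptime_s / 3600u),
                           r->app_state,
                           (unsigned long)r->net_stack_free_words,
                           (unsigned long)r->dns_timeouts,
                           r->build_id);
    return (n >= 0 && (size_t)n < out_size) ? n : -1;
}
```

A flat key=value line is easy to grep, survives copy and paste, and can be parsed by a script later. The report contains only diagnostic fields, in line with the rule that credentials and personal data stay out of it.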

Final takeaway

A firmware black box is not an extra feature for perfectionists. It is part of building maintainable embedded products. A device in the field should not only recover. It should explain what happened. When diagnostics are designed with reset reason, watchdog analysis, fault context, RTOS metrics, persistent events, OTA state and safe export, sporadic resets stop being mysterious and become measurable. And a measurable problem is much closer to a fix.

Canonical source: Firmware Black Box: how to find out why an embedded device resets in the field

If you build embedded, IoT or firmware products and want a second pair of eyes on diagnostics, watchdog strategy, OTA behavior or field failures, Silicon LogiX can help turn hard-to-reproduce issues into measurable engineering work.
