Who Checks the Checker? Enhancing Component-level Architectural SEU Fault Tolerance for End-to-End SoC Protection

2026-03-27 • Hardware Architecture


AI summary

The authors study how to protect systems-on-chip (SoCs) from radiation-induced errors that can corrupt their operation. Instead of only protecting components such as processors and memories separately, they also protect the interconnect and voting logic that link those components. By combining different protection methods so that they overlap, they harden the whole chip while keeping the added area lower. They tested their approach on a RISC-V microcontroller SoC and showed that it tolerates over 99.9% of injected faults with 22% lower implementation overhead than a single global method such as fine-grained triplication.

Single-event upset (SEU), System-on-chip (SoC), Fault tolerance, Architectural fault tolerance, Voting logic, Interconnect, RISC-V, Fault injection, Hardware design overhead, Triplication

Authors

Michael Rogenmoser, Philippe Sauter, Chen Wu, Angelo Garofalo, Luca Benini

Abstract

Single-event upset (SEU) fault tolerance for systems-on-chip (SoCs) in radiation-heavy environments is often addressed by architectural fault-tolerance approaches protecting individual SoC components (e.g., cores, memories) in isolation. However, the protection of voting logic and interconnections among components is also critical, as these become single points of failure in the design. We investigate combining multiple fault-tolerance approaches targeting individual SoC components, including interconnect and voting logic, to ensure end-to-end SoC-level architectural SEU fault tolerance while minimizing implementation area overheads. Enforcing an overlap between the protection methods ensures hardening of the whole design without gaps, while curtailing overheads. We demonstrate our approach on a RISC-V microcontroller SoC. SEU fault tolerance is assessed with simulation-based fault injection. Overheads are assessed with full physical implementation. Tolerance to over 99.9% of faults in both RTL and implemented netlist is demonstrated. Furthermore, the design exhibits 22% lower implementation overhead compared to a single global fault-tolerance method, such as fine-grained triplication.
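
The point about voting logic becoming a single point of failure can be illustrated with a toy software model. The sketch below is not the authors' design or their fault-injection flow; it assumes a single 32-bit value held in three replicas, a bitwise 2-of-3 majority voter, and exactly one injected bit flip per trial. All names (`majority`, `run_trial`, `tolerance`) are hypothetical.

```python
import random

WIDTH = 32  # assumed word width for this toy model


def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three words."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word, bit):
    """Model an SEU as a single bit flip."""
    return word ^ (1 << bit)


def run_trial(rng, triplicated_voter):
    """Inject one bit flip at a random site; return True if the result is still correct."""
    golden = rng.getrandbits(WIDTH)
    replicas = [golden, golden, golden]
    n_voters = 3 if triplicated_voter else 1

    # Fault sites: the three replica words, plus the output of each voter.
    site = rng.randrange(3 + n_voters)
    bit = rng.randrange(WIDTH)

    if site < 3:
        replicas[site] = flip_bit(replicas[site], bit)

    voted = [majority(*replicas) for _ in range(n_voters)]

    if site >= 3:
        voted[site - 3] = flip_bit(voted[site - 3], bit)

    # Assumption: with triplicated voters, the next protected stage votes on
    # the three voter outputs at its own inputs, so protection regions overlap.
    final = majority(*voted) if triplicated_voter else voted[0]
    return final == golden


def tolerance(triplicated_voter, trials=10000, seed=0):
    """Fraction of single-fault trials that still produce the golden value."""
    rng = random.Random(seed)
    ok = sum(run_trial(rng, triplicated_voter) for _ in range(trials))
    return ok / trials


if __name__ == "__main__":
    print("single voter:     ", tolerance(False))
    print("triplicated voter:", tolerance(True))
```

In this model, a flip in any replica is always masked, but a flip at the output of a lone voter propagates unchecked, so the single-voter configuration tolerates only a fraction of the injected faults while the triplicated-voter configuration tolerates all of them. Real designs face additional fault sites (interconnect, clock and reset trees, combinational glitches) that this sketch does not cover.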
