Notes from the Wired

Runtime Fault Localization in Deep Neural Network Accelerators

May 11, 2026 | 991 words | 5min read

Paper Title: Runtime Fault Localization in Deep Neural Network Accelerators

Link to Paper: https://dl.acm.org/doi/10.1145/3770920

Date: 11. Nov 2025

Paper Type: Systolic arrays, deep neural network (DNN) accelerator, fault detection, fault localization, Checksums

Short Abstract: Systolic arrays are widely used for accelerating deep neural networks (DNNs) due to their high parallelism and data reuse efficiency. However, hardware faults in their numerous processing elements (PEs) can propagate errors and significantly degrade inference accuracy. The paper addresses the open challenge of fault localization in systolic arrays. It proposes a lightweight fault tolerance framework that performs both run-time fault detection and localization during normal operation. The approach uses functional data to generate checksums on the fly, eliminating the need for dedicated test patterns or system downtime. Results: 100% fault detection and localization with 2% area overhead.

1. Introduction

Deep Neural Networks (DNNs) are computationally intensive, driving the development of specialized hardware accelerators, such as Google’s Tensor Processing Unit (TPU), which relies on systolic arrays. Systolic arrays consist of a grid of Processing Elements (PEs) that perform multiply-and-accumulate (MAC) operations with efficient dataflow and high parallelism.

However, systolic arrays are vulnerable to hardware faults. Due to their dataflow nature, an error in a single faulty PE can propagate downstream, significantly degrading DNN inference accuracy.

Key Limitations of Existing Approaches:

Proposed Solution: The paper introduces a runtime fault tolerance framework for systolic arrays that addresses both fault detection and precise fault localization during normal operation:

2. Background

Deep Neural Networks (DNNs) consist of layers of neurons performing massive multiply-and-accumulate (MAC) operations. Systolic arrays accelerate these workloads through a grid of Processing Elements (PEs) with efficient data reuse and parallelism.

The paper focuses on the Weight Stationary (WS) dataflow:

A single faulty PE can cause significant accuracy degradation in DNN inference because errors propagate through the array due to the interconnected dataflow.

Fault Propagation in WS Dataflow:

The same faulty PEs are reused across different layers, so they repeatedly corrupt outputs and severely hurt overall model accuracy.

Fault propagation under weight stationary dataflow. (a) A single fault originating from the input register propagates in both column-wise and row-wise directions. (b) A single fault originating from either the MAC unit, weight register, or partial sum register propagates in a column-wise direction. (c) Multiple faults with different types induce identical symptoms as described in (a).
Error propagation in matrix multiplication. (a) An erroneous value in the input matrix propagates row-wise in the output matrix. (b) An erroneous value in the weight matrix propagates column-wise in the output matrix.
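
The row-wise and column-wise propagation in matrix multiplication can be reproduced with a few lines of plain Python. This is only a minimal sketch: the matrices, the helper names (`matmul`, `corrupted_cells`) and the injected error value are assumptions chosen for illustration and are not taken from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def corrupted_cells(good, bad):
    """Return the set of (row, col) positions where two matrices differ."""
    return {(i, j)
            for i, row in enumerate(good)
            for j, val in enumerate(row)
            if val != bad[i][j]}


# Illustrative data (assumed, not from the paper).
X = [[1, 2, 3],
     [4, 5, 6]]          # 2x3 input matrix
W = [[1, 2],
     [3, 4],
     [5, 6]]             # 3x2 weight matrix
Y = matmul(X, W)         # fault-free reference output

# Corrupt a single input value in row 0 of X.
X_bad = [row[:] for row in X]
X_bad[0][1] += 7
input_errors = corrupted_cells(Y, matmul(X_bad, W))

# Corrupt a single weight value in column 1 of W.
W_bad = [row[:] for row in W]
W_bad[2][1] += 7
weight_errors = corrupted_cells(Y, matmul(X, W_bad))

print(sorted(input_errors))    # every corrupted cell lies in output row 0
print(sorted(weight_errors))   # every corrupted cell lies in output column 1
```

With these values, the input error touches only row 0 of the output and the weight error touches only column 1, matching the two cases in the caption.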
Comparison of Runtime Fault Detection and Localization Approaches for DNN Accelerators

4. Methodology: Proposed Fault Localization Framework

4.1 Fault Model

The method covers faults that affect PE computation results, including:

The method is designed for integer operations but can be extended to floating-point.

4.2 Fault Detection

Instead of dedicated test patterns, the method leverages functional data during normal matrix multiplication (X × W = Y) to generate golden checksums on-the-fly.

For each column j of the systolic array, the method computes a column checksum CSⱼ, which is the sum of all outputs (partial sums) from that column. Using the mathematical property of matrix multiplication:

CSⱼ = Σᵢ Yᵢⱼ = Σₖ Wₖⱼ × (Σᵢ Xᵢₖ)

This allows efficient verification of column integrity.
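
The checksum identity can be checked numerically. The sketch below is a software illustration only, not the paper's hardware: the function names (`golden_checksums`, `faulty_columns`), the example matrices and the injected bit flip are all assumptions. It computes the golden checksum from X and W alone, compares it with the column sums of the observed output Y, and reports the columns that disagree.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_sums(m):
    """Sum each column of a matrix given as a list of rows."""
    return [sum(col) for col in zip(*m)]


def golden_checksums(x, w):
    """CS_j = sum_k W[k][j] * (sum_i X[i][k]), computed without using Y."""
    x_col_sums = column_sums(x)          # sum_i X[i][k] for each k
    return [sum(w[k][j] * x_col_sums[k] for k in range(len(w)))
            for j in range(len(w[0]))]


def faulty_columns(y, golden):
    """Indices of output columns whose observed sum differs from the golden one."""
    observed = column_sums(y)
    return [j for j, (o, g) in enumerate(zip(observed, golden)) if o != g]


# Illustrative data (assumed, not from the paper).
X = [[1, 2, 3],
     [4, 5, 6]]
W = [[1, 2],
     [3, 4],
     [5, 6]]

Y = matmul(X, W)
golden = golden_checksums(X, W)
clean_report = faulty_columns(Y, golden)      # no mismatch expected

# Emulate a fault by flipping one bit of a single output in column 0.
Y_bad = [row[:] for row in Y]
Y_bad[1][0] ^= 1 << 3
fault_report = faulty_columns(Y_bad, golden)  # column 0 should be flagged

print(golden, clean_report, fault_report)
```

Because the golden checksum needs only the column sums of X and one extra dot product per array column, it can be formed from the same functional data that is already streaming through the array.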

Implementation:

Detection Capability:

4.3 Fault Localization

When a checksum mismatch is detected during normal operation, the system switches to a dedicated fault localization mode.

Every PE is equipped with its own error checker that monitors its partial sum output. This is used for localization.
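
One way to picture why a per-PE check isolates the origin of a fault is the toy simulation below. The notes do not describe the checker's internals, so everything here is an assumption: the function `run_column`, the single-bit-flip fault, and a checker that compares each PE's partial-sum output against `psum_in + x * w`. It models one weight-stationary column in software and is not the paper's circuit.

```python
def run_column(inputs, weights, faulty_pe=None, flip_bit=0):
    """Simulate one weight-stationary column for a single input vector.

    PE k holds weights[k] and emits psum_out = psum_in + inputs[k] * weights[k].
    If faulty_pe is set, that PE's output has one bit flipped (a hypothetical
    fault used only for this illustration).
    Returns the final partial sum and the list of PEs whose checker fired.
    """
    psum = 0
    flagged = []
    for k, (x, w) in enumerate(zip(inputs, weights)):
        expected = psum + x * w          # what a fault-free PE would emit
        actual = expected
        if k == faulty_pe:
            actual ^= 1 << flip_bit      # inject the fault at this PE's output
        if actual != expected:           # hypothetical per-PE checker
            flagged.append(k)
        psum = actual                    # the (possibly wrong) value flows down
    return psum, flagged


# Illustrative data (assumed, not from the paper).
inputs = [1, 2, 3, 4]
weights = [5, 6, 7, 8]

clean_sum, clean_flags = run_column(inputs, weights)
bad_sum, bad_flags = run_column(inputs, weights, faulty_pe=2, flip_bit=4)

print(clean_sum, clean_flags)
print(bad_sum, bad_flags)
```

In this model the column output is wrong, but only the checker at PE 2 fires: downstream PEs add correctly to an already corrupted partial sum, so their local checks pass and the fault is pinned to its origin.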

4.4 Proposed Hardware

5. Results

Results:


Some Thoughts
