Silent data corruption (SDC), sometimes called bit rot or silent data errors (SDEs), refers to errors in data that are not detected by standard error-checking mechanisms, potentially causing significant data loss or incorrect calculations. In AI systems, SDCs can lead to inaccurate training, incorrect predictions, and unreliable performance. Detecting SDC requires specialized techniques and tools.
SDCs can be transient or permanent. Transient SDCs can be caused by radiation events such as cosmic-ray neutron strikes or alpha particles from packaging materials. These events are difficult to predict and even harder to stop. Fortunately, they are also rare and do not contribute significantly to SDCs in data centers and most AI systems.
The bigger and more serious source of SDCs is permanent hardware faults resulting from defects in integrated circuits (ICs). That’s the focus of this article.
SDCs are quantified in defects per million (DPM) and often exist at the time of fabrication, hence the moniker “time-zero defects.” The vanishingly small feature sizes of advanced ICs can exacerbate the appearance of SDCs, making them impossible to eliminate entirely.

Especially in a high-performance IC, small defects and marginalities at numerous points in a device can produce inconsistent results. The patterning on ICs like DRAMs, CPUs, and GPUs is not perfect, and even slight irregularities in feature size, shape, and spacing can result in SDCs. This is sometimes referred to as the “oatmeal” effect (Figure 1).
Of course, the various types of ICs vulnerable to SDCs are not used in isolation; they are parts of larger systems. A recent study utilized performance data from a fleet of cloud data centers to examine the correlation between SDCs in memory and other system components. Some of the findings included (Figure 2):

- Memory errors follow a Pareto distribution in which a large portion of the errors comes from a small number of sources (a rough way to measure this concentration is sketched after this list).
- Non-DRAM failures from the memory controller and channel contribute most of the errors.
- Newer, higher-density DRAMs have higher failure rates.
- DIMMs with fewer chips and lower transfer widths have lower error rates.
- CPU and memory utilization rates, CPU% and Memory%, respectively, are correlated with overall server failure rates.
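To make that concentration concrete, here is a minimal sketch (with made-up server IDs and error counts, not data from the cited study) that computes what fraction of servers accounts for a given share of logged correctable errors.

```python
# Illustrative sketch: how concentrated are correctable-error counts across a
# fleet? Computes the fraction of servers that produces a given share of errors.
from collections import Counter

def error_concentration(error_log, share=0.9):
    """error_log: iterable of server IDs, one entry per logged correctable error.
    Returns the fraction of servers that together produce `share` of all errors."""
    counts = sorted(Counter(error_log).values(), reverse=True)
    total = sum(counts)
    running, servers_needed = 0, 0
    for c in counts:
        running += c
        servers_needed += 1
        if running >= share * total:
            break
    return servers_needed / len(counts)

# Hypothetical example: three noisy servers dominate the fleet's error log.
log = ["srv-7"] * 500 + ["srv-42"] * 300 + ["srv-913"] * 150 + [f"srv-b{i}" for i in range(50)]
print(f"{error_concentration(log):.1%} of servers produce 90% of errors")
```

In a fleet that follows the Pareto pattern described above, this fraction comes out very small.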
Detection and mitigation
Detecting and mitigating SDCs is challenging once the ICs are installed in systems. Some defects manifest only under specific combinations of temperature, voltage, frequency, and instruction sequences.
In one case, it was observed that 1% of servers were responsible for 97.8% of all correctable errors. One way to mitigate the impact of SDCs is to use redundancy and fault-tolerant architectures, where multiple systems or processors verify the results and validate the data.
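A minimal sketch of that idea appears below; it assumes a simple majority vote over identical replicas, and the `vote` helper and replica count are illustrative choices rather than any specific vendor’s fault-tolerance scheme.

```python
# Minimal sketch of redundant execution with majority voting, one common way to
# catch silent data corruption at run time. In practice the replicas would run
# on different cores, sockets, or hosts; here they run in-process for brevity.
from collections import Counter

def vote(compute, args, replicas=3):
    """Run the same deterministic computation `replicas` times, return the
    majority result, and flag any disagreement as a possible SDC."""
    results = [compute(*args) for _ in range(replicas)]
    winner, hits = Counter(results).most_common(1)[0]
    if hits < replicas:
        print(f"warning: {replicas - hits} replica(s) disagreed -- possible SDC")
    return winner

# Example with a trivial, hypothetical workload.
print(vote(lambda a, b: a * b + 1, (3, 4)))
```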
Redundancy can be expensive, however, and it can slow overall system operation. Another approach is to identify potentially faulty chips before they are integrated into systems.
For example, Intel’s Data Center Diagnostics Tool (DCDiag) uses multiple mechanisms to identify SDC. It’s based on repeatedly performing an operation or calculation and confirming a correct outcome.
Since these tests explicitly confirm the correctness of every calculation, they have improved the identification of defective parts that cause SDC. Some of the tests include confirming the accuracy of core-to-core and socket-to-socket communications and running complex floating-point, integer, and data manipulation instructions.
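The sketch below illustrates the general repeated-calculation principle with a simple known-answer test; it is a hand-rolled illustration of the idea, not DCDiag’s actual test content, and the function and iteration count are hypothetical.

```python
# Illustrative known-answer test: repeat integer and floating-point workloads
# whose correct results are known in advance and count any mismatches, which
# on healthy hardware should always be zero.
import math

def known_answer_test(iterations=100_000):
    mismatches = 0
    for _ in range(iterations):
        # Integer check: sum of 1..1000 has a closed-form expected value.
        if sum(range(1001)) != 1000 * 1001 // 2:
            mismatches += 1
        # Floating-point check: recompute a value and compare to its reference.
        if not math.isclose(math.sin(0.5) ** 2 + math.cos(0.5) ** 2, 1.0, rel_tol=1e-12):
            mismatches += 1
    return mismatches

print(f"mismatches: {known_answer_test(1000)}")
```

Real diagnostic suites use far more varied instruction mixes and exercise every core and socket, but the pass/fail principle is the same.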
The Open Compute Project (OCP) recently established its Server Component Resilience Workstream in response to the increasing challenges of SDC. This workstream focuses on research into hardware-caused SDC and the development of effective detection and mitigation tools. Initial members involved in the workstream include AMD, ARM, Google, Intel, Meta, Microsoft, and NVIDIA.
Summary
With the growing complexity of AI models and training and the shrinking feature sizes of advanced ICs, SDC is a growing problem. The leading cause of SDC is so-called “time-zero defects” in hardware that arise during IC fabrication, which makes detection and mitigation especially challenging. Recently, the OCP established an industry-wide workstream to develop effective tools for dealing with SDC.
References
Computing’s Hidden Menace: The OCP Takes Action Against Silent Data Corruption (SDC), Open Compute Project
Data Center Silent Data Errors: Implications to Artificial Intelligence Workloads & Mitigations, Intel
Detecting silent errors in the wild: Combining two novel approaches to quickly detect silent data corruptions at scale, Engineering at Meta
Examining Silent Data Corruption: A Lurking, Persistent Problem in Computing, Synopsys
Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field, Meta Research
Silent Data Corruption, Google
Silent Data Corruption: A Survey Article, Asset
Silent Data Corruptions: Microarchitectural Perspectives, IEEE Computer Society
Silent Data Errors: Sources, Detection, and Modeling, IEEE
EEWorld related content
What interconnects are used with memory for HPC and AI?
What determines the size of the dataset needed to train an AI?
What is the HPC memory wall and how can you climb over it?
What tools are there to reduce AI power consumption?
What’s the difference between GPUs and TPUs for AI processing?