Microcontroller Tips

Microcontroller engineering resources, new microcontroller products and electronics engineering news

How can silent data corruption be detected and corrected in AI systems?

June 18, 2025 By Jeff Shepard

Silent data corruption (SDC), sometimes called bit rot or silent data errors (SDEs), refers to errors in data that are not detected by standard error-checking mechanisms, leading to potentially significant data loss or incorrect calculations. SDCs can lead to inaccurate training, incorrect predictions, and unreliable performance. Detecting SDC requires specialized techniques and tools.
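To see why a single undetected error matters numerically, the following minimal sketch inverts one bit in the 32-bit floating-point encoding of a value. The function name `flip_bit` and the example weight are hypothetical and chosen only for illustration; real SDCs arise in hardware, not from code like this.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 float32 encoding inverted."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the chosen bit and reinterpret the result as a float32.
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return corrupted


if __name__ == "__main__":
    weight = 1.0  # hypothetical model weight, for illustration only
    for bit in (0, 23, 30, 31):
        print(f"bit {bit:2d} flipped: {weight} -> {flip_bit(weight, bit)!r}")
```

For a weight of 1.0, inverting bit 23 yields 0.5, inverting bit 31 yields -1.0, and inverting bit 30 yields infinity. None of these raises an error, which is what makes the corruption silent.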

SDCs can be transient or permanent. Transient SDCs can be caused by radiation events such as neutron or alpha-particle strikes. These events are difficult to predict and even more challenging to stop. Fortunately, they are also rare and do not significantly contribute to SDCs in data centers and most AI systems.

The bigger and more serious source of SDCs is permanent hardware faults resulting from defects in ICs. That’s the focus of this article.

The defects that cause SDCs are quantified in defects per million (DPM) and often exist at the time of fabrication, hence the moniker “time 0 defects.” The vanishingly small feature sizes of advanced ICs make these defects more likely, and it is impossible to eliminate them entirely.
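As a rough illustration of what a DPM figure implies at fleet scale, the sketch below converts a DPM rate into an expected number of affected parts and the probability that a fleet contains at least one. It assumes DPM is read as defective parts per million parts and that parts fail independently. The fleet size and DPM value are hypothetical, not measured data.

```python
def expected_defective(fleet_size: int, dpm: float) -> float:
    """Expected number of defective parts in a fleet at a given DPM rate."""
    return fleet_size * dpm / 1_000_000


def prob_at_least_one(fleet_size: int, dpm: float) -> float:
    """Probability that at least one part in the fleet is defective.

    Assumes each part is defective independently with probability dpm / 1e6.
    """
    p = dpm / 1_000_000
    return 1.0 - (1.0 - p) ** fleet_size


if __name__ == "__main__":
    fleet, dpm = 100_000, 1_000  # hypothetical values for illustration
    print("expected defective parts:", expected_defective(fleet, dpm))
    print("P(at least one defective):", prob_at_least_one(fleet, dpm))
```

Under these assumptions, even a rate that looks small per part becomes a near certainty once a fleet is large enough.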

Figure 1. Microscopic defects cause IC nets to deviate from ideal and are one cause of SDC. (Image: Asset)

Especially in a high-performance IC, small defects and marginalities at numerous points in a device can result in inconsistent results. The patterning on ICs like DRAMs, CPUs, and GPUs is not perfect. Even slight irregularities in size, shape, and spacing can result in SDCs. This is sometimes referred to as the “oatmeal” effect (Figure 1).

Figure 2. Heat map showing the correlations between some causes of SDC. (Image: Meta Research)

Of course, the various types of ICs vulnerable to SDCs are not used in isolation; they are parts of larger systems. A recent study utilized performance data from a fleet of cloud data centers to examine the correlation between SDCs in memory and other system components. Some of the findings included (Figure 2):

  • Memory errors follow a Pareto distribution where a large portion of the effect comes from a small number of sources.
  • Non-DRAM failures from the memory controller and channel contribute most of the errors.
  • Newer, higher-density DRAMs have higher failure rates.
  • DIMMs with fewer chips and lower transfer widths have lower error rates.
  • CPU and memory utilization rates, CPU% and Memory%, respectively, are correlated with overall server failure rates.
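The Pareto-style concentration in the first finding can be checked on fleet telemetry with a few lines of code. The sketch below computes the share of all errors contributed by the worst-affected fraction of servers. The function name `top_share` and the error counts are synthetic, invented for illustration, and not taken from the study.

```python
def top_share(error_counts, top_fraction):
    """Fraction of all errors contributed by the worst `top_fraction` of servers."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in the range (0, 1]")
    total = sum(error_counts)
    if total == 0:
        return 0.0
    # Rank servers from most to fewest errors and keep the top slice.
    ranked = sorted(error_counts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / total


if __name__ == "__main__":
    # Synthetic per-server error counts for a 100-server fleet (not real data).
    counts = [900, 50, 30] + [1] * 20 + [0] * 77
    for fraction in (0.01, 0.03, 0.25):
        print(f"top {fraction:.0%} of servers: {top_share(counts, fraction):.1%} of errors")
```

A distribution this skewed suggests that identifying and removing a handful of servers may eliminate most errors, which is one reason fleet-level monitoring is worth the effort.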

Detection and mitigation

Detecting and mitigating SDCs is challenging once the ICs are installed in systems. Some defects manifest only under specific combinations of factors such as temperature, voltage, frequency, and instruction sequences.

In one case, it was observed that 1% of servers were responsible for 97.8% of all correctable errors. One way to mitigate the impact of SDCs is to use redundancy and fault-tolerant architectures, where multiple systems or processors verify the results and validate the data.
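One common form of this redundancy is triple modular redundancy with a majority vote. The sketch below is a minimal illustration, not a production design: each callable in `replicas` stands in for a separate core or server, and `faulty_dot` simulates a defective unit by inverting one bit of the result. All function names here are hypothetical.

```python
from collections import Counter


def vote(results):
    """Return the value reported by a strict majority, or raise if none exists."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {results!r}")
    return value


def run_redundant(replicas, *args):
    """Run every replica on the same inputs and return the majority result."""
    return vote([replica(*args) for replica in replicas])


def dot(xs, ys):
    """Reference computation: integer dot product."""
    return sum(x * y for x, y in zip(xs, ys))


def faulty_dot(xs, ys):
    """Simulated defective unit: the correct result with bit 4 inverted."""
    return dot(xs, ys) ^ (1 << 4)


if __name__ == "__main__":
    # Two healthy replicas outvote the simulated faulty one.
    print(run_redundant([dot, dot, faulty_dot], [1, 2, 3], [4, 5, 6]))
```

With only two replicas, a mismatch can be detected but not corrected, so `vote` raises an error in that case. Every input is computed three times here, once per replica, and a result is accepted only when a strict majority agrees.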

Redundancy can be expensive and can slow overall system operation. Another approach is to identify potentially faulty chips before they are integrated into systems.

For example, Intel’s Data Center Diagnostics Tool (DCDiag) uses multiple mechanisms to identify SDC. It’s based on repeatedly performing an operation or calculation and confirming a correct outcome.

Since these tests explicitly confirm the correctness of every calculation, they have improved the identification of defective parts that cause SDC. Some of the tests include confirming the accuracy of core-to-core and socket-to-socket communications and running complex floating-point, integer, and data manipulation instructions.
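The general idea of repeating a calculation and confirming the outcome can be sketched as a known-answer test. The code below is not DCDiag and does not reflect its internals; it is a minimal illustration under stated assumptions. The names `workload`, `repeat_and_compare`, and `make_flaky` are hypothetical, and the golden value is computed locally here, whereas in practice it would be recorded on hardware known to be good.

```python
import hashlib


def workload(size: int = 16) -> str:
    """Deterministic integer matrix multiply, summarized as a SHA-256 digest."""
    a = [[(3 * i + j) % 251 for j in range(size)] for i in range(size)]
    b = [[(i + 5 * j) % 241 for j in range(size)] for i in range(size)]
    product = [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]
    return hashlib.sha256(repr(product).encode("ascii")).hexdigest()


def repeat_and_compare(compute, golden, iterations=100):
    """Run `compute` repeatedly; return the iteration indices that mismatched."""
    return [i for i in range(iterations) if compute() != golden]


def make_flaky(compute, bad_index):
    """Test helper: wrap `compute` so that call number `bad_index` is corrupted."""
    state = {"calls": 0}

    def wrapped():
        index = state["calls"]
        state["calls"] += 1
        result = compute()
        return "corrupted" if index == bad_index else result

    return wrapped


if __name__ == "__main__":
    golden = workload()  # assumption: normally recorded on known-good hardware
    print(repeat_and_compare(workload, golden, iterations=20))
    print(repeat_and_compare(make_flaky(workload, 7), golden, iterations=20))
```

On healthy hardware, the first call should report no mismatches, and the second should report only the injected fault at index 7. A real screening tool would also vary the instruction mix and operating conditions, since some defects appear only under specific combinations of them.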

The Open Compute Project (OCP) recently established its Server Component Resilience Workstream in response to the increasing challenges of SDC. This workstream focuses on research into hardware-caused SDC and the development of effective detection and mitigation tools. Initial members involved in the workstream include AMD, ARM, Google, Intel, Meta, Microsoft, and NVIDIA.

Summary

With the growing complexity of AI training and models and the shrinking feature sizes of advanced ICs, SDC is a growing problem. The leading cause of SDC is so-called “time 0 defects” in hardware that occur during IC fabrication. Because these defects may produce errors only under specific operating conditions, detecting and mitigating their effects is challenging. Recently, the OCP established an industry-wide workstream to develop effective tools for dealing with SDC.

References

Computing’s Hidden Menace: The OCP Takes Action Against Silent Data Corruption (SDC), Open Compute Project
Data Center Silent Data Errors: Implications to Artificial Intelligence Workloads & Mitigations, Intel
Detecting silent errors in the wild: Combining two novel approaches to quickly detect silent data corruptions at scale, Engineering at Meta
Examining Silent Data Corruption: A Lurking, Persistent Problem in Computing, Synopsys
Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field, Meta Research
Silent Data Corruption, Google
Silent Data Corruption: A Survey Article, Asset
Silent Data Corruptions: Microarchitectural Perspectives, IEEE Computer Society
Silent Data Errors: Sources, Detection, and Modeling, IEEE

Copyright © 2025 · WTWH Media LLC and its licensors. All rights reserved.
The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of WTWH Media.
