Bug Forces Intel to Halt Some Xeon Sapphire Rapids Shipments

Intel has confirmed that it has paused shipments of some of its fourth-gen Xeon Sapphire Rapids processors due to a newly discovered bug. We received a tip that Intel had paused the shipments and learned several details about the issue from Dylan Patel, Chief Analyst at SemiAnalysis, who says shipments have been paused for certain SKUs since mid-June. We then asked Intel about the matter, and the company issued the following statement to Tom’s Hardware:

“We became aware of an issue on a subset of 4th Generation Intel Xeon Medium Core Count Processors (SPR-MCC) that could interrupt system operation under certain conditions and are actively investigating. This issue was not observed when running commercially available software, and other 4th Generation Intel Xeon processor SKUs (i.e., XCC and HBM) have not exhibited the issue. Out of an abundance of caution, we did temporarily pause some SPR MCC shipments while we gained confidence in the expected firmware mitigation and expect to release remaining shipments shortly.” — Intel Spokesperson to Tom’s Hardware.

In response to a follow-up question, Intel also told us that it doesn’t expect the firmware mitigation to have an impact on performance.

Intel’s Sapphire Rapids processors are created using two types of underlying designs: the XCC package, which employs four compute tiles (dies) to create a single chip, and the MCC package, which uses a single monolithic die. As shown in the slides above, the MCC design is used for chips with up to 32 cores, which are the source of high-volume sales for Intel, while the XCC variants are used for the halo chips with between 36 and 60 cores.

“Intel has faced another crop of design issues related to Sapphire Rapids MCC, the highest volume version of Sapphire Rapids. The 2-socket and 4-socket SKUs have paused shipments due to a timing issue since mid-June,” Patel said.

Intel hasn’t confirmed that the issue is confined to dual- and quad-socket SKUs, instead classifying this issue as limited to a ‘subset’ of the SKUs, and hasn’t stated when the pause in shipments began. Intel also hasn’t confirmed Patel’s assertions that the bug is timing-related, or given us any clarification on the nature of the issue.

A timing issue could cover any number of possibilities, ranging from problems with the UPI interconnect to instruction timing issues, so the true nature of the bug remains nebulous for now. We do know that Intel can correct the issue with a firmware fix that apparently is still in validation, so the issue will not require a redesign or new revision/stepping to fix. Additionally, since new firmware is an adequate fix, Intel might not be required to replace any processors already in the field, although the update could pose a validation headache for its customers.

Intel has earned plenty of criticism not only for its missteps on process node tech that delayed Sapphire Rapids, but also for the issues in its design and validation methodology that led to further delays and numerous new steppings (a stepping is a typically minor redesign that requires a new version of silicon to correct an issue). Intel’s Sapphire Rapids has been plagued with rumors that its design/verification missteps led to 12 steppings. Naturally, that led to severe production delays and missed launch dates.

The company has since communicated that it plans to take a different approach to its design, simulation, and validation flow that will correct those issues. Intel says those adjustments will kick in fully in the next generation of Xeon processors.

Intel says this new Sapphire Rapids bug wasn’t encountered while “running commercially available software,” and it obviously wasn’t caught during validation. This type of situation isn’t entirely unheard of; nearly all complex chips have both known and unknown errata and bugs that are addressed with firmware, driver, and software workarounds that can reduce or eliminate those issues, and they ship that way — that’s the very nature of modern semiconductor design and production.

For example, Intel’s Skylake generation of processors shipped with 53 known errata, and six months later, Intel listed another 40 errata. Another example is the recent discovery that AMD’s EPYC Rome chips crash after 1,044 days of uptime. Some bugs are simply left unfixed, as they aren’t deemed critical enough to fix, or they are fixed with a combination of firmware and software. The most critical bugs sometimes require a new stepping to correct, which is the worst-case scenario. Luckily for Intel, that doesn’t seem to be the case here.

However, while bugs aren’t uncommon, it is uncommon for those types of bugs to lead to a halt in shipments, implying that this is more than a garden-variety error. Intel hasn’t clarified when it plans to resume shipments for Sapphire Rapids, but we’ll update our coverage as we learn more.
