Today I want to discuss what happens when a data center receives a defective computer part that is already installed in its servers on a mass scale. A few years ago, Samsung sold Twitter defective RAM that ended up in hundreds of Twitter’s newest servers. So how was the issue resolved? With an extensive amount of manual labor and thousands of replacement RAM modules.
Samsung Sold Twitter Defective RAM And Discovered The Issue Too Late
I looked online to see if I could find articles about this issue and I came across this article from Tom’s Hardware in 2019. Unfortunately, I can’t remember exactly whether the issue at the Twitter data centers happened in 2019 or 2020. Either way, according to that article, contaminated equipment at a factory caused the company to dispose of untold millions of dollars of DRAM wafers. From that article, it appears those defective wafers didn’t make it out to customers.
However, another defect in the memory modules Twitter received for its data centers did make it out, and Samsung discovered the issue too late. By the time the company did, the servers were on their way to both the Sacramento and Atlanta data centers. (In some instances, servers were already at the data centers and in production. While these servers were working fine at the time, the memory would eventually fail.)
What Was The Resolution?
The resolution to this problem was to remove all the sticks of RAM from each server and install new ones. The affected server model held 12 sticks of RAM, if I remember correctly. Thus, Samsung had to replace thousands of sticks in both data centers.
Not only that, the company needed to note the serial number of each memory module removed from a server and then record the serial number of its replacement. Thankfully, Samsung had the serial numbers of the defective sticks of memory, so both Samsung and our server manufacturer vendor could keep track of and remove all the bad modules.
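I don’t know what tooling was actually used to track the serial numbers, but the bookkeeping described above can be sketched in a few lines of Python. Everything here (the SwapLog class, its method names, and the example serial numbers) is hypothetical and only illustrates the idea of recording each removed/replacement pair and checking which known-bad modules are still unaccounted for.

```python
from dataclasses import dataclass, field


@dataclass
class SwapLog:
    """Hypothetical tracker for a RAM replacement project."""

    # Serial numbers the memory vendor has flagged as defective.
    defective_serials: set[str]
    # One (server, removed_serial, replacement_serial) entry per swapped stick.
    swaps: list[tuple[str, str, str]] = field(default_factory=list)

    def record_swap(self, server: str, removed: str, replacement: str) -> None:
        # A replacement stick must never come from the defective list.
        if replacement in self.defective_serials:
            raise ValueError(f"{replacement} is on the defective list")
        self.swaps.append((server, removed, replacement))

    def outstanding(self) -> set[str]:
        # Defective serials that have not yet been recorded as removed.
        removed = {removed for _, removed, _ in self.swaps}
        return self.defective_serials - removed


log = SwapLog(defective_serials={"BAD-001", "BAD-002"})
log.record_swap("server-01", "BAD-001", "NEW-101")
print(log.outstanding())  # the defective modules still to be pulled
```

Keeping the defective list as a set makes the "what is still out there?" question a simple set difference, which is the question both vendors needed answered at the end of the project.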
Yet, this caused a question to come up: Who was going to replace all these modules? Would it be the Site Operation Technicians at Twitter? Nope. We were busy enough with the regular break-fix tickets. Would it be Samsung? Nope. They didn’t have the manpower to send out to both data centers to do the replacement. Instead, they would supply the memory and pay for our server manufacturer vendor to send a team of their techs to do the work. And it was quite a job!
Samsung Sold Twitter Defective RAM And The Replacement Took Weeks
Because Samsung sold Twitter defective RAM and the replacement was going to take weeks, Twitter’s Hardware Engineering created an emergency project to track everything. Two of my coworkers at Atlanta stopped their regular work to handle the project, and the same happened at Sacramento. Those coworkers had to coordinate the team replacing the RAM and keep track of all the serial numbers of the memory modules. They also had to coordinate with the departments that owned the servers already in production to take them offline at specific times so the replacement could happen.
On top of that, each server had to go through an extensive memory test to verify the replacement sticks were fine. Once a server passed, it could move on to the next step of the deployment process: either it was given back to its department or, if it hadn’t been placed in production before, it would be marked as ready for deployment.
Overall, this project took weeks for the team to complete. We became chummy with the techs from our server manufacturer vendor and would invite them to our office to eat snacks and lunch. And they could eat! It was fine, though. We just ordered more snacks and drinks.
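As a rough illustration of that hand-off, here is a small Python sketch of the decision made after the memory test. The names are hypothetical, and the NEEDS_ATTENTION outcome for a failed test is my own assumption, since I’m only describing what happened when servers passed.

```python
from enum import Enum


class NextStep(Enum):
    RETURN_TO_DEPARTMENT = "return to owning department"
    READY_FOR_DEPLOYMENT = "mark as ready for deployment"
    NEEDS_ATTENTION = "failed memory test"  # assumed outcome, not from the story


def next_step(memory_test_passed: bool, was_in_production: bool) -> NextStep:
    """Decide where a server goes after its replacement RAM is tested."""
    if not memory_test_passed:
        return NextStep.NEEDS_ATTENTION
    if was_in_production:
        # The server belonged to a department before the swap, so it goes back.
        return NextStep.RETURN_TO_DEPARTMENT
    # The server was never in production, so it rejoins the deployment queue.
    return NextStep.READY_FOR_DEPLOYMENT


print(next_step(memory_test_passed=True, was_in_production=False).value)
```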