Here’s how troubleshooting dead servers at Twitter led to me taking on a long-running project with the company’s server vendor. This issue started in late 2018 or early 2019, if I remember correctly. My manager handed me the project seemingly at random during a team meeting. While this shocked me, I eventually used this project as one of the reasons for my later promotion.
Troubleshooting Dead Servers At Twitter: It Was A Specific Server Model
The dead servers I was troubleshooting at Twitter were all a specific server model. Since our servers had proprietary names, I’ll call this model “SKU X.”
While working tickets, I would come across SKU X tickets reporting the server as inaccessible. Since I couldn’t remote into the server, I checked it physically and found it completely dead. No power at all. So I started my checks to see whether the problem was the power supply or something else. My troubleshooting steps led me to a dead power supply, so I treated the issue as such.
Unfortunately, the inaccessibility tickets for SKU X grew as the month went on. And each time I checked the defective servers in the different racks, each had no power. With SKU X, we Site Operations Technicians couldn’t replace the power supply ourselves. So we would take the server out of the rack, place it in our defective parts department (RMA), and wait for the server manufacturer vendor to fix it. Twitter had a service contract with each of its vendors to routinely send a repair technician to fix defective servers us SiteOps techs couldn’t repair.
An Investigation Starts
Since the number of dead SKU X servers kept growing, our server manufacturer vendor’s repair tech became suspicious. Upon checking each server, he found the power supply wasn’t the culprit. It would work if placed into another server. And outside of the server but plugged into an outlet, the power supply passed tests from a power supply tester.
So why did these SKU X servers lose power?
Well, the repair tech talked to his team about possible causes. Then he contacted me to ask a favor: could he take the motherboards out of each of the defective servers and bring them back to his headquarters for extensive testing? I had to ask my manager whether that was okay, since equipment couldn’t leave the data center without permission. My manager said it was fine, since a motherboard couldn’t hold customer data. So the repair tech took ten motherboards with him back to headquarters.
We Finally Find The Culprit
About a week or so later, our server manufacturer vendor informed us the problem with SKU X was the power regulator chip on the motherboard. Unfortunately, it was not soldered properly onto the board, and eventually the connection would fail. Thus, the server could no longer get power from the power supply. This failure would only show itself when the server was powered off: if the servers continued to run, they would always have power even if the chip’s connection had failed.
This issue was widespread because Twitter had thousands of SKU X servers in the fleet. And the servers did turn off on occasion due to repairs, loss of power in the rack, or even the department shutting a server off for one reason or another.
Troubleshooting Dead Servers: The Aftermath
The aftermath of troubleshooting dead servers at Twitter saw me working with the server manufacturer vendor to collect future dead SKU X servers for repair. I would take the defective servers to the RMA area, and the repair tech would remove the motherboards and send them off for repair. It turns out another company manufactured the boards for our vendor, so that company repaired them at no charge. And the server manufacturer vendor extended the warranty on SKU X for an additional three years.
This project did take up a good amount of my time each week because sometimes SKU X servers would die in droves. However, there were stretches when the servers didn’t die, and I could focus on my other daily tasks.