When I was a Site Operations Technician for Twitter, one of my main tasks was diagnosing faulty servers in the Atlanta, GA data center. This task was quite time-consuming at times because of the lack of automation. You would think a Silicon Valley company would have an application that could scan a faulty server and determine the problem, right? I did too. Unfortunately, that wasn’t the case. Let me explain the steps we took.
Diagnosing Faulty Servers Took The Terminal And Various Commands
Yep, the main software tool in my arsenal was the Terminal (on my MacBook) and a bunch of different commands to check the hardware in the server and view the logs. So what commands did I use? Here’s the list: smartctl, dmidecode, ipmitool sel elist, and dmesg.
Here’s a quick overview of each. I used smartctl to find the defective hard drive or drives in a server. If S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) reports any errors, they show up in this command’s output. A drive can also pass S.M.A.R.T. yet have a bunch of reallocated sectors, which means the drive has marked those sectors as unusable and remapped them to spares. A drive like that will usually fail eventually.
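To make the "passes S.M.A.R.T. but still looks bad" case concrete, here is a minimal Python sketch that reads text in the style of smartctl's health and attribute output and flags a drive whose reallocated sector count is above a limit. The sample text, the function name summarize_smart, and the realloc_limit parameter are all made up for illustration. This is not the tooling we had at Twitter, and real output varies by drive type and smartctl version.

```python
import re

# Illustrative sample in the style of `smartctl -H -A` output for an ATA drive.
# The numbers are invented for this example.
SAMPLE_OUTPUT = """\
SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       312
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25731
"""


def summarize_smart(text, realloc_limit=0):
    """Return the health verdict, reallocated sector count, and a 'suspect' flag.

    realloc_limit is a hypothetical threshold: any count above it marks the
    drive as suspect even when the overall health check passed.
    """
    health = re.search(r"self-assessment test result:\s*(\w+)", text)
    passed = bool(health) and health.group(1) == "PASSED"

    reallocated = None
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            reallocated = int(fields[9])

    suspect = (not passed) or (reallocated is not None and reallocated > realloc_limit)
    return {"passed": passed, "reallocated": reallocated, "suspect": suspect}
```

Calling summarize_smart(SAMPLE_OUTPUT) on the sample reports a drive that passed the health check but is still marked suspect because of its 312 reallocated sectors.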
I sometimes used dmidecode to find faulty memory modules or to get more information about a particular module. Usually I used ipmitool sel elist to pull the entries from the System Event Log (SEL), which would include information about a defective memory module or even a motherboard failure. Finally, I occasionally grepped the dmesg output for disk failures or problems with the motherboard or CPU.
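Scanning the SEL by eye gets old fast, so here is a rough Python sketch of how the pipe-separated lines that ipmitool sel elist prints could be split into fields and filtered for memory events. The sample entries, sensor numbers, and the function names parse_sel and memory_events are invented for illustration, and the exact wording of events differs between vendors and firmware.

```python
# Illustrative sample in the style of `ipmitool sel elist` output.
# The entries are invented for this example.
SAMPLE_SEL = """\
   1 | 03/02/2020 | 14:05:11 | Memory #0x53 | Correctable ECC | Asserted
   2 | 03/02/2020 | 14:07:42 | Power Supply #0x61 | Failure detected | Asserted
   3 | 03/03/2020 | 09:12:30 | Memory #0x53 | Uncorrectable ECC | Asserted
"""


def parse_sel(text):
    """Split each SEL line on '|' into a dict; skip lines that don't fit."""
    events = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) < 6:
            continue  # blank or unexpected line
        events.append({
            "id": parts[0],
            "date": parts[1],
            "time": parts[2],
            "sensor": parts[3],
            "event": parts[4],
            "state": parts[5],
        })
    return events


def memory_events(events):
    """Keep only the entries whose sensor field starts with 'Memory'."""
    return [e for e in events if e["sensor"].lower().startswith("memory")]
```

With the sample above, memory_events(parse_sel(SAMPLE_SEL)) keeps entries 1 and 3 and drops the power supply entry. The same filter-by-keyword idea applies to grepping dmesg output.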
Diagnosing Faulty Servers Sometimes Required A Physical Visit
I usually diagnosed servers from the comfort of my desk upstairs. However, I had to leave my comfortable spot whenever I couldn’t remote into a server, and this was quite common. A server needs both power and a network connection for SSH (Secure Shell) to work, and sometimes one or the other had died.
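A quick way to tell whether a trip to the floor is coming is to check if the server even answers on the SSH port. Here is a minimal Python sketch of that check using only the standard library. The function names and the default port of 22 are assumptions for illustration. An open port only shows that something is listening, not that a login will succeed.

```python
import socket


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def unreachable(hosts, port=22, timeout=3.0):
    """Return the hosts that did not answer, i.e. crash cart candidates."""
    return [h for h in hosts if not ssh_port_open(h, port, timeout)]
```

Passing a list of hostnames to unreachable() gives back the ones that likely need a physical visit.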
Other times the server would be locked up. Usually I could send a restart command remotely, but sometimes I couldn’t. When that happened, I had to wheel my crash cart (a cart with a monitor and keyboard) over and connect it to the server to see what was going on. Usually I would reboot the server and see if I could remote into it after that. Sometimes I could, other times I couldn’t.
I did sometimes ask my coworkers for assistance if they were near a server. This would save me time and a good bit of walking. However, I didn’t like to bother them too much because they were usually on the floor working. I hate it when people interrupt my workflow, so I do my best not to disrupt theirs.
Automation Finally Arrives!
In the middle of 2020 we finally got automation working to diagnose servers. Several departments worked together to develop the proprietary software that remotes into servers and determines the faulty hardware. This really helped us Site Operations Techs because we didn’t have to spend so much time manually diagnosing each server. However, the software ran into the same issues we humans did if a server wasn’t online. So we still had some job security.