This post is for every IT person who has been on-call at their job. Because you understand the fear that comes from a ping. Be it from Slack or Microsoft Teams or maybe even PagerDuty. Some break out in a cold sweat. Others cuss at their device. As for me? I would groan with each ping as I had to stop what I was doing to focus on the new issue. Let me tell you about data center on-call life.
Data Center On-call Life: So Many Frozen Servers
Most of my data center on-call experience comes from my stint working at Twitter as a Site Operations Technician at the Atlanta, GA location. It worked like this: there was a primary and a secondary on-call tech each week. The primary tech handled all the pings from the data center channel in Slack. The secondary tech would pop in to answer those if the primary was busy, but the secondary’s main job was to handle after-hours maintenance during that week. (I will write about my experience with after-hours maintenance in a future post.)
From what I saw and experienced, we techs mostly got pings for frozen (or unresponsive) servers. One of the departments would ping us on-call techs in the data center channel to ask us to physically reboot a frozen server. Basically, a runaway process (a “zombie” in the loose sense) would grow so large that the server could no longer run any command to kill it. In turn, that process would cause problems with that particular service, and the best solution was a physical reboot. Sometimes the department could isolate the server by removing it from the server pool, and they would do that during off-hours. That way the primary tech didn’t have to travel to the data center in the middle of the night.
After the reboot, the server would usually work properly. However, there were times a server didn’t because it had failing hardware. Then I would have to do an emergency repair to get that server back into a good, working state.
Data Center On-call Life: Determining What Is An Emergency
A big part of data center on-call life was determining with the customer what counted as an emergency. Now, Twitter had rules in place that each department must have more than one server for a particular service. That way, one server failing wouldn’t bring down the entire service. So when a Software Engineer or Site Reliability Engineer (SRE) contacted us with an emergency, I would work with that person to determine whether it really was one.
Turns out that many times the department had multiple working backup servers they could use to replace the faulty one. They would just rather not do the work to bring one online. Unfortunately, I did run into departments with only one server running a service. On those occasions I would work to get the server back up quickly. Then I would notify the department about Twitter’s policy and provide the documentation on how to request additional servers so they wouldn’t run into this problem again.
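For the curious, the question at the heart of those conversations can be written down in a few lines of Python. This is only an illustrative sketch, not an actual Twitter tool: `ServicePool`, `is_emergency`, and the hostnames are made-up names, and it assumes you already know which servers in a service's pool are healthy.

```python
from dataclasses import dataclass


@dataclass
class ServicePool:
    """Hypothetical record of one service and the health of its servers."""
    name: str
    servers: dict[str, bool]  # hostname -> True if the server is healthy


def is_emergency(pool: ServicePool, failed_host: str) -> bool:
    """Return True when losing failed_host leaves the service with no healthy server."""
    healthy_others = [
        host
        for host, healthy in pool.servers.items()
        if healthy and host != failed_host
    ]
    # With at least one healthy server left, the service can keep running
    # and the repair can wait for normal hours.
    return len(healthy_others) == 0


# Hypothetical examples: one redundant service, one single-server service.
redundant = ServicePool("example-cache", {"host-a": False, "host-b": True})
single = ServicePool("example-batch", {"host-c": False})

print(is_emergency(redundant, "host-a"))
print(is_emergency(single, "host-c"))
```

The first service still has a healthy server, so the repair can wait. The second has nothing to fall back on, which is the case where I would head in right away.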
Now, networking issues were always an emergency. Network Engineering was great about having redundant switches, routers, connections, and so on. However, sometimes a major networking device, like a Core Switch, would suffer a failure that brought down multiple racks. Once we got their pings we would rush into action, because that many down servers meant platform degradation. Thus, Twitter could lose money. And we didn’t want that.