Data Center On-call Life: Determining What Is An Emergency

A big part of data center on-call life was determining with the customer what counted as an emergency. Now, Twitter had rules in place that each department must have more than one server for a particular service. That way, one server failing wouldn’t bring down the entire service. So when a Software Engineer or Site Reliability Engineer (SRE) contacted us with an emergency, I would work with that person to determine whether it really was one.

Turns out that many times the department had multiple working backup servers they could use to replace the faulty one. It’s just that they would rather not do the work to bring one online. Unfortunately, I did run into departments with only one server running a service. On those occasions I would work to get the server back up quickly. Then I would notify the department about Twitter’s policy and provide the documentation on how to request additional servers so they wouldn’t run into this problem again.

Now, networking issues were always an emergency. Network Engineering was great about having redundant switches, routers, connections, and so on. However, sometimes a major networking device, like a Core Switch, would suffer a failure that brought down multiple racks. Once we got their pings we would rush into action, because that many down servers meant platform degradation. Thus, Twitter could lose money. And we didn’t want that.

