As a Site Operations Technician for Twitter I grew leery of troubleshooting networking issues for one reason: a dead switch port. I knew that if I found one, I had just created more work for myself.
Steps I Took To Find A Dead Switch Port
When it came to troubleshooting switches I took more time. Why? Because not only do these devices cost more than a server, but they control the networking for the entire rack. Thus, I didn’t want to disrupt the operation because that would knock all the servers offline. And I would have to alert Network Engineering to my mistake and receive a stern reprimand. (Frankly, that’s happened to me a few times during my tenure at Twitter and it sucks.)
The first troubleshooting step was to swap the network cable between the server and the affected switch port. Physical media can and will fail. And a good troubleshooter will start with physical connections first, in my opinion. Depending on the server in the rack, the network cable could be a regular Cat 5e Ethernet cable or a Direct Attach Copper (DAC) cable. I kept both on my cart for quick testing.
If the cable wasn’t the problem, I would reseat the Small Form-factor Pluggable (SFP) transceiver in the switch port. Reseating these can resolve some network issues, and it also resets the connection between the server and the switch. However, a reseat may not resolve the issue. So, depending on the type of SFP, I would clean it. Finally, I would swap out the SFP for a known-working one to see if that would resolve the issue.
Now, if none of these troubleshooting steps brought the link up, I would check the link within the switch’s software. Rarely, a tech would find that Network Engineering had shut down a link (i.e., disabled it) and forgotten to bring it back up. Usually, I and other techs would find the dead switch port while checking the switch’s software. That software was Cisco’s IOS or Juniper’s Junos. Other times we saw hundreds or thousands of errors on the port even when using a known-good cable or SFP.
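For readers who think in code, the order of those steps can be sketched as a simple escalation list, cheapest and least disruptive first. This is only an illustration of the sequence described above, not a tool we used: the `STEPS` list, the `troubleshoot` function, and the `link_is_up_after` callback are all hypothetical names, and the callback stands in for a human checking the link after each step.

```python
from typing import Callable, Optional

# Ordered escalation taken from the steps described above.
# Physical checks come first; the switch software comes last.
STEPS = [
    "swap the network cable (Cat 5e or DAC)",
    "reseat the SFP",
    "clean the SFP (if the type allows it)",
    "swap in a known-working SFP",
    "check the port in the switch software",
]


def troubleshoot(link_is_up_after: Callable[[str], bool]) -> Optional[str]:
    """Return the first step that restored link, or None if none did.

    link_is_up_after is a hypothetical callback: it performs (or records)
    one step and reports whether the link came up afterwards.
    """
    for step in STEPS:
        if link_is_up_after(step):
            return step
    # Every step failed: treat the switch port as dead.
    return None
```

Calling `troubleshoot(lambda step: step.startswith("reseat"))` returns `"reseat the SFP"`, while a callback that always returns `False` yields `None`, which is the dead-port case.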
So Why Did We Chuck Out The Entire Switch Over One Dead Port?
That was the question I had when I discovered I had to replace an entire switch. The explanation makes sense.
When a company has a large data center with thousands of racks, maintaining it becomes unwieldy unless the company has good standard operating procedures. Those procedures include using uniform port numbers. For example: Server #25 in each rack would connect to port 25 (or interface 25) on the switch. (Author’s note: This isn’t exactly the case but I’m using this as an easy-to-understand example.) So if port 25 fails and a tech connects server #25 to port 35, the uniformity is ruined. Everyone has to remember that rack ABC has a network switch with a different configuration. That doesn’t work at scale. So to keep the standard operating procedures, we replaced the entire switch.
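Here is a minimal Python sketch of why one moved cable matters, using the simplified "server N goes to port N" convention from my example (which, again, is not the exact real-world scheme). The function names and the `rack_abc` data are made up for illustration.

```python
from typing import Dict, Tuple


def expected_port(server_number: int) -> int:
    # Simplified convention from the example: server N plugs into port N.
    return server_number


def find_deviations(cabling: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Map each off-standard server to (expected port, actual port)."""
    return {
        server: (expected_port(server), port)
        for server, port in cabling.items()
        if port != expected_port(server)
    }


# Hypothetical rack where server #25 was moved to port 35 after port 25 died.
rack_abc = {24: 24, 25: 35, 26: 26}
print(find_deviations(rack_abc))  # {25: (25, 35)}
```

With a uniform layout, any tooling or tech can assume the mapping without looking it up. Every rack that shows up in a deviation list like this one becomes an exception somebody has to track, which is the cost the switch replacement avoided.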
What Happened To The Defective Switches?
All the switches with a dead switch port went to one of two places:
- Back to the manufacturer if under warranty
- Sold to a refurbisher or a reseller if the warranty ran out
With option #1, we would get a replacement switch to put back into inventory, which would be used in a future swap. With option #2, the refurbishing company would fix the switch (if possible) and resell it for a lower price. Or the company could market the switch as having a dead port and sell it for a lower price.