Google Cloud TPUv5e: Introducing The ML Machines I Help Deploy

Fresh off Google Cloud Next come a new article and video about Google Cloud TPUv5e. I’m excited this information is now public, because I can finally say more about the Machine Learning (ML) machines (servers) I help deploy as a Google Data Center Technician. In a previous blog post, I explained that I couldn’t share much about my role because of the Non-Disclosure Agreement (NDA) I signed. While I still can’t reveal parts of my job that aren’t covered in the above-linked article, there is still much I can discuss.

Google Cloud TPUv5e: What Is It?

Google Cloud TPUv5e is the fifth generation of the company’s Tensor Processing Units, or TPUs. According to the website dedicated to the technology:

Google Cloud TPUs are custom-designed AI accelerators, which are optimized for training and inference of large AI models. They are ideal for a variety of use cases, such as chatbots, code generation, media content generation, synthetic speech, vision services, recommendation engines, personalization models, among others.

https://cloud.google.com/tpu

Google designed the TPUs to run neural networks. The chips are proprietary and include special features such as the matrix multiply unit (MXU) and a proprietary interconnect topology. Thus, any outside company that wants to use these chips has to go through Google.

What Makes This Version So Special?

This version stands out from the previous four generations because it’s “purpose-built to bring the cost-efficiency and performance required for medium- and large-scale training and inference. TPU v5e delivers up to 2x higher training performance per dollar and up to 2.5x inference performance per dollar for LLMs and gen AI models compared to Cloud TPU v4. At less than half the cost of TPU v4, TPU v5e makes it possible for more organizations to train and deploy larger, more complex AI models.”

AI training is expensive, which can dissuade some medium-sized companies from pursuing it. With this new ML machine, they can now train and deploy models without a huge expenditure of cash.
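To put those “up to” figures in perspective, here is a rough back-of-the-envelope sketch in Python. The $100,000 baseline bill and the function name are made up purely for illustration and are not Google pricing. The 2x and 2.5x factors are the best-case numbers from the quote above, so real savings may well be smaller.

```python
def cost_for_same_work(baseline_cost: float, perf_per_dollar_gain: float) -> float:
    """Cost of the same amount of work after performance per dollar improves.

    If each dollar buys `perf_per_dollar_gain` times more work, the same job
    costs the baseline divided by that factor.
    """
    if perf_per_dollar_gain <= 0:
        raise ValueError("perf_per_dollar_gain must be positive")
    return baseline_cost / perf_per_dollar_gain


# Hypothetical bill for a job on TPU v4 (an assumption, not real pricing).
baseline = 100_000.0

# Best-case factors quoted for TPU v5e versus TPU v4.
training_cost = cost_for_same_work(baseline, 2.0)
inference_cost = cost_for_same_work(baseline, 2.5)

print(f"Training:  ${training_cost:,.0f}")
print(f"Inference: ${inference_cost:,.0f}")
```

In other words, if the best-case numbers hold, a job that cost a hypothetical $100,000 on TPU v4 would cost roughly half that for training and about 40% of that for inference.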

See The Inside Of Where I Work

Watch this quick video to see inside a Google data center hosting this ML machine:

As you saw, that’s the layout of a Google data center hosting the fifth generation of their ML machines. And, yes, the data center is that clean. Everyone working on the floor does their best all day, every day, to keep the aisles clean.

The motherboard containing the TPU chips really is that large. In the video, however, it is shown without the heatsinks that sit over each chip.

Next, in the close-up of the ML machine, you can see fiber running throughout the rack and to each machine. I do have to troubleshoot this fiber because it can be dirty or broken, which will cause the machine to malfunction. Unfortunately, optics can fail too, and when they do, I replace them.

Those fiber connections from the ML machines all feed into an Optical Circuit Switch (OCS). To learn more about those network devices, please read this Google Cloud blog post. Datacenter Dynamics also did a write-up about the OCS here. I also have to troubleshoot the connections at the OCS for the same reasons I do at each machine. Yes, there are hundreds of fibers to go through, but Google has a system for finding the specific fiber for a particular machine.

Finally, as you saw in the video, the ML machines feature water-cooling. I don’t have much to do with troubleshooting the water-cooling pumps or system, as another department handles that. However, when I work on those machines, I do have to reconnect the pipes while wearing the proper Personal Protective Equipment (PPE).