Google Cloud TPU v5p: ML Machines I Help Deploy

Fresh off of the Gemini announcement is a new article and video about Google Cloud TPU v5p. And I’m so excited this information came out because now I can discuss more about the Machine Learning (ML) machines (servers) I help deploy as a Google Data Center Technician. I discussed in a previous blog post that I couldn’t release much information about my role due to the Non-Disclosure Agreement (NDA) I signed. While I still can’t reveal certain parts of my job that aren’t in the above-linked article, there is still much I can discuss.

Google Cloud TPU v5p: What Is It?

Google Cloud TPU v5p is the fifth version of the company’s Tensor Processing Units, or TPUs. According to the website dedicated to the technology:

Google Cloud TPUs are custom-designed AI accelerators, which are optimized for training and inference of large AI models. They are ideal for a variety of use cases, such as chatbots, code generation, media content generation, synthetic speech, vision services, recommendation engines, personalization models, among others.

Google designed the TPUs to run neural networks. These chips are proprietary and include special features such as a matrix multiply unit (MXU) and a proprietary interconnect topology. Thus, if an outside company wants to use these chips, it has to go through Google.
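To make the MXU's role concrete, here's a minimal sketch in plain Python (my own illustration, not Google's code) of the dense matrix multiply at the heart of a neural network layer — the operation the MXU performs in hardware:

```python
# A neural-network layer is, at its core, a matrix multiply plus a bias.
# The MXU in a TPU executes the multiply as a hardware operation;
# here it's spelled out in plain Python purely for illustration.

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def dense_layer(x, weights, bias):
    """One dense layer: y = x @ W + b (activation omitted for brevity)."""
    y = matmul(x, weights)
    return [[y[i][j] + bias[j] for j in range(len(bias))]
            for i in range(len(y))]

# A batch of 2 inputs with 3 features each, projected to 2 outputs.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
w = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]
print(dense_layer(x, w, b))  # [[4.5, 4.5], [10.5, 10.5]]
```

A real training step runs millions of these multiplies, which is why a dedicated matrix unit pays off so dramatically.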

What Makes This Version So Special?

This version stands out from the previous four generations because it's Google's "most powerful, scalable, and flexible AI accelerator thus far. TPUs have long been the basis for training and serving AI-powered products like YouTube, Gmail, Google Maps, Google Play, and Android."

How powerful? Here are the stats:

By contrast, Cloud TPU v5p, is our most powerful TPU thus far. Each TPU v5p pod composes together 8,960 chips over our highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology. Compared to TPU v4, TPU v5p features more than 2X greater FLOPS and 3X more high-bandwidth memory (HBM).
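Those per-chip numbers imply a staggering aggregate. As a back-of-the-envelope check (my own arithmetic, using only the figures quoted above):

```python
# Back-of-the-envelope aggregate for one TPU v5p pod, from the quoted specs.
# This simply multiplies per-chip ICI bandwidth by the chip count; it does
# not model the 3D torus topology or how individual links are shared.

CHIPS_PER_POD = 8_960        # chips in one v5p pod
ICI_GBPS_PER_CHIP = 4_800    # inter-chip interconnect bandwidth per chip

total_ici_tbps = CHIPS_PER_POD * ICI_GBPS_PER_CHIP / 1_000
print(f"Aggregate ICI bandwidth per pod: {total_ici_tbps:,.0f} Tbps")
# → Aggregate ICI bandwidth per pod: 43,008 Tbps
```

That's on the order of 43 petabits per second of chip-to-chip bandwidth inside a single pod, which gives a sense of why the fiber plant described below is so dense.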

Designed for performance, flexibility, and scale, TPU v5p can train large LLM models 2.8X faster than the previous-generation TPU v4. Moreover, with second-generation SparseCores, TPU v5p can train embedding-dense models 1.9X faster than TPU v4.

See The Inside Of Where I Work

As you saw, that's the layout of a Google data center hosting the fifth version of their ML machines. And, yes, the data center really is that clean. Everyone working on the floor does their best all day, every day to keep the aisles clean.

Next, in the close-up of the ML machine you can see fiber running throughout the rack and to each machine. I have to troubleshoot this fiber because it can get dirty or break, which will cause the machine to malfunction. Unfortunately, optics can fail too, and I replace them when they do.

Those fiber connections from the ML machines all feed into an Optical Circuit Switch (OCS). To learn more about those network devices, please read this Google Cloud blog post. Datacenter Dynamics also did a write-up about the OCS here. I also have to troubleshoot the connections at the OCS, for the same reasons as at each machine. Yes, there are hundreds of fibers to go through, but Google has a system for finding the specific fiber for a particular machine.

Finally, as you saw in the video, the ML machines feature water-cooling. I don't have much to do with troubleshooting the water-cooling pumps or system, as another department handles that. However, I do have to reconnect the pipes, wearing the proper Personal Protective Equipment (PPE), when I work on those machines.
