Google’s AI Hypercomputer: My Part In Its Deployment In DCs

It’s always a great day when Google publicly releases information about the hardware in its data centers, because I can finally talk about my part in deploying it. I have older posts on my website about my experience deploying Google’s Tensor Processing Unit (TPU) machines, specifically the TPU v5e and TPU v5p. In this post I can discuss my part in deploying the latest hardware running Google’s AI Hypercomputer.

What Is Google’s AI Hypercomputer?

According to this page, listed under the Integrated Supercomputing Architecture category for Google Cloud, the AI Hypercomputer is:

AI optimized hardware, software, and consumption, combined to improve productivity and efficiency.

https://cloud.google.com/solutions/ai-hypercomputer?hl=en

That hardware includes Google Cloud TPU, Google Cloud GPU, Google Cloud Storage, and Google’s Jupiter network. All of these work together to train and run AI models, from small ones to demanding large ones. This hardware also lets individuals and companies serve their models at scale.

My Part In The Hypercomputer’s Deployment

My part in the latest announcement about the new hardware “layer” of Google’s AI Hypercomputer started many months ago, in September 2023. I helped deploy the Compute Engine A3 VM machine, as shown below:

[Image: Google's AI Hypercomputer]

This beast of a machine contains NVIDIA’s H100 Tensor Core GPUs. These GPUs improve the performance of training large models. Google’s machine then pairs that power with improved network capability. In the image above you can see four network cables coming from each machine. They enable the fast transmission of large trained models.

Installing these machines took a good amount of labor and a logical, thorough troubleshooting process. As for the labor, each Compute Engine A3 VM machine is heavy! Don’t worry: we have server lifts in each data center. As for the troubleshooting, these machines contain the latest AI hardware, which makes them complicated. Thus, Data Center Technicians (DTs) like myself follow a specific troubleshooting process depending on the hardware or software error.

I’m proud of the work that the other DTs and I performed in data centers around the world to get these machines up and running for the world to use. Seeing the image above with its tidy cabling and metal grill brings a smile to my face. No, really, it does. That’s why I was so happy to see Google finally release the information about this machine to the public.

There is more to come. Google announced its plans to use NVIDIA’s HGX B200 and GB200 NVL72 GPUs in a future machine. Although I can’t discuss that machine now, I will once that information becomes public too.