The New Era of AI Centers

by Jayshree Ullal on May 29, 2024

In 1984, Sun Microsystems famously declared, “The Network is the Computer.” Forty years later, that cycle is coming true again with the advent of AI. The collective nature of AI training relies on a lossless, highly available network to seamlessly connect every GPU in the cluster to every other and enable peak performance. Networks also connect trained AI models to end users and to other systems in the data center, such as storage, allowing the whole to become more than the sum of its parts. As a result, data centers are evolving into new AI Centers, where the network becomes the epicenter of AI management.

Trends in AI

To appreciate this, let’s first look at the explosion of AI datasets. As the size of large language models (LLMs) grows, parallelizing AI training becomes inevitable: no single GPU, nor any small pool of them, can keep up with the massive parameter counts and dataset sizes. AI parallelization, be it data, model, or pipeline, is only as effective as the network that interconnects the GPUs. At every step, GPUs must exchange gradients and compute a global update to adjust the model’s weights, as sketched below. To do so, the disparate components of the AI puzzle have to work cohesively as one single AI Center: GPUs, NICs, interconnecting accessories such as optics and cables, storage systems, and, most importantly, the network at the center of them all.
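
To make the network dependency concrete, here is a minimal sketch (not Arista code) of the gradient exchange at the heart of data-parallel training. The worker count and gradient sizes are illustrative; in a real cluster the averaging step is an all-reduce carried over the fabric by a collective library such as NCCL or RCCL.

```python
# Minimal sketch of data-parallel training's network dependency: after
# each step, every worker must average its local gradients with all the
# others (an "all-reduce") before any weight update can proceed.
import numpy as np

def all_reduce_mean(local_grads):
    """Average gradients across workers; in a real cluster this traverses
    the network (e.g., a ring or tree all-reduce over Ethernet)."""
    return sum(local_grads) / len(local_grads)

rng = np.random.default_rng(0)
workers = 8                                           # hypothetical group size
grads = [rng.normal(size=4) for _ in range(workers)]  # per-GPU local gradients

global_grad = all_reduce_mean(grads)
print("global gradient:", global_grad)
# Every step moves on the order of the model size in bytes per GPU across
# the fabric, so one slow or lossy link stalls all eight workers, not one.
```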

Today’s Network Silos

There are many causes of suboptimal performance in today’s AI-based data centers. First and foremost, AI networking demands consistent end-to-end Quality of Service for lossless transport. The NICs in a server and the networking platforms must share uniform markings and mappings, accurate congestion controls and notifications (PFC and ECN with DCQCN), and appropriate buffer utilization thresholds. Only then can each component react promptly to network events like congestion, so the sender can precisely control the traffic flow rate and avoid packet drops. Today, however, NICs and networking devices are configured separately, and any configuration mismatch can be extremely difficult to debug in large AI networks, as the sketch below illustrates.
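
The sketch below illustrates the consistency problem in miniature. All field names and values are hypothetical, but the shape of the failure is real: a DSCP marking, PFC priority, or ECN threshold that differs between a NIC and its switch silently breaks lossless behavior.

```python
# Hypothetical sketch: the DSCP-to-priority mappings and ECN thresholds
# configured on NICs and switches must match end to end, or DCQCN
# misbehaves. All keys and values here are illustrative placeholders.
switch_qos = {"rocev2_dscp": 26, "pfc_priority": 3,
              "ecn_min_kbytes": 150, "ecn_max_kbytes": 3000}
nic_qos    = {"rocev2_dscp": 26, "pfc_priority": 4,   # drifted setting!
              "ecn_min_kbytes": 150, "ecn_max_kbytes": 3000}

mismatches = {k: (switch_qos[k], nic_qos[k])
              for k in switch_qos if switch_qos[k] != nic_qos[k]}
for key, (sw, nic) in mismatches.items():
    print(f"MISMATCH {key}: switch={sw} nic={nic}")
# With devices configured separately, nothing catches this drift until
# RoCE traffic lands in a queue that pause frames do not protect.
```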

A common cause of poor performance is component failure. Servers, GPUs, NICs, transceivers, cables, switches, and routers can all fail, triggering go-back-N retransmissions or, even worse, stalling an entire job, which carries huge performance penalties. The probability of a failure somewhere grows quickly with cluster size, as the arithmetic below shows. Traditionally, GPU vendors’ collective communication libraries (CCLs) try to discover the underlying network topology using localization techniques, but discrepancies between the discovered topology and the actual one can severely impact the job completion times of AI training.
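
A back-of-the-envelope calculation shows why failures dominate at scale. The per-component failure rate below is an assumed placeholder; the point is the compounding.

```python
# If each of n components fails independently with daily probability p,
# the chance that at least one fails during a training day is
# 1 - (1 - p)^n, which climbs steeply as the cluster grows.
p = 1e-4                                # assumed per-component daily rate
for n in (1_000, 10_000, 100_000):      # optics + NICs + GPUs + switches
    p_any = 1 - (1 - p) ** n
    print(f"{n:>7} components -> P(at least one failure/day) = {p_any:.1%}")
# ~9.5% at 1,000 components, ~63% at 10,000, near-certainty at 100,000.
```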

Another aspect of AI networks is that most operators have separate teams designing and managing distinct compute vs. network infrastructures. This involves the use of different orchestration systems for configuration, validation, monitoring, and upgrades. The lack of a single point of control and visibility makes it extremely difficult to identify and localize performance issues. All of these problems are exacerbated as the size of the AI cluster grows.

It’s easy to see how these silos can deepen and compound the problem. Splitting operations between compute and networking can make it hard to link the technologies together for optimum performance, and can delay diagnosing and resolving performance degradation or outright failures. Networking itself can bifurcate into islands of InfiniBand HPC clusters distinct from Ethernet-based data centers. This, in turn, limits investment protection, forces awkward gateways to pass data between the islands, and complicates linking compute to storage and to end users. Focusing on any one technology (compute, for example) in isolation from all other aspects of the holistic solution ignores the interdependent and interconnected nature of the technologies, as shown below.

[Figure: Today's Network Silos]

Rise of the New AI Center

The new AI Center recognizes and embraces the totality of this modern, interdependent ecosystem. The whole system rises together for optimum performance rather than foundering in isolation as with prior network silos. GPUs need an optimized, lossless network to complete AI training in the shortest time possible, and the trained models then need to connect to AI inference clusters so end users can query them. Compute nodes, spanning both GPUs/AI accelerators and CPUs/general compute, need to communicate with storage systems as well as the other existing IT systems in the data center. Nothing works alone. The network acts as connective tissue sparking all of those points of interaction, much as a nervous system provides pathways between neurons in humans.

The value in each case lies in the collective outcome enabled by the total system linked together as one, not in the individual components acting alone. For people, the value comes from the thoughts and actions enabled by the nervous system, not the neurons alone. Similarly, the value of an AI Center is the output consumed by end users solving problems with AI, enabled by training clusters linked to inference clusters, linked to storage and other IT systems, all integrated by a lossless network acting as the central nervous system. The AI Center shines by eliminating silos to enable coordinated performance tuning, troubleshooting, and operations, with the central network playing a pivotal role in creating and powering the linked system.

[Figure: Ethernet at Scale: AI Center]

Arista EOS Powers AI Centers

EOS is Arista's best-in-class operating system that powers the world's largest scale-out AI networks, bringing together all parts of the ecosystem to create the new AI Center. If a network is the nervous system of the AI Center, then EOS is the brain driving the nervous system.

A new Arista innovation, built into EOS, further extends the interconnected concept of the AI Center by linking the network more closely to connected hosts as a holistic system. EOS extends network-wide control, telemetry, and lossless QoS characteristics from the network switches down to a remote EOS agent running on the NICs of directly attached servers and GPUs. With the remote agent deployed on the AI NIC, the switch becomes the epicenter of the AI network, able to configure, monitor, and debug problems on the AI hosts and GPUs. This provides a single, uniform point of control and visibility. Leveraging the remote agent, the network and its hosts can be kept configuration-consistent, including end-to-end traffic tuning, as a single homogeneous entity. Communication between EOS running in the network and the remote agent on the host allows instantaneous tracking and reporting of host and network behaviors, so failures can be rapidly isolated. It also means that EOS can directly report the network topology, centralizing topology discovery and leveraging familiar Arista EOS configuration and management constructs across all Arista Etherlink™ platforms and partners.
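
The blog does not publish the remote agent's interface, so the sketch below is purely illustrative of the "single point of control" idea: the switch holds the desired end-to-end QoS state and reconciles each host NIC against it. Every class, method, and value here is a hypothetical stand-in, not the actual EOS remote-agent API.

```python
# Purely illustrative: a switch-driven reconciliation loop that detects
# and corrects QoS drift on host NIC agents. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class QosIntent:
    rocev2_dscp: int
    pfc_priority: int
    ecn_min_kbytes: int

DESIRED = QosIntent(rocev2_dscp=26, pfc_priority=3, ecn_min_kbytes=150)

class HostAgent:                      # hypothetical stand-in for a NIC agent
    def __init__(self, name, state):
        self.name, self.state = name, state
    def report(self):                 # telemetry back to the switch
        return self.state
    def apply(self, intent):          # config pushed down from the switch
        self.state = intent

hosts = [HostAgent("gpu-host-1", DESIRED),
         HostAgent("gpu-host-2", QosIntent(26, 4, 150))]  # drifted config

for h in hosts:                       # the switch as single point of control
    if h.report() != DESIRED:
        print(f"{h.name}: drift detected, re-applying intent")
        h.apply(DESIRED)
```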

A Rich Ecosystem of Partners Including AMD, Broadcom, Intel, and NVIDIA

With the goal of building robust, hyperscale AI networks with the lowest job completion times, Arista is coalescing the entire AI Center ecosystem of network switches, NICs, transceivers, cables, GPUs, and servers so that it can be configured, managed, and monitored as a single unit. This reduces TCO and improves productivity across both the compute and network domains. The AI Center vision is a first step in enabling open, cohesive interoperability and manageability between the AI network and the hosts. We are staying true to our commitment to open standards with Arista EOS, leveraging OpenConfig to enable AI Centers, as in the sketch below.
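
As a small illustration of what the OpenConfig commitment buys in practice, the sketch below reads standard interface counters from a gNMI-enabled switch using the open-source pygnmi client. The client library choice, address, port, and credentials are all assumptions, not details from this post; the OpenConfig path itself is vendor-neutral and works the same across compliant gear.

```python
# Minimal sketch: pull OpenConfig interface counters over gNMI.
# Target address and credentials below are placeholders.
from pygnmi.client import gNMIclient

TARGET = ("198.51.100.10", 6030)     # hypothetical switch address and port

with gNMIclient(target=TARGET, username="admin",
                password="admin", insecure=True) as gc:
    # An OpenConfig path: the same query works on any compliant device,
    # which is what makes multi-vendor AI Center management tractable.
    result = gc.get(path=["/interfaces/interface/state/counters"])
    print(result)
```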

We are proud to partner with our esteemed colleagues to make this possible.

Welcome to the new open world of AI Centers!

Opinions expressed here are the personal opinions of the original authors, not of Arista Networks. The content is provided for informational purposes only and is not meant to be an endorsement or representation by Arista Networks or any other party.

Written by Jayshree Ullal
As CEO and Chairperson of Arista, Jayshree Ullal is responsible for Arista's business and thought leadership in AI and cloud networking. She led the company to a historic and successful IPO in June 2014, from zero to a multibillion-dollar business. Formerly, Jayshree was Senior Vice President at Cisco, responsible for a $10B business in data center, switching, and services. With more than 40 years of networking experience, she is the recipient of numerous awards, including E&Y's "Entrepreneur of the Year" in 2015, Barron's "World's Best CEOs" in 2018, and one of Fortune's "Top 20 Businesspersons" in 2019. Jayshree holds a B.S. in Engineering (Electrical) and an M.S. in engineering management. She is a recipient of the SFSU and SCU Distinguished Alumni Awards in 2013 and 2016.
