The New AI Era: Networking for AI and AI for Networking*

Jayshree Ullal : Mar 25, 2024 9:00:00 AM

As we all recover from NVIDIA’s exhilarating GTC 2024 in San Jose last week, state-of-the-art AI news is arriving fast and furious. NVIDIA’s latest Blackwell GPU announcement and Meta’s blog validating Ethernet for its pair of 24,000-GPU clusters used to train its Llama 3 large language model (LLM) made the headlines. Networking has come a long way, accelerating pervasive compute, storage, and AI workloads for the next era of AI. Our large customers across every market segment, as well as the cloud and AI titans, recognize the rapid improvements in productivity and the unprecedented insights and knowledge that AI enables. At the heart of many of these AI clusters is the flagship Arista 7800R AI spine.

Robust Networking for AI

Activating these new AI use cases requires the LLMs to be trained first. These backend AI training clusters require a fundamentally new approach to building networks. Their massively parallelized workloads are characterized by elephant traffic flows that can cause congestion throughout the network, impacting job completion time (JCT), which is measured across the entire workload. Congestion in any single flow can ripple outward and slow down the entire AI cluster, as the workload must wait for that delayed transmission to complete. AI clusters must be architected with massive capacity to accommodate these traffic patterns from distributed GPUs, with deterministic latency and lossless, deep-buffer fabrics designed to eliminate unwanted congestion.
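
To make the ripple effect concrete, here is a minimal sketch of how one delayed flow stretches job completion time. It assumes a simplified model in which every training step is a synchronized exchange that finishes only when its slowest flow finishes; the function names and the timing values are hypothetical and chosen purely for illustration.

```python
def step_completion_time(flow_times_ms):
    """A synchronized step completes only when its slowest flow completes."""
    if not flow_times_ms:
        raise ValueError("a step needs at least one flow")
    return max(flow_times_ms)


def job_completion_time(steps):
    """JCT in this toy model: the sum of per-step completion times."""
    return sum(step_completion_time(flows) for flows in steps)


# Hypothetical job: 3 steps, 4 flows per step, times in milliseconds.
healthy = [[10.0, 10.0, 10.0, 10.0] for _ in range(3)]
# Same job, but one flow per step is delayed by congestion.
congested = [[10.0, 10.0, 10.0, 25.0] for _ in range(3)]

print(job_completion_time(healthy))    # 30.0
print(job_completion_time(congested))  # 75.0
```

In this model, three of the four flows are unaffected, yet the job takes 2.5 times as long, because every step waits for the single delayed transmission.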

Arista’s Etherlink for Standards Compatibility

As the Ultra Ethernet Consortium (UEC) completes its extensions to improve Ethernet for AI workloads, Arista assures customers that we can offer UEC-compatible products that are easily upgradable to the standards as the UEC specification firms up in 2025. To quote Meta, “Through careful co-design of the network, software, and model architectures, we have successfully used both Ethernet/RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3) without any network bottlenecks.” This validates that lossless Ethernet can not only meet the rigorous baseline to host AI workloads but can also evolve to support the UEC open standards when they are available.

Arista Etherlink™ is standards-based Ethernet with UEC-compatible features. These include dynamic load balancing, congestion control, and reliable packet delivery to all NICs supporting RoCE. Arista Etherlink will be supported across a broad range of 800G systems and line cards based on Arista EOS®. As the UEC specification is finalized, Arista AI platforms will be upgradeable to be compliant.

Arista’s Etherlink platforms offer three important network characteristics:

Network Scale: AI workloads push the “collective” operation, where allreduce and all-to-all are the dominant collective types. Today’s models are already moving from billions to one trillion parameters with GPT-4. Of course, we have others such as Google Gemini, open source Llama and xAI’s Grok. During the compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown due to a poor network can critically impact the AI application performance. The Arista Etherlink AI topology will allow every flow to simultaneously access all paths to the destination with dynamic load balancing at multi-terabit speeds. Arista Etherlink supports a radix from 1,000 to 100,000 GPU nodes today, which will go to more than one million GPUs in the future.
Predictable, Deterministic Latency: Rapid and reliable bulk transfer from source to destination is key to all AI job completion. Per-packet latency is important, but the AI workload is most dependent on the timely completion of an entire processing step. In other words, the latency of the whole message is critical. Flexible ordering mechanisms use all Etherlink paths from the NIC to the switch to guarantee end-to-end predictable communication.
Congestion Management: Congestion in AI networks commonly takes the form of the “incast” problem, which occurs on the last link to an AI receiver when multiple uncoordinated senders simultaneously send traffic to it. To avoid hotspots or flow collisions across expensive GPU clusters, algorithms are being defined to throttle, notify, and evenly spread the load across multiple paths, improving the utilization and TCO of these expensive GPUs with a virtual output queue (VOQ) fabric.
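
The flow collisions described above can be illustrated with a small sketch that compares two ways of placing flows onto parallel paths. This is not Arista's Etherlink algorithm; it assumes equal-sized flows, and both placement functions and the flow names are hypothetical. The first pins each flow to a path by hashing its identifier, as static ECMP-style hashing does. The second is an idealized balancer that always picks the least-loaded path.

```python
import zlib
from collections import Counter


def static_hash_placement(flow_ids, num_paths):
    """Pin each flow to one path by hashing its identifier (ECMP-style)."""
    load = Counter()
    for flow_id in flow_ids:
        load[zlib.crc32(flow_id.encode()) % num_paths] += 1
    return [load[path] for path in range(num_paths)]


def least_loaded_placement(flow_ids, num_paths):
    """Idealized dynamic balancing: each flow goes to the least-loaded path."""
    load = [0] * num_paths
    for _ in flow_ids:
        load[load.index(min(load))] += 1
    return load


# Hypothetical workload: 8 equal-sized elephant flows over 8 parallel paths.
flows = [f"gpu{src}->gpu{(src + 1) % 8}" for src in range(8)]
hashed = static_hash_placement(flows, 8)
balanced = least_loaded_placement(flows, 8)

print("hashed:  ", hashed, "busiest path carries", max(hashed))
print("balanced:", balanced, "busiest path carries", max(balanced))
```

With only a handful of large flows, hashing has little entropy to work with, so some paths can carry several flows while others sit idle. The idealized balancer never puts more than one flow more on any path than on another, which is the lower bound for the busiest path.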

Arista’s Etherlink leverages our close partnership with Broadcom for AI-optimized silicon using the latest Jericho and Tomahawk families delivered in a 5nm process geometry. This assures the highest performance at the lowest power draw. Power savings really matter and are a pain point in large data centers, especially when building massive-scale 400G or 800G networking infrastructure for AI clusters with thousands of GPUs. Every watt saved in the silicon, line cards, and chassis, as well as in associated accessories such as pluggable and linear-drive optics or cables, adds up.

AI for Networking Delivering Deep Insights

AI for Networking is achieved via our Arista EOS stack and AVA™ (Autonomous Virtual Assist) AI, which gains new insights using anonymized data from our global technical assistance center (TAC) database. Arista AVA imitates human expertise at cloud scale through an AI-based expert system that automates complex tasks like troubleshooting, root cause analysis, and protecting against cyber threats. It starts with real-time, ground-truth data about the network devices' state and, if required, the raw packets. AVA combines our vast expertise in networking with an ensemble of AI/ML techniques, including supervised and unsupervised ML and NLP (Natural Language Processing). Applying AVA to AI networking increases the fidelity and security of the network with autonomous network detection and response and real-time observability. Our industry-leading software quality, robust engineering development methodologies, and best-in-class TAC yield better insights and flexibility for our global customer base.

Our EOS software stack is unmatched in the industry, helping customers build resilient AI clusters with support for hitless upgrades that avoid downtime and thus maximize AI cluster utilization. EOS offers improved load balancing algorithms and hashing mechanisms that map traffic from ingress host ports to the uplinks so that flows are automatically re-balanced when a link fails. Our customers can now pick and choose packet header fields for better entropy and efficient load balancing of AI workloads. AI network visibility is another critical aspect in the training phase for large datasets used to improve the accuracy of LLMs. In addition to the EOS-based Latency Analyzer that monitors buffer utilization, Arista’s AI Analyzer monitors and reports traffic counters at microsecond-level windows. This is instrumental in detecting and addressing microbursts, which are difficult to catch at intervals of seconds.
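
The following sketch shows why window size matters for catching microbursts. It does not use any Arista API; the function, the trace, and the slot capacity are hypothetical. It computes the peak link utilization over the same traffic trace twice, once with a single coarse window and once with fine-grained windows.

```python
def peak_utilization(bytes_per_slot, slots_per_window, slot_capacity_bytes):
    """Return the highest utilization observed in any window of the trace."""
    peaks = []
    for start in range(0, len(bytes_per_slot), slots_per_window):
        window = bytes_per_slot[start:start + slots_per_window]
        peaks.append(sum(window) / (len(window) * slot_capacity_bytes))
    return max(peaks)


# Hypothetical trace: 1,000 time slots, each able to carry 100 bytes.
# The link is 10% busy except for one 10-slot burst at full line rate.
trace = [10] * 1000
trace[500:510] = [100] * 10

coarse = peak_utilization(trace, slots_per_window=1000, slot_capacity_bytes=100)
fine = peak_utilization(trace, slots_per_window=10, slot_capacity_bytes=100)

print(f"coarse window peak: {coarse:.1%}")  # 10.9%
print(f"fine window peak:   {fine:.1%}")    # 100.0%
```

Averaged over the whole trace, the link looks lightly loaded at about 11%, while the fine-grained windows reveal that it was saturated for the duration of the burst.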

At the Forefront of AI

Arista is delivering both optimal Networking for AI platforms and AI for Networking outcomes. AI Etherlink platforms deliver high-performance, low-latency, fully scheduled, lossless networking as the new unit of currency for AI networks. At the same time, AI for Networking drives positive outcomes such as security, root cause analysis, and observability through AVA.

At Arista, we are proud to be at the forefront of building the absolute best networking infrastructure for the largest AI clusters in the world and delivering high-fidelity business outcomes with AI/ML-assisted AVA. Generative AI has the potential to change our lives, from rapid detection of cancer and Alzheimer’s disease to reducing incidents of fraud in financial services and better detection of illegal drug transportation that threatens public safety. As Arista celebrates the early milestones of Ethernet-based AI networking, it is gratifying to witness so many real-world use cases and possibilities for improving humanity! Welcome to the new world of AI networking!

References:

*“Networking for AI and AI for Networking” is a phrase coined by Zeus Kerravala, Principal Analyst at ZK Research
Meta Blog public announcement
AI White Paper
AVA White Paper
ZKast Video