CloudVision: The First Decade
As I think about the evolution of the CloudVision® platform over the last 10 years, and our latest announcement today, I’m reminded of three...
Jayshree Ullal : Mar 25, 2024 9:00:00 AM
As we all recover from NVIDIA’s exhilarating GTC 2024 in San Jose last week, state-of-the-art AI news keeps arriving fast and furious. NVIDIA’s latest Blackwell GPU announcement and Meta’s blog validating Ethernet for its pair of 24,000-GPU clusters used to train its Llama 3 large language model (LLM) made the headlines. Networking has come a long way, accelerating pervasive compute, storage, and AI workloads for the next era of AI. Our large customers across every market segment, as well as the cloud and AI titans, recognize the rapid improvements in productivity and the unprecedented insights and knowledge that AI enables. At the heart of many of these AI clusters is the flagship Arista 7800R AI spine.
Robust Networking for AI
Activating these new AI use cases requires the LLMs to be trained first. These backend AI training clusters require a fundamentally new approach to building networks. Their massively parallelized workloads are characterized by elephant flows that can cause congestion throughout the network and lengthen job completion time (JCT), which is measured across the entire workload. Congestion in any single flow can ripple outward and slow down the entire AI cluster, because the workload must wait for that delayed transmission to complete. AI clusters must be architected with massive capacity to accommodate these traffic patterns from distributed GPUs, with deterministic latency and lossless deep-buffer fabrics designed to eliminate unwanted congestion.
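To make the ripple effect concrete, here is a minimal sketch in Python. It assumes a simplified model in which training proceeds in synchronized steps and each step ends only when its slowest flow completes. The function names and the flow times are hypothetical and chosen purely for illustration; they are not measurements from any real cluster.

```python
def step_completion_time(flow_times_ms):
    """A synchronized step finishes only when its slowest flow finishes."""
    return max(flow_times_ms)


def job_completion_time(steps):
    """JCT under this simplified model: the sum of per-step completion times."""
    return sum(step_completion_time(flows) for flows in steps)


# Hypothetical workload: 3 steps, 8 parallel flows per step, 10 ms per flow.
healthy = [[10.0] * 8 for _ in range(3)]

# Same workload, but a single flow in the second step is delayed by congestion.
congested = [flows[:] for flows in healthy]
congested[1][5] = 40.0

print(job_completion_time(healthy))    # 30.0
print(job_completion_time(congested))  # 60.0
```

In this toy model, one delayed flow out of 24 doubles the job completion time, which is why congestion on a single path matters far more than its share of the traffic would suggest.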
Arista’s Etherlink for Standards Compatibility
As the Ultra Ethernet Consortium (UEC) completes its extensions to improve Ethernet for AI workloads, Arista assures customers that we can offer UEC-compatible products, easily upgradable to the standards as UEC firms up in 2025. To quote Meta, “Through careful co-design of the network, software, and model architectures, we have successfully used both Ethernet/RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3) without any network bottlenecks.” This validates that lossless Ethernet can not only meet the rigorous baseline to host AI workloads but can also evolve to support the UEC open standards when they are available.
Arista Etherlink™ is standards-based Ethernet with UEC-compatible features. These include dynamic load balancing, congestion control, and reliable packet delivery to all NICs supporting RoCE. Arista Etherlink will be supported across a broad range of 800G systems and line cards based on Arista EOS®. As the UEC specification is finalized, Arista AI platforms will be upgradeable to be compliant.
Arista’s Etherlink platforms offer three important network characteristics:
Arista’s Etherlink leverages our close partnership with Broadcom for AI-optimized silicon using the latest Jericho and Tomahawk families delivered in a 5nm process geometry. This assures the highest performance at the lowest power draw. Power savings really matter: power is a pain point in large data centers, especially when building massive-scale 400G or 800G networking infrastructure for AI clusters with thousands of GPUs. Every watt saved per chip, line card, and chassis, as well as in the associated accessories, whether pluggable and linear-drive optics or cables, adds up.
AI for Networking Delivering Deep Insights
AI for Networking is achieved via our Arista EOS stack and AVA™ (Autonomous Virtual Assist) AI, which gains new insights using anonymized data from our global technical assistance center (TAC) database. Arista AVA imitates human expertise at cloud scale through an AI-based expert system that automates complex tasks like troubleshooting, root cause analysis, and protection against cyber threats. It starts with real-time, ground-truth data about the network devices' state and, if required, the raw packets. AVA combines our vast expertise in networking with an ensemble of AI/ML techniques, including supervised and unsupervised ML and NLP (Natural Language Processing). Applying AVA to AI networking increases the fidelity and security of the network with autonomous network detection and response and real-time observability. Our industry-leading software quality, robust engineering development methodologies, and best-in-class TAC yield better insights and flexibility for our global customer base.
Our EOS software stack is unmatched in the industry, helping customers build resilient AI clusters with support for hitless upgrades that avoid downtime and thus maximize AI cluster utilization. EOS offers improved load-balancing algorithms and hashing mechanisms that map traffic from ingress host ports to the uplinks so that flows are automatically rebalanced when a link fails. Our customers can now pick and choose packet header fields for better entropy and efficient load balancing of AI workloads. AI network visibility is another critical aspect in the training phase for large datasets used to improve the accuracy of LLMs. In addition to the EOS-based Latency Analyzer that monitors buffer utilization, Arista’s AI Analyzer monitors and reports traffic counters at microsecond-level windows. This is instrumental in detecting and addressing microbursts, which are difficult to catch at intervals of seconds.
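The point about measurement windows can be illustrated with a short Python sketch. It is a generic averaging exercise, not a description of how Arista's AI Analyzer works. The trace, the 400G link rate, the burst length, and the function name `peak_utilization` are all assumptions made for illustration.

```python
def peak_utilization(bytes_per_us, window_us, link_bytes_per_us):
    """Highest average utilization over consecutive windows of `window_us` microseconds."""
    peaks = []
    for start in range(0, len(bytes_per_us), window_us):
        window = bytes_per_us[start:start + window_us]
        peaks.append(sum(window) / (len(window) * link_bytes_per_us))
    return max(peaks)


# Hypothetical trace: one second of per-microsecond byte counts on a 400 Gb/s
# link, which carries 50,000 bytes per microsecond at line rate. The link is
# idle except for a single 200-microsecond burst at line rate.
LINK_BYTES_PER_US = 50_000
trace = [0] * 1_000_000
for t in range(500_000, 500_200):
    trace[t] = LINK_BYTES_PER_US

coarse = peak_utilization(trace, 1_000_000, LINK_BYTES_PER_US)  # one 1 s window
fine = peak_utilization(trace, 100, LINK_BYTES_PER_US)          # 100 us windows

print(f"1 s window peak utilization:    {coarse:.4%}")
print(f"100 us window peak utilization: {fine:.0%}")
```

Averaged over a full second, the burst registers as roughly 0.02% utilization and is effectively invisible, while a 100-microsecond window shows the link saturated at 100%.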
At the Forefront of AI
Arista is delivering both optimal Networking for AI platforms and AI for Networking outcomes. AI Etherlink platforms deliver high-performance, low-latency, fully scheduled, lossless networking as the new unit of currency for AI networks. At the same time, AI for Networking drives positive outcomes such as security, root cause analysis, and observability through AVA.
At Arista, we are proud to be at the forefront of building the absolute best networking infrastructure for the largest AI clusters in the world and delivering high-fidelity business outcomes with AI/ML-assisted AVA. Generative AI has the potential to change our lives, from rapid detection of cancer and Alzheimer’s disease to reducing incidents of fraud in financial services and better detection of illegal drug transportation that threatens public safety. As Arista celebrates the early milestones of Ethernet-based AI networking, it is gratifying to witness so many real-world use cases and possibilities for improving humanity! Welcome to the new world of AI networking!