The Arrival of AI Networking at Petascale

by Jayshree Ullal on March 1, 2023

The AI industry has taken us by storm, bringing supercomputers, algorithms, data processing and training methods into the mainstream. The rapid ramp of large language models for inference, combined with OpenAI's ChatGPT, has captured the interest and imagination of people worldwide. Generative AI applications promise benefits to just about every industry. New types of AI applications are expected to improve productivity on a wide range of tasks, be it marketing image creation for ads, video games or customer support. These generative large language models with over 100 billion parameters are advancing the power of AI applications and deployments. Furthermore, Moore's law keeps shrinking the silicon geometries of TPU/GPU processors, which now drive 100, 400 and soon 800 gigabits of network throughput, with parallel processing and bandwidth capacity to match.

Data and Compute Intensive AI Workloads

Not only are AI/ML applications a huge driver of compute today, but the silicon industry is also keeping up with the demand by churning out scalable processors. These could be CPUs, GPUs or TPUs optimized for workloads with parallel cores, or specialized processors optimized for tensor and matrix computations, with memory and I/O interfaces to match. A common characteristic of these workloads is that they are not only data-intensive but also compute-intensive. A typical AI workload involves a large sparse matrix computation, so large that the parameters of the matrix are distributed across hundreds or thousands of processors. Each processor performs intense computation for a period of time, then shares its "parameters" with the other processors involved in the computation. Once the data from all peers is received, it is reduced (or merged) with the local data, and another round of processing begins. This compute-exchange-reduce cycle increases the volume of data exchanged exponentially. A slowdown in a suboptimal network can critically impact application performance, creating inefficient wait states that idle away 30% or more of the processing time of expensive GPUs. A modern, scalable AI network is imperative.
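To make the compute-exchange-reduce cycle concrete, below is a minimal, self-contained Python sketch of the exchange step using the well-known ring all-reduce pattern; the post does not name a specific algorithm, so the choice of ring all-reduce is an illustrative assumption. Plain Python lists stand in for GPU memory, and list copies stand in for network transfers.

```python
def ring_allreduce(data):
    """Sum-reduce equal-length vectors across n simulated workers using
    the two-phase ring algorithm (reduce-scatter, then all-gather)."""
    n, m = len(data), len(data[0])
    bounds = [(c * m) // n for c in range(n + 1)]  # chunk boundaries

    # Phase 1: reduce-scatter. In step s, worker w sends chunk (w - s) % n
    # to its ring neighbor (w + 1) % n, which accumulates it locally.
    for s in range(n - 1):
        sends = [(w, (w - s) % n) for w in range(n)]
        payloads = [data[w][bounds[c]:bounds[c + 1]] for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            for i, x in enumerate(payload):
                data[(w + 1) % n][bounds[c] + i] += x

    # Worker w now holds the fully reduced chunk (w + 1) % n.
    # Phase 2: all-gather. Reduced chunks are forwarded around the ring
    # until every worker holds the complete reduced vector.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n) for w in range(n)]
        payloads = [data[w][bounds[c]:bounds[c + 1]] for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            data[(w + 1) % n][bounds[c]:bounds[c + 1]] = payload
    return data


print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
# -> three identical copies of [111, 222, 333]
```

A known property of this pattern is that each worker transmits roughly 2(n-1)/n of its parameter volume every iteration regardless of cluster size, which is why sustained wire-rate throughput, not just latency, dominates AI network design.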

Ethernet for AI Networking Scale

The mandate to avoid these idle states across massive processor density requires a specialized AI network with wire-rate delivery of large, synchronized bursts of data, improving performance at speeds of 400/800G. One must rethink the network to scale to many hundreds or thousands of racks of AI servers. High-performance, repetitive export and import of data is critical to these applications. In the past, this sort of performance existed only in the domain of specialized HPC networks such as InfiniBand. Today the combination of RDMA-capable Ethernet NICs and RoCE (RDMA over Converged Ethernet) allows Ethernet and IP to be used as the transport fabric without the overhead of the host TCP/IP stack. The advantage of Ethernet for AI networking is obvious: the economics of standards, a massive installed base, industry-wide interoperability and merchant silicon support, as shown in the AI network design guidelines below.

[Figure: AI network design guidelines]

Arista 7800 AI Spine

This level of performance for any AI workload is best delivered by the Arista 7800 as the premier AI spine. It provides an unmatched combination of high-bandwidth, lossless, high-radix fabric interconnecting hundreds to thousands of GPUs at speeds of 400/800G. Arista’s AI spine addresses key characteristics, including:

  • Hotspot-free load balancing to handle any size of elephant flow
  • Congestion-free flow control (PFC) from sender to receiver
  • Field-hardened explicit congestion notification (ECN) implementations proven at scale for RDMA systems (see the marking sketch below)
  • Exceptional buffering flexibility
  • High radix to support large AI fabrics (up to 576 400G ports, with 800G ahead)
  • Advanced quality of service and monitoring capabilities for flow control, load-balancing, latency, and memory usage

Arista AI spines bring a balanced combination of low power, predictable performance/latency and reliability characteristics for the most demanding AI workloads.
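As a rough illustration of how the PFC and ECN bullets above fit together, here is a hypothetical Python sketch of RED-style ECN marking of the kind used by DCQCN-like RDMA congestion control schemes. The threshold values and function name are assumptions for illustration, not Arista EOS defaults.

```python
import random

# Illustrative RED/ECN-style marking: as the egress queue deepens, packets
# are marked with rising probability so that RoCE senders slow down before
# the queue overflows and PFC pause frames become necessary.
K_MIN = 100 * 1024   # bytes; below this queue depth, never mark (assumed)
K_MAX = 400 * 1024   # bytes; above this queue depth, always mark (assumed)
P_MAX = 0.2          # marking probability as depth approaches K_MAX

def should_mark_ecn(queue_depth_bytes: int) -> bool:
    """Decide whether the current packet gets an ECN congestion mark."""
    if queue_depth_bytes <= K_MIN:
        return False
    if queue_depth_bytes >= K_MAX:
        return True
    # Linear ramp between the two thresholds, as in classic RED.
    p = P_MAX * (queue_depth_bytes - K_MIN) / (K_MAX - K_MIN)
    return random.random() < p
```

In this division of labor, ECN gives RDMA senders early end-to-end feedback, while PFC acts as the hop-by-hop backstop that keeps the fabric lossless.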

Arista EOS for AI Networking

The Arista 7800 AI spine is based on the flagship software stack, Arista EOS, which is critical to handling these enormous workloads. We deliver our advanced customers optimal AI network assurance for their mission-critical workloads. By infusing AI properties into the programmable EOS, we can construct a reliable AI network for automation, visibility, resilience and dynamic controls. Examples of customizable dimensions include:

  • Dynamic load balancing designed to handle different GPU topologies and traffic conditions
  • AI Analyzer to monitor traffic counters in microsecond-level time windows and catch the microbursts caused by the synchronized nature of AI/ML traffic flows (see the sketch after this list)
  • End-to-end congestion control to manage RDMA congestion with PFC/ECN
  • Differentiated Quality of Service (QoS) to prioritize control traffic and isolate it from RDMA traffic in a separate queue
  • Advanced application and network monitoring capabilities such as watchdog, counters and latency analyzers
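To illustrate the microsecond-level monitoring idea from the AI Analyzer bullet, here is a hypothetical Python sketch that flags microbursts from interface byte counters sampled at a fixed interval. The sampling period, line rate, threshold and function name are assumptions for illustration; this is not an Arista EOS API.

```python
SAMPLE_INTERVAL_US = 100      # assumed counter sampling period, microseconds
LINE_RATE_BPS = 400e9         # assumed 400G interface

def detect_microbursts(byte_samples, threshold=0.9):
    """Yield (sample_index, utilization) for every sampling interval whose
    link utilization exceeds the given fraction of line rate."""
    capacity_bits = LINE_RATE_BPS * SAMPLE_INTERVAL_US * 1e-6
    prev = None
    for i, counter in enumerate(byte_samples):
        if prev is not None:
            utilization = (counter - prev) * 8 / capacity_bits
            if utilization > threshold:
                yield i, utilization
        prev = counter

# Example: a burst hiding in the fourth sample of an otherwise quiet link.
samples = [0, 100_000, 200_000, 4_900_000, 5_000_000]
print(list(detect_microbursts(samples)))   # -> [(3, 0.94)]
```

Averaged over a one-second window, the same link would look almost idle, which is exactly why coarse per-second counters miss the bursts that trigger PFC pauses and GPU stalls.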

AI Networking at an Inflection Point

It is an exciting time at Arista as we look forward to helping our customers with their AI networking strategies. We deliver high-scale bandwidth capacity with predictable workload performance for cloud networking. With Arista AI platforms, we continue to deliver the best combination of Ethernet versatility and IP protocol capabilities at petascale, with an unmatched, congestion-free, lossless fabric for our customers’ AI strategies. The exponential growth of AI workloads and of distributed AI processing traffic is placing explosive demands on the network. Welcome to the new wave of petascale AI networking!

Opinions expressed here are the personal opinions of the original authors, not of Arista Networks. The content is provided for informational purposes only and is not meant to be an endorsement or representation by Arista Networks or any other party.
