Recently I attended the 50th golden anniversary of Ethernet at the Computer History Museum. It was a reminder of how familiar and widely deployed Ethernet is and how it has evolved by orders of magnitude. Since the 1970s, it has progressed from a shared collision network running at 2.94 megabits per second in the file/print/share era to the promise of Terabit Ethernet switching in the AI/ML era. Legacy Ethernot* alternatives such as Token Ring, FDDI, and ATM have generally been subsumed by Ethernet. I believe history is going to repeat itself for AI networks.
Intense Networking for AI Data Exchange
AI workloads are demanding on networks because they are both data-intensive and compute-intensive. The workloads are so large that their parameters are distributed across thousands of processors. Large Language Models (LLMs) such as GPT-3, Chinchilla, and PaLM, as well as recommendation systems like DLRM and DHEN, are trained on clusters of many thousands of GPUs, each sharing its "parameters" with the other processors involved in the computation. In this compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown caused by a poor or congested network can critically impact the performance of the AI application.
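To make the compute-exchange-reduce cycle concrete, here is a minimal Python sketch of one data-parallel training step. It is a toy model, not how any particular framework implements it: the function names, the squared-error loss, and the learning rate are all assumptions chosen for illustration, and the "exchange" is an in-memory list operation rather than real network traffic.

```python
def all_reduce_mean(local_grads):
    """Exchange + reduce: every worker receives the element-wise mean gradient."""
    n = len(local_grads)
    mean = [sum(column) / n for column in zip(*local_grads)]
    return [list(mean) for _ in range(n)]


def training_step(params, shards, lr=0.1):
    """One compute-exchange-reduce cycle over hypothetical workers.

    params: the shared parameter vector (identical on every worker)
    shards: one data vector per worker (toy stand-in for a data shard)
    """
    # Compute: each worker derives a gradient from its own shard.
    # Toy loss 0.5 * (p - x)^2 per element, so the gradient is (p - x).
    local_grads = [[p - x for p, x in zip(params, shard)] for shard in shards]

    # Exchange + reduce: this is the step that crosses the network.
    reduced = all_reduce_mean(local_grads)

    # Update: every worker applies the same averaged gradient.
    return [[p - lr * g for p, g in zip(params, grads)] for grads in reduced]


replicas = training_step([1.0, 2.0], [[0.0, 0.0], [2.0, 4.0]])
print(replicas)
```

No worker can apply its update until the exchange finishes for every worker, which is why a single slow path stalls the whole step.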
Ultra Ethernet Consortium
As AI applications drive inference and training across massive numbers of processors (GPUs/CPUs/TPUs), one must reimagine the high-speed transit of mission-critical workloads. The goal is wire-rate delivery of large synchronized bursts of data over a familiar, standards-based Ethernet network.
Arista and the other founding members of the Ultra Ethernet Consortium (UEC) have set out on a mission to enhance the capabilities of Ethernet for AI and HPC. Ethernet brings the economics of wide deployment, familiarity with tools, and merchant silicon that tracks Moore's law and silicon geometries. UEC, together with the proven IEEE Ethernet standards, will advance Ethernet across many L2/L3, optical, and physical layers.
AI at Scale Needs Ethernet at Scale
As AI jobs grow, the underlying Ethernet network needs to be designed for high speed and scale to increase the job completion rate. UEC is endorsing three improvements: packet spraying, flexible ordering, and modern congestion control.
Time for a RDMA Reboot
AI workloads cannot tolerate delays; a job can complete only after all of its flows are successfully delivered. It takes only one worst-case link to throttle an entire AI workload. To predictably transfer vast amounts of data, AI networks need a transport protocol that, like TCP, works "out of the box." Arista and the Ultra Ethernet Consortium's founding members believe it is time to revisit RDMA (Remote Direct Memory Access) and address its limitations. Traditional RDMA, as defined by the InfiniBand Trade Association (IBTA) decades ago, is showing its age under highly demanding AI/ML network traffic. RDMA transmits data as a small number of large flows, and these large flows can leave links unbalanced and overburdened.
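The point that one overloaded link gates the whole job can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of UET or of any switch: flows are pinned to links with a CRC32 hash of a hypothetical flow ID (a stand-in for per-flow hashing), spraying is modeled as a perfectly even split, and congestion control, retransmission, and latency are ignored. All function names are illustrative.

```python
import zlib


def pin_flows_to_links(flow_sizes_gbit, num_links):
    """Place each whole flow on one link chosen by a hash of its flow ID."""
    loads = [0.0] * num_links
    for flow_id, size in enumerate(flow_sizes_gbit):
        link = zlib.crc32(str(flow_id).encode()) % num_links
        loads[link] += size
    return loads


def spray_packets(flow_sizes_gbit, num_links):
    """Idealized spraying: all traffic is split evenly across every link."""
    share = sum(flow_sizes_gbit) / num_links
    return [share] * num_links


def job_completion_seconds(link_loads_gbit, link_rate_gbps):
    """The job finishes only when the most heavily loaded link has drained."""
    return max(link_loads_gbit) / link_rate_gbps


flows = [80.0] * 8  # eight 80-gigabit flows (assumed sizes)
pinned = pin_flows_to_links(flows, num_links=4)
sprayed = spray_packets(flows, num_links=4)

print("per-link load, pinned :", pinned)
print("per-link load, sprayed:", sprayed)
print("completion, pinned :", job_completion_seconds(pinned, 400.0))
print("completion, sprayed:", job_completion_seconds(sprayed, 400.0))
```

Because completion time is set by the maximum link load, the even split can never be slower than pinning whole flows in this model; whenever the hash places more than its share of flows on one link, the pinned case is strictly slower.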
It is time to begin with a clean slate to build a modern transport protocol supporting RDMA for emerging applications. The UET (Ultra Ethernet Transport) protocol will incorporate the advantages of Ethernet/IP while addressing AI network scale for applications, endpoints and processes, and maintaining the goal of open standards and multi-vendor interoperability.
Summary: Ethernet Rises to the Occasion
Generative AI applications are pushing the envelope of networking scale, akin to using all highway lanes simultaneously and efficiently. Once again, Ethernet will ultimately emerge as the winner in networking for AI. Together with IP, Ethernet will drive numerous use cases for AI training and inference. Scalable and efficient mechanisms built on packet spraying, flexible ordering, and modern congestion control algorithms will be infused into Ethernet and IP networks for AI. Welcome to the new decade of AI Networking! I welcome your views at feedback@arista.com.
*"Ethernot" is a term coined by Bob Metcalfe, one of the original pioneers of Ethernet.