Powering All Ethernet AI Networking
4 min read
Vijay Vusirikala and John Peach
Jun 12, 2025 6:00:00 AM
Artificial Intelligence (AI), powered by accelerated processing units (XPUs) like GPUs and TPUs, is transforming industries. The network interconnecting these processors is crucial for efficient and successful AI deployments. AI workloads, involving intensive training and rapid inferencing, require very high-bandwidth interconnects with low, consistent latency and the highest reliability to maximize XPU utilization and reduce AI job completion time (JCT). A best-of-breed network with AI-specific optimizations is critical for delivering AI applications, since any JCT slowdown translates directly into lost revenue. Typical AI workloads comprise relatively few very high-bandwidth, low-entropy flows that run for extended periods and exchange large messages synchronously, necessitating advanced lossless forwarding and specialized operational tools. They differ from cloud networking traffic as summarized below:
Figure 1: Comparison of AI workloads with traditional cloud networking
AI Centers: Building Optimal AI Network Designs
With 30-50% of processing time spent exchanging data over networks, the economic impact of network performance in AI clusters is significant. Network bottlenecks lead to idle cycles on XPUs, wasting both the capital investment in processing and operational expenses on power and cooling. An optimal network is therefore critical to the function of an AI Center.
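As a back-of-envelope illustration of that economic impact, the short Python sketch below estimates the daily cost of network-induced XPU idle time. Every figure in it (cluster size, hourly XPU cost, stall fraction) is a hypothetical assumption chosen for illustration, not Arista data; only the 30-50% communication share comes from the paragraph above.

```python
# Back-of-envelope estimate of the cost of network-induced XPU idle time.
# All constants are illustrative assumptions, not measured or vendor data.

num_xpus = 1024                # hypothetical cluster size
cost_per_xpu_hour = 3.00       # assumed blended capex + opex, USD per XPU-hour
comm_fraction = 0.40           # midpoint of the 30-50% time spent communicating
stall_fraction = 0.25          # assumed share of comm time lost to congestion

idle_xpu_hours_per_day = num_xpus * 24 * comm_fraction * stall_fraction
print(f"Idle XPU-hours/day: {idle_xpu_hours_per_day:,.0f}")
print(f"Estimated waste:    ${idle_xpu_hours_per_day * cost_per_xpu_hour:,.0f}/day")
```

Even at these modest assumptions, network stalls waste thousands of dollars per cluster per day, which is why removing network bottlenecks pays for itself quickly.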
AI Centers consist of scale-out and scale-up network architectures. Scale-out networks are further divided into front-end and back-end networks.
Figure 2: AI Centers are built on Scale-Up and Scale-Out Networks
Arista champions open, standards-based networks (as defined by the Ultra Ethernet Consortium) as the foundation of the universal high-performance AI center, leveraging the vast Ethernet ecosystem's benefits: diverse platform choices, cost-effectiveness, rapid innovation, a large talent pool, mature manageability, power-efficient hardware, a proven software stack, and investment protection.
Arista’s solutions address the entire AI data path, from scale-up interconnects within server racks, to scale-out front-end and back-end networks, to data center interconnects across a campus or wide-area region, all managed by Arista’s flagship Extensible Operating System (EOS®) and management plane (CloudVision®).
Arista provides a best-of-breed choice of ultra-high-performance, market-leading Ethernet switches optimized for scale-out AI networking. Arista caters to all sizes, from easy-to-deploy 1-box solutions that scale from tens of accelerators to over a thousand, to efficient 2-tier and 3-tier networks for hundreds of thousands of hosts, as shown in Figure 3.
Figure 3: Compelling Arista solutions for scale-out networking
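To make those scale-out numbers concrete, here is a short sketch using standard folded-Clos arithmetic to compute the maximum non-blocking host count of 2-tier and 3-tier fabrics built from fixed-radix switches. The radix values are assumptions for illustration; actual Etherlink platforms span fixed and modular systems with a range of port counts and speeds.

```python
# Maximum non-blocking host counts for folded-Clos fabrics built from
# switches of a given radix (port count). Standard Clos arithmetic:
#   2-tier: each leaf splits ports evenly between hosts and spines,
#           so hosts = (radix/2) host ports per leaf * radix leaves.
#   3-tier: the classic result is radix^3 / 4 hosts.

def clos_hosts(radix: int, tiers: int) -> int:
    if tiers == 2:
        return (radix // 2) * radix
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("this sketch covers 2- and 3-tier fabrics only")

for radix in (64, 512):                      # assumed switch radices
    print(f"radix {radix:>3}: 2-tier {clos_hosts(radix, 2):>9,} hosts, "
          f"3-tier {clos_hosts(radix, 3):>13,} hosts")
```

A 512-port modular switch already supports over 130,000 hosts in just two tiers, which is how 2-tier and 3-tier designs reach the hundreds-of-thousands-of-hosts scale described above.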
Three Etherlink™ product families and over 20 products deliver choices of form factors and deployment models, and drive many of the largest and most sophisticated cloud/AI-titan and enterprise AI networks today. These products are also compatible with Ultra Ethernet Consortium (UEC) networks. Current systems are based on low-power 5nm silicon technology and support Linear Pluggable Optics (LPO) and extended-reach DAC cables to reduce power and lower cost.
Advent of Scale-Up AI Ethernet Fabrics
While Arista’s Etherlink scale-out networks connect large-scale servers, scale-up fabrics address the ultra-high-speed and low-latency interconnect system within a single server or rack-scale system, connecting accelerators directly. This is critical for efficient memory-semantic communication and coordinated computing across multiple accelerator units within a tightly coupled environment, as shown in Figure 4 below.
Figure 4: Ethernet-based scale-up connectivity
Key requirements for scale-up networks include very high bandwidth (8-10x the per-GPU bandwidth of the back-end scale-out network), lossless operation, fine-grained flow control, high bandwidth efficiency, and ultra-low latency. These features optimize inter-XPU communication, enabling shared memory access across multiple XPUs. This architecture supports latency-sensitive parallelism techniques, including data, tensor, and expert parallelism, across these XPUs. Key advancements are being developed to enhance Ethernet for scale-up applications, notably Link Layer Retry (LLR), which recovers from link errors in hardware, and Credit-Based Flow Control (CBFC), which prevents receive-buffer overruns to keep the fabric lossless as it scales.
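To make the CBFC idea concrete, here is a minimal Python model of credit-based flow-control semantics: the sender may transmit only while it holds credits granted by the receiver, so the receive buffer can never overflow and the link stays lossless. This is a conceptual sketch of the mechanism under simple assumptions, not the wire protocol being standardized for scale-up Ethernet.

```python
# Minimal model of credit-based flow control (CBFC): one credit per free
# receive-buffer slot. With no credits, the sender back-pressures instead
# of transmitting, so frames are never dropped at the receiver.

from collections import deque

class CreditedLink:
    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots   # receiver grants one credit per slot
        self.rx_queue = deque()

    def send(self, frame: str) -> bool:
        if self.credits == 0:
            return False                 # back-pressure: hold frame at sender
        self.credits -= 1
        self.rx_queue.append(frame)
        return True

    def receiver_drain(self) -> None:
        # Receiver consumes a frame and returns its credit to the sender.
        self.rx_queue.popleft()
        self.credits += 1

link = CreditedLink(rx_buffer_slots=2)
assert link.send("f0") and link.send("f1")
assert not link.send("f2")               # out of credits: sender must wait
link.receiver_drain()                    # a credit flows back
assert link.send("f2")                   # transmission resumes, losslessly
```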
Accelerating AI Centers with Agentic AI
Generative and agentic AI are pushing the envelope of networking for AI. Arista is at the forefront of Ethernet solutions for scale-up (which has historically been proprietary) and scale-out interconnects, delivering on the need for simpler transport, low latency, highest reliability, and reduced software overhead. This evolution promises an open, interoperable, and unified fabric future for all segments of AI networking infrastructure.
Emerging AI applications also need a robust AI network. Arista's EOS and CloudVision provide the network software intelligence and incorporate specific features optimized for AI workloads. Arista’s Network Data Lake (NetDL™) is a centralized repository ingesting high-fidelity telemetry from Arista platforms, third-party systems, server NICs, and AI job schedulers. NetDL forms the foundation for AI-driven network automation and optimization. Key capabilities of the Arista software suite for AI networks include:
Advanced Load Balancing: EOS offers Dynamic Load Balancing (DLB), which considers real-time link load; RDMA-Aware Load Balancing, which uses Queue Pairs for better entropy (illustrated in the first sketch after this list); and Cluster Load Balancing (CLB), a global RDMA-aware solution purpose-built to identify collective communications and optimize flow placement and tail latency.
Robust Congestion Management: EOS implements Data Center Quantized Congestion Notification (DCQCN) with Explicit Congestion Notification (ECN) marking (queue-length and latency-based) and Priority Flow Control (PFC) with RDMA-Aware QoS to ensure lossless RoCEv2 environments (the DCQCN rate update is sketched after this list).
AI Job Observability: Correlates AI job metrics with granular, real-time network telemetry for an end-to-end view, anomaly detection, and accelerated troubleshooting.
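To see why queue-pair awareness improves entropy, consider the hypothetical sketch below: several RoCEv2 elephant flows between one host pair all collapse onto a single uplink under plain source/destination hashing, but folding the destination queue pair (carried in the RoCEv2 BTH header) into the hash key spreads them across uplinks. The hash function and field handling here are illustrative, not EOS internals.

```python
# Illustration of hashing entropy: identical endpoint addresses map every
# flow to one uplink, while adding the RDMA queue pair spreads the load.

import zlib

UPLINKS = 8

def pick_uplink(src_ip: str, dst_ip: str, dst_qp: int | None = None) -> int:
    key = f"{src_ip}|{dst_ip}"
    if dst_qp is not None:
        key += f"|{dst_qp}"              # RDMA-aware: fold in the queue pair
    return zlib.crc32(key.encode()) % UPLINKS

flows = [("10.0.0.1", "10.0.0.2", qp) for qp in range(8)]
print("address-only hash:", {pick_uplink(s, d) for s, d, _ in flows})      # one uplink
print("QP-aware hash:    ", {pick_uplink(s, d, qp) for s, d, qp in flows}) # spread out
```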
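And to ground the congestion-management item, this sketch implements the sender-side rate update from the published DCQCN algorithm (Zhu et al., SIGCOMM 2015): a Congestion Notification Packet (CNP), triggered by ECN marks, cuts the sending rate in proportion to the congestion estimate alpha, and the rate recovers toward its target between CNPs. The constants are illustrative, and this is a sketch of the published algorithm, not EOS's implementation.

```python
# Sender-side DCQCN rate update (Zhu et al., SIGCOMM 2015), simplified to
# the multiplicative cut on CNP and the timer-driven recovery step.

G = 1 / 256                         # alpha gain, a typical value from the paper

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps    # current sending rate
        self.rt = line_rate_gbps    # target rate to recover toward
        self.alpha = 1.0            # congestion estimate in (0, 1]

    def on_cnp(self) -> None:
        # CNP received: remember the rate, cut proportionally to alpha.
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - G) * self.alpha + G

    def on_recovery_timer(self) -> None:
        # No CNP this period: decay alpha, recover halfway toward target.
        self.alpha = (1 - G) * self.alpha
        self.rc = (self.rt + self.rc) / 2

s = DcqcnSender(line_rate_gbps=400.0)
s.on_cnp()
print(f"after CNP:      {s.rc:.1f} Gbps")   # 200.0: full-alpha halving
for _ in range(5):
    s.on_recovery_timer()
print(f"after recovery: {s.rc:.1f} Gbps")   # climbing back toward 400
```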
Powering AI and Data Centers
The evolution of AI interconnects is clear and trending towards open, Ethernet-based solutions. Organizations prefer open, standards-based architectures, and Ethernet-based solutions offer continuous evolution in the pursuit of higher performance. A unified architecture, from cluster to client, with rich telemetry maximizes application performance, data security, and end-user experience while optimizing capital and operational costs through right-sized, reusable infrastructure and protecting investment with the flexibility to adapt to emerging technologies. Welcome to the new era of All Ethernet AI Networking!