Demystifying Ultra Ethernet

The Ultra Ethernet Consortium (UEC), of which Arista is a founding member, is a standards organisation established to enhance Ethernet for the demanding requirements of Artificial Intelligence (AI) and High-Performance Computing (HPC). Over 100 member companies and 1000 participants have collaborated to evolve Ethernet, leading to the recent publication of its 1.0 specification, which will drive hardware implementations that significantly boost cluster performance.

Fig.1 UEC Goals and Founding Members

In this blog, we will take a look at the need for Ultra Ethernet and the new capabilities it delivers.

Historically, AI/ML clusters have been specialist, independent technology islands. As AI/ML has become business-critical, there is a need for a common technology paradigm that integrates with existing enterprise fiscal, operational, and security frameworks. Ethernet and IP have a proven history of adapting over 50 years, and advanced Ethernet networking solutions, such as Arista's Etherlink™ portfolio, are already the chosen interconnect for the majority of AI accelerators (XPUs).

A central element of the UEC's vision is to take Ethernet performance to the next level by reimagining Remote Direct Memory Access (RDMA) as a native Ethernet application. RDMA is vital for the success of both AI and HPC applications, as it enables systems and processors to directly exchange data at high speed, currently 400 Gbps, with 800 Gbps in the near future. This efficient communication facilitates the distribution of workloads across numerous servers and processors, supporting parallel computation across many thousands of accelerators.

RDMA entails high flow rates and synchronized, large-volume flows that pose challenges for unoptimized Ethernet networks. Without advanced switching features, large flows create hashing nightmares, requiring almost perfect traffic distribution to prevent congestion. The rapid startup and termination of RDMA flows give traditional congestion control algorithms little time to react. While enhancements like Arista's Etherlink already substantially improve performance beyond alternative proprietary approaches, the next level of universal optimization necessitates a rethinking of how applications interact with the network.
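To put the "little time to react" point in numbers, the short calculation below estimates how long a message occupies a 400 Gbps link and expresses that as a count of network round trips. The 10-microsecond round-trip time and the message sizes are illustrative assumptions, not figures from the UEC specification.

```python
def transfer_time_us(message_bytes: int, link_gbps: float) -> float:
    """Time to serialize a message onto a link, in microseconds."""
    bits = message_bytes * 8
    return bits / (link_gbps * 1e9) * 1e6


# Assumption: an illustrative fabric round-trip time, chosen only for this sketch.
ASSUMED_RTT_US = 10.0

if __name__ == "__main__":
    for size in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
        t = transfer_time_us(size, 400)
        print(f"{size:>10} bytes: {t:9.2f} us, about {t / ASSUMED_RTT_US:6.2f} RTTs")
```

Under these assumptions, a 1 MiB message is on the wire for roughly 21 microseconds, or about two round trips, so a control loop that needs several round trips of feedback may only converge after the flow has already finished.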

This is where Ultra Ethernet Transport (UET) comes in, designed to make RDMA a native Ethernet application by incorporating new traffic distribution semantics and modern congestion control on top of standard Ethernet and IP layers. UET aims to meet the demands of contemporary and traditional HPC workloads without requiring proprietary infrastructure.

Fig.2 UET Packet Format

Key Aspects of Ultra Ethernet Transport (UET)

UET addresses the limitations of traditional RDMA networking from several angles to provide a comprehensive new transport paradigm for both HPC and AI/ML workloads. We’ll take a look at some of the innovations below:

Table 1: Key Benefits of UET

Traditional RDMA | Ultra Ethernet
RDMA tunneled over Ethernet | Closely coupled API and transport
Single cluster scaling in tens of thousands | Designed for scaling over 1M endpoints
No native security implementation | Native highly scalable group-based encryption
Requires in-order delivery | Native support for out-of-order packet delivery
Multi-pathing at flow level | Per-packet multipathing (spraying)
Inefficient go-back-N loss recovery | Per-packet loss recovery
Coarse congestion management and recovery | Fine-grained sender and receiver based congestion control
Inflexible network tuning paradigm | Semantic-level configuration of workload tuning

Native Libraries: To achieve maximum performance, UET effectively implements a native transport layer for the ubiquitous libfabric 2.0 API. For many applications, the transition to UET is straightforward, requiring minimal or no application changes.

Optimized Traffic Forwarding: A fundamental concept of UET is the evolution from traditional flow-based traffic distribution to source-based packet spraying. Unlike proprietary solutions, UET is built from the ground up for packet spraying for all message types, ensuring optimal efficiency at every layer.
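The difference between flow-based distribution and packet spraying can be illustrated with a toy model. The sketch below is not UET code: the function names are hypothetical, a seeded random choice stands in for a switch's 5-tuple hash, and spraying is modelled as simple round-robin across equal-cost links.

```python
import random


def flow_hash_load(flows, num_links, seed=0):
    """Per-flow distribution: all packets of a flow follow one hashed link."""
    rng = random.Random(seed)
    load = [0] * num_links
    for packets in flows:
        link = rng.randrange(num_links)  # stand-in for a 5-tuple hash
        load[link] += packets
    return load


def spray_load(flows, num_links):
    """Per-packet spraying: packets are spread round-robin over all links."""
    load = [0] * num_links
    next_link = 0
    for packets in flows:
        for _ in range(packets):
            load[next_link] += 1
            next_link = (next_link + 1) % num_links
    return load


if __name__ == "__main__":
    flows = [10_000] * 8  # eight large, equal-sized flows (assumed workload)
    print("flow hashing:", flow_hash_load(flows, num_links=8))
    print("spraying:    ", spray_load(flows, num_links=8))
```

With a handful of large flows, hashing is likely to place two or more of them on the same link while other links sit idle, whereas spraying keeps every link within one packet of the average. The trade-off is that sprayed packets can arrive out of order, which is why a transport built for spraying needs native out-of-order delivery support.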

Advanced Connection and Congestion Management: Traditional methods of setting up new connections (e.g., a 3-way handshake) are time- and resource-intensive. Congestion algorithms are optimized for general traffic patterns, and recovering from packet loss triggers inefficient "go-back-N" operations that require many packets to be resent, impacting the sender, the receiver, and the network itself. UET provides significant optimization for all of these cases, including:

  • Ephemeral Connections: Enable fast connection startup, eliminating the round-trip handshake delay before data begins to flow.
  • Selective Retransmission: Enables retransmission of individual lost packets, reducing the network-wide cost of a dropped packet from a full round trip's worth of resent data to a single packet.
  • Packet Trimming: Efficiently notifies both receiver and sender of packet loss and congestion, allowing rapid mitigation and recovery.
  • Network Signal Congestion Control (NSCC): Sender-based algorithm that paces transmission rates upon detecting congestion.
  • Receiver Credit Congestion Control (RCCC): Receiver-based mechanism to manage "incast" scenarios by controlling sender traffic rates.
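The cost difference between go-back-N and selective retransmission can be sketched by counting resent packets. The model below is a simplification with hypothetical function names: it assumes the sender always has a full window in flight when a loss is detected and that retransmissions themselves are never lost.

```python
def go_back_n_resends(total_packets, lost, window):
    """Packets resent when each loss forces a restart from the lost packet.

    Assumes a full window is in flight when the loss is detected, so the
    lost packet and everything sent after it in that window are resent.
    """
    resent = 0
    covered_until = -1  # highest sequence number already resent
    for seq in sorted(lost):
        if seq <= covered_until:
            continue  # already resent as part of an earlier go-back
        end = min(seq + window, total_packets)  # exclusive upper bound
        resent += end - seq
        covered_until = end - 1
    return resent


def selective_resends(lost):
    """Packets resent when only the lost packets themselves are repeated."""
    return len(set(lost))


if __name__ == "__main__":
    lost = {100, 500}
    print("go-back-N:", go_back_n_resends(1000, lost, window=64))
    print("selective:", selective_resends(lost))
```

In this model, two isolated losses with a 64-packet window cost 128 resent packets under go-back-N and 2 under selective retransmission. The gap grows with the window size, which in turn grows with link speed and round-trip time.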

Security: Given the value of AI models and intellectual property, security of data in-flight is mandatory, especially in multi-tenant environments. UET treats security as a fundamental objective, offering optional end-to-end encryption and authentication based on an advanced group keying scheme that allows all members of a job (e.g., all XPUs for one tenant) to operate in an encrypted bubble, protecting model data from exposure and preventing data injection or exfiltration by other tenants on the network.
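The idea of a per-job "bubble" can be illustrated with a deliberately simplified sketch. It covers authentication only (no encryption or replay protection), uses Python's standard hmac and hashlib modules, and is not the UET key derivation or packet format; the function names and the job identifiers are hypothetical.

```python
import hashlib
import hmac

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def derive_job_key(master_secret: bytes, job_id: str) -> bytes:
    """Derive one shared key per job (hypothetical derivation for this sketch)."""
    return hmac.new(master_secret, b"job:" + job_id.encode(), hashlib.sha256).digest()


def tag_packet(job_key: bytes, payload: bytes) -> bytes:
    """Append an authentication tag computed with the job's group key."""
    return payload + hmac.new(job_key, payload, hashlib.sha256).digest()


def verify_packet(job_key: bytes, packet: bytes):
    """Return the payload if the tag matches this job's key, else None."""
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(job_key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


if __name__ == "__main__":
    master = b"\x01" * 32  # placeholder secret for the demo only
    key_a = derive_job_key(master, "tenant-a-job")
    key_b = derive_job_key(master, "tenant-b-job")
    packet = tag_packet(key_a, b"gradient shard")
    print(verify_packet(key_a, packet))  # accepted by members of the same job
    print(verify_packet(key_b, packet))  # rejected by a different job's key
```

Every member of a job holds the same derived key, so any member can verify traffic from any other member without pairwise key exchange, while packets injected by another tenant fail verification. A real deployment would add encryption and replay protection on top of this.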

In summary, the UEC specification modernises the relationship between AI/HPC applications and networks. By tightly integrating application semantics with network behaviours, it creates a native transport mechanism that combines the strengths of RDMA with best-in-class Ethernet solutions, forming a powerful foundation for the next generation of applications.

Fig.3 Arista’s Etherlink Portfolio

Arista, as the leading provider of advanced Ethernet solutions for AI/ML clusters and a founding member of the UEC, is committed to this vision. Our current Etherlink portfolio is already UET-ready, and we continue to develop future systems and collaborate with other pioneers to build optimal Ethernet networks for high-performance computing. We look forward to cementing the leadership of Ethernet as a universal interconnect. For more details on UET, please see the Demystifying Ultra Ethernet whitepaper listed in the references below.

References:

Demystifying Ultra Ethernet Whitepaper

The Ultra Ethernet Consortium Launches Specification 1.0

Ultra Ethernet Consortium

Ultra Ethernet Specifications

Ultra Ethernet Whitepaper

Arista 800G Portfolio

AI Networking Center

Arista Blog Site

 
