
The Sun Rises on Scale-Up Ethernet

Co-Authored by Hugh Holbrook, Chief Development Officer

As the demands of AI and cloud networking push data center infrastructure to its limits, operators need networks that are not only high-performing and extremely reliable but also adaptable to the latest advancements in power, thermal management, and physical connectivity for dense clusters of tightly coupled AI accelerators. The explosive growth and rapid advancement of large language models (LLMs) have introduced complex training and inference workloads that generate massive, synchronized “scale-up” communication among hundreds to thousands of accelerators. This creates a need for tightly integrated scale-up networks that provide extremely high-bandwidth, low-latency connectivity. At the same time, it is critical to embrace an open ecosystem that offers system, accelerator, and data center designers the flexibility to choose a transport layer optimized for their deployment and application.

Today there are many options, including PCIe, CXL, and NVLink, but each creates a disparate island for compute I/O, and none is optimized for the open, interoperable needs of scale-up. The industry needs an ultra-high-speed, low-latency interconnect fabric that allows AI processing units or accelerators (XPUs), within one or more racks, to function as a unified compute system, while preserving the benefits of an open, standards-based solution. Once again, Ethernet is expected to be the consistent winner and equalizer for scale-up networking, just as it is today for scale-out and scale-across. Several common characteristics of a scale-up network enable specific optimizations:

  • Highest bandwidth and lowest latency within a cluster of hundreds to thousands of XPUs
  • Single-hop topology
  • Reliable, in-order delivery

These characteristics allow the network and transport layers to be optimized, resulting in smaller headers and a simpler protocol that enable unified, low-overhead memory access among XPUs and support many forms of collectives.
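
To make the collectives point concrete, here is a minimal, hypothetical simulation of a ring all-reduce, the kind of synchronized collective a scale-up fabric must carry. It is not ESUN code; it only models the neighbor-to-neighbor data movement, which maps naturally onto a single-hop, in-order fabric.

```python
def ring_allreduce(vectors):
    """Sum-all-reduce across n simulated XPUs connected in a ring.

    Each of the n ranks holds a vector of n chunks (scalars here for
    simplicity). Data moves only between ring neighbors, the traffic
    pattern a scale-up network is built to serve.
    """
    n = len(vectors)
    chunks = [list(v) for v in vectors]  # chunks[rank][chunk_index]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of
    # chunk (r+1) % n. Snapshot the sends so all transfers in a step
    # happen simultaneously, as they would on the wire.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n])
                 for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] += val

    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n])
                 for r in range(n)]
        for r, c, val in sends:
            chunks[(r + 1) % n][c] = val

    return chunks

# Two XPUs, each contributing a 2-chunk vector; both end with the sum.
assert ring_allreduce([[1, 2], [3, 4]]) == [[4, 6], [4, 6]]
```

Each of the 2(n-1) steps is a tightly synchronized exchange, which is why the data link layer's latency and loss behavior matters so much: one stalled transfer stalls the whole ring.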

Introducing ESUN: Ethernet for Scale-Up Networking 

Recognizing the importance of addressing real-world AI use cases, an ecosystem of industry leaders consisting of AMD, Arista, ARM, Broadcom, Cisco, HPE, Marvell, Meta, Microsoft, Nvidia, OpenAI, and Oracle has joined together to jump-start the ESUN initiative within OCP. Unveiled at the OCP Global Summit in October 2025, Ethernet for Scale-Up Networking is an open OCP workstream committed to the goal of open, standards-based solutions for scale-up, based on Ethernet, and open to all. It will leverage the work of IEEE and UEC for Ethernet where possible, with building blocks in three layers, as shown in the figure below.

  1. Common Ethernet Headers for Interoperability: ESUN will build on top of Ethernet in order to enable the widest range of upper-layer protocols and use cases.
  2. Open Ethernet Data Link Layer: Provides the foundation for AI collectives with high performance at XPU cluster scale, where even minor delays can stall thousands of concurrent operations. By selecting standards-based mechanisms such as Link Layer Retry (LLR), Priority Flow Control (PFC), and Credit-Based Flow Control (CBFC), ESUN enables cost efficiency and flexibility alongside performance for these networks.
  3. Ethernet PHY Layer: By relying on the ubiquitous Ethernet physical layer, interoperability across multiple vendors and a wide range of optical and copper interconnect options is assured.
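
As a sketch of the first layer, the snippet below frames a scale-up payload in a standard 14-byte Ethernet header. The payload contents are a placeholder, and the EtherType shown is the IEEE 802 local experimental value (0x88B5), used here only as a stand-in; the actual ESUN header assignments are defined by the OCP workstream.

```python
import struct

def build_frame(dst_mac: bytes, src_mac: bytes, payload: bytes) -> bytes:
    """Prepend a standard Ethernet header (dst MAC, src MAC, EtherType)
    to a scale-up payload. 0x88B5 is the IEEE 802 local experimental
    EtherType, a placeholder for an ESUN-defined value."""
    ETHERTYPE_EXPERIMENTAL = 0x88B5
    header = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_EXPERIMENTAL)
    return header + payload

frame = build_frame(b"\x02\x00\x00\x00\x00\x01",   # destination XPU (made up)
                    b"\x02\x00\x00\x00\x00\x02",   # source XPU (made up)
                    b"scale-up transaction bytes")  # placeholder payload
```

Because the frame is plain Ethernet up to the EtherType, any standards-compliant switch, NIC, PHY, or capture tool can parse and forward it, which is the interoperability point of the first building block.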

Figure: At the heart of ESUN is a modular framework for Ethernet scale-up, with defined Ethernet headers, Ethernet data link layer functions, and well-understood Ethernet PHYs as three key building blocks, supported by twelve industry leaders.

ESUN is designed to support any upper-layer transport, including one based on SUE-T. SUE-T (Scale-Up Ethernet Transport) is a new OCP workstream, seeded by Broadcom’s contribution of SUE (Scale-Up Ethernet) to OCP. SUE-T looks to define functionality that can be easily integrated into an ESUN-based XPU for reliability, scheduling, load balancing, and transaction packing, which are critical performance enhancers for some AI workloads.
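
Transaction packing can be illustrated with a small, hypothetical sketch: many short memory transactions are coalesced into one frame so that per-frame overhead is amortized. The payload budget and per-transaction header size below are made-up parameters, not SUE-T values.

```python
def pack_transactions(txn_sizes, max_payload=4096, per_txn_header=8):
    """Greedily coalesce transactions into frames.

    txn_sizes: byte lengths of individual memory transactions.
    Each transaction also carries a small per-transaction header
    (per_txn_header bytes, an assumed figure). Returns a list of
    frames, each a list of the transaction sizes it carries.
    """
    frames, current, used = [], [], 0
    for size in txn_sizes:
        cost = size + per_txn_header
        if used + cost > max_payload and current:
            frames.append(current)      # frame is full: flush it
            current, used = [], 0
        current.append(size)
        used += cost
    if current:
        frames.append(current)
    return frames

# Fifty 100-byte transactions share 13 frames instead of paying
# full Ethernet framing overhead 50 times.
frames = pack_transactions([100] * 50, max_payload=512)
```

Without packing, each 100-byte transaction would pay the full preamble, header, FCS, and inter-packet gap on its own, so coalescing directly raises effective bandwidth for small-transaction workloads.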

ESUN Workstreams for Powerful Compute Networks 

In essence, the ESUN framework enables a collection of individual accelerators to function as a single, powerful AI supercomputer, where network performance directly correlates with the speed and efficiency of AI model development and execution. The layered approach of ESUN and SUE-T over Ethernet promotes innovation without fragmentation. XPU accelerator developers retain flexibility on host-side choices such as access models (push vs. pull, and memory vs. streaming semantics), transport reliability (hop-by-hop vs. end-to-end), ordering rules, and congestion control strategies, while preserving their system design choices. The ESUN initiative takes a practical approach of iterative improvement. Initial candidate focus areas are:

  • L2/L3 Framing - Encapsulating AI headers in Ethernet for low-latency, high-bandwidth workloads.
  • Error Recovery - Detecting and correcting bit errors without compromising performance.
  • Efficient Headers - Optimized headers to improve wire efficiency.
  • Lossless Transport - Leveraging standard mechanisms to prevent congestion drops in the network, critical for some AI workloads.
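
The wire-efficiency motivation is simple arithmetic. The fixed Ethernet costs below (8-byte preamble/SFD, 4-byte FCS, 12-byte inter-packet gap) are standard; the 16-byte "compact header" is a made-up placeholder, not an ESUN number, used only to show how trimming headers shifts the ratio.

```python
def wire_efficiency(payload, header, fcs=4, preamble=8, ipg=12):
    """Fraction of line rate carrying payload for one Ethernet frame.

    preamble includes the start-frame delimiter; ipg is the minimum
    inter-packet gap. header covers everything between the preamble
    and the payload (L2 header plus any upper-layer headers).
    """
    total = preamble + header + payload + fcs + ipg
    return payload / total

# 256 B payload: full Ethernet+IPv4+UDP stack (14+20+8 = 42 B of
# headers) vs. an assumed 16 B compact scale-up header.
full = wire_efficiency(256, 14 + 20 + 8)   # ~0.795
compact = wire_efficiency(256, 16)         # ~0.865
```

At the small payload sizes typical of fine-grained memory transactions, shaving tens of header bytes per frame recovers several percent of raw link bandwidth, which compounds across thousands of XPUs.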

By aligning with the initial ecosystem of twelve prestigious industry leaders, we help our community of customers, standards bodies, and vendors converge quickly on the specifications and implementations that matter most for practical use cases, enabling fast iteration as requirements evolve.

Welcome to the new era of ESUN – Ethernet for Scale-Up Networking!

References:

OCP ESUN BLOG

Netdi White Paper
