Delivering Reliable AI and Cloud Networking

The explosive growth of generative AI and the demands of massive-scale cloud architectures have fundamentally redefined data center networking infrastructure. These new network requirements call for systems that are exceptionally reliable and resilient, while addressing unprecedented challenges in power, thermal management, signal integrity, and physical connectivity. Arista’s approach has always been a combination of high-performance hardware and purpose-built software to meet the radix, latency, and reliability requirements of networks at scale. A critical aspect of this platform approach is a diagnostics layer that can validate, monitor, and troubleshoot both the passive and active components of these complex platforms.

Introducing Arista Blue Box: Engineered for Scale and NOS Optionality

Arista’s customers have long appreciated the need for high-quality products, deep network troubleshooting capabilities, and the ability to manage complex hardware platforms throughout their long lifecycle. A core component of Arista’s flagship EOS® is a rich Network Diagnostic Infrastructure called Netdi™. Netdi leverages over 7,000 person-years of EOS development experience and is a hallmark of Arista’s engineering DNA, ensuring quality, reliability, and validation of physical and low-level features.

Arista Blue Box is designed to offer a superior and compelling solution over commodity white box alternatives. While Arista EOS is the most robust open Network Operating System (NOS) for mission-critical deployments, many operators with in-house expertise may choose other Linux-based open-source alternatives, such as SONiC and FBOSS, for some of their use cases. Many operators opt for hybrid leaf-spine topologies where the leaf, based on an open-source NOS, is connected to an Arista EOS-based spine to achieve the best of both worlds. In this diverse NOS environment, Netdi software ensures the Arista Blue Box hardware is thoroughly validated in both roles. This consistent, world-class experience delivers reliable networks at scale and provides lifecycle advantages regardless of the operating system.

Figure 1: Netdi delivers validated NOS-agnostic quality and reliability for every Arista platform

Netdi Ensures High Quality Platforms from Concept to Scale 

Developing a new network platform presents significant technical and logistical challenges. The comprehensive Arista Netdi suite accelerates operational deployment with high quality and reliability. Our approach begins with the “right first time” engineering and regression testing framework, which is crucial to ensuring exacting quality standards at scale. Arista’s deep expertise in Signal Integrity (SI) and Power Integrity (PI) produces innovative hardware designs that ensure superior channel performance. As optics account for a larger share of the total network power budget, Arista is focused on low-power solutions, such as Linear-Drive Pluggable Optics (LPO), and best-in-class thermal management to deliver the lowest-power networking solutions. These solutions reduce Total Cost of Ownership (TCO) by minimizing power consumption. Arista designs are reinforced by a rigorous, multi-faceted validation process, including:

  • Stress Testing: Platforms undergo Electrical (EDVT), Mechanical (MDVT), Emissions and Radio Frequency Interference (EMI), Highly Accelerated Life Testing (HALT), and Reliability Demonstration Testing (RDT) on large sample sizes.
  • Scale Testing: A dedicated hardware fleet runs a 24/7 automated validation framework, executing over 250,000 automated tests daily. These tests validate platform stability under extreme conditions, including component failures, reboots, hotswaps, and link flaps, to guarantee successful large-scale deployment.
  • Lifecycle Management: Arista’s software ecosystem provides a transparent and globally synchronized system to manage each product’s lifecycle. We centrally track milestones, manage engineering changes, and forecast builds for our contract manufacturers. We have developed custom tools and visualizations to identify supply chain issues, manage logistics, control design and component changes, and analyze manufacturing metrics. This integrated approach to engineering development, production, and supply chain planning ensures a high-quality product that can rapidly scale to high volumes.

From Good Enough to Better and Best Network Resilience and Operations

In complex, multi-NOS AI/Cloud environments, Netdi bridges the support gap that often emerges between hardware, software, and interconnect configurations. While white boxes are sometimes “good enough”, Arista Netdi brings better capabilities with Blue Box and the best resilience with the combination of EOS and Netdi. Examples include:

  • Secure Boot: Arista’s custom bootloader provides a reliable and secure boot layer for any NOS, ensuring image redundancy and fallback to a known-good image if the primary is corrupted. The secure boot facilities leverage measured boot and hardware attestation via a tamper-resistant Trusted Platform Module (TPM).
  • SEU Resiliency: Netdi helps us design robust mechanisms to address Single Event Upsets (SEUs), momentary bit flips that can be catastrophic in AI workloads. Arista platforms include software and hardware-assisted mechanisms that rapidly detect and automatically correct these events. We rigorously test them with error injection and live neutron beam testing to validate software and hardware resilience to SEUs.
  • Enhanced Support: By generating a rich stream of data, from design validation to production analytics, the Netdi framework simplifies support and facilitates a shift from reactive Technical Assistance Center (TAC) models to proactive troubleshooting. Leveraging Arista Autonomous Virtual Assist (AVA™) and our state-based Network Data Lake (NetDL™) architecture, our specialized TAC teams can provide faster, more accurate diagnoses to customer operations for a cohesive support model.
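
To make the detect-and-correct idea behind the SEU Resiliency point more concrete, here is a minimal sketch using a textbook Hamming(7,4) code in Python. It is purely illustrative and is not Arista's or Netdi's actual mechanism, which this post does not describe; the function names `encode` and `decode` are hypothetical. It shows how a few parity bits let software or hardware locate and repair a single flipped bit.

```python
# Illustrative only: a textbook Hamming(7,4) code, NOT Arista's or Netdi's
# implementation. Function names are hypothetical.

def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Layout by position: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    # Bit i of the returned integer holds position i + 1.
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int) -> tuple:
    """Return (nibble, corrected_position); position 0 means no error seen."""
    bits = [(word >> i) & 1 for i in range(7)]
    # XOR of the positions of all set bits is zero for a valid codeword
    # and equals the flipped position after a single-bit upset.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1  # repair the flipped bit in place
    d = [bits[2], bits[4], bits[5], bits[6]]
    nibble = sum(b << i for i, b in enumerate(d))
    return nibble, syndrome


if __name__ == "__main__":
    stored = encode(0b1011)
    upset = stored ^ (1 << 4)  # simulate an SEU flipping position 5
    value, fixed_at = decode(upset)
    print(value == 0b1011, fixed_at)
```

A code this small corrects only one flipped bit per codeword; it is meant to show the principle, and error injection of the kind described above is how such a mechanism would be exercised in testing.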

Commitment to Open Standards and Ease of Deployment

Arista’s commitment to open ecosystems, evidenced by deep engagement with the SAI community and our Premier Membership in the SONiC Foundation, is a natural part of Netdi’s platform agnosticism. We actively lead and contribute to standards and consortia like the Ultra Ethernet Consortium (UEC), OSFP MSA, and LPO MSA. This empowers customers with open choices while adding value to enable large cloud, AI or enterprise networks. Most recently, we were proud to become a founding member of the Ethernet for Scale Up Networks (ESUN) Initiative alongside other industry luminaries.

The Arista Blue Box, based on Netdi, is another natural innovation rooted in our core DNA of quality, continuous integration, and infrastructure deployments. The Arista advantage lies in our commitment to how we build, not just what we build. Netdi is another pioneering suite of innovations that spans concept to deployment, ensuring the most resilient, high-performance Arista Blue Box design, and empowering customers to build and operate demanding networks at scale and with confidence.

Welcome to the new world of better and best networking with Netdi!

References:

Netdi White Paper
