Redefining Cloud-Networking Resilience & Visibility

Jayshree Ullal : Oct 1, 2013 7:36:33 PM

In modern two-tier leaf-spine cloud networks, the growing dominance of east-west traffic patterns, the sheer volume of traffic, and the increase in data rates from 10G to 40G/100G combine to make it challenging to predict and analyze performance issues proactively. The scale involved in connecting 100K+ physical servers, 1M+ virtual machines and large big data storage elements is redefining expectations of a resilient network. Self-healing networks and new levels of visibility are no longer optional; they are mandatory.

So how is today’s network software doing in these environments?

Self Healing & Programmable Software is Paramount

Most network operating systems are unfortunately decades old and were designed for the enterprise data center applications of the 1990s. Traditional vendors do try valiantly to band-aid these legacy operating systems, but modern networks need a new ground-up architecture. Arista’s Extensible OS (EOS) is the ONLY purpose-built data center networking software to address this requirement. EOS was designed with support for mission-critical clouds and data centers as its primary goal. Brilliant engineers lead the EOS team, and our SVP of Software Engineering and CTO, Ken Duda, pioneered the architecture. Indeed, it is this feat of software engineering that drew me to the company several years ago.

Published studies have shown that, over time, the operational costs of running a network are many times higher than the capital expenditures. The cost of operational downtime from lack of visibility into the network infrastructure is estimated to be $5,600 per minute. At that rate, an outage of just three hours costs more than a million dollars. Cloud-scale operators must reduce downtime and detect, isolate, and resolve application performance problems proactively in order to meet their customers’ expectations and Service Level Agreements.
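To make the downtime arithmetic concrete, here is a minimal sketch in Python. The only number taken from the text is the $5,600-per-minute estimate; the function name `outage_cost` and the sample outage durations are assumptions chosen for illustration.

```python
# Estimated cost of network downtime, using the per-minute figure quoted above.
COST_PER_MINUTE_USD = 5_600


def outage_cost(hours: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return the estimated cost in USD of an outage lasting `hours` hours."""
    return hours * 60 * cost_per_minute


# Illustrative durations (hypothetical, not from the source).
for hours in (1, 3, 4):
    print(f"{hours} h outage: ${outage_cost(hours):,.0f}")
# 1 h outage: $336,000
# 3 h outage: $1,008,000
# 4 h outage: $1,344,000
```

At this rate, the million-dollar mark is crossed just before the three-hour point.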

The secret sauce of Arista EOS is a multi-process state-sharing architecture that is self-healing and exposes open APIs to enable programmability. EOS keeps all system state in a central database (Sysdb), which validates that state and propagates updates to the agents that depend on it. The schema-specific code in Sysdb is machine generated, providing the performance of hand-written code without the errors. The stateful publish-subscribe approach of EOS is intrinsically deterministic, borrowing heavily from the world of databases, where state survives application shutdown. Many alternate data center vendors claim “improved” operating systems, yet they deploy archaic message-passing schemes in which agents send messages back and forth to convey state, adding complexity and delays. Archaic check-pointing services are often deployed for restart only, which can be error-prone because agents read their checkpoints only during a restart, not all the time. In EOS, initialization and restart of agents are handled consistently through the same repository, without reliance on a special recovery path.
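The difference between state sharing and message passing can be illustrated with a toy sketch in Python. This is not Sysdb or any Arista API: the `StateStore` and `Agent` classes, their methods, and the `interfaces/Ethernet1/status` path are all hypothetical names invented for this example. The point it tries to show is that an agent starting for the first time and an agent restarting after a failure follow the same code path, because both simply subscribe and receive the current state from the central store.

```python
from collections import defaultdict
from typing import Any, Callable

Callback = Callable[[str, Any], None]


class StateStore:
    """Toy central state database with publish-subscribe (hypothetical, not Sysdb)."""

    def __init__(self) -> None:
        self._state = {}                      # path -> current value
        self._subscribers = defaultdict(list)  # path -> list of callbacks

    def subscribe(self, path: str, callback: Callback) -> None:
        """Register a callback and immediately replay the current value, if any."""
        self._subscribers[path].append(callback)
        if path in self._state:
            callback(path, self._state[path])

    def write(self, path: str, value: Any) -> None:
        """Store the new value, then notify every subscriber of that path."""
        self._state[path] = value
        for callback in list(self._subscribers[path]):
            callback(path, value)


class Agent:
    """Toy agent that holds no private checkpoint; its view comes from the store."""

    def __init__(self, name: str, store: StateStore, path: str) -> None:
        self.name = name
        self.view = {}
        store.subscribe(path, self._on_update)

    def _on_update(self, path: str, value: Any) -> None:
        self.view[path] = value


PATH = "interfaces/Ethernet1/status"  # hypothetical state path

store = StateStore()
store.write(PATH, "up")

agent = Agent("routing", store, PATH)   # first start: learns "up" from the store
store.write(PATH, "down")               # update is pushed to the running agent

# Simulated restart: a fresh instance subscribes exactly as on first start.
# (A real system would also unsubscribe the old instance; omitted here.)
restarted = Agent("routing", store, PATH)
print(restarted.view)
```

Because the store outlives the agent, the restarted instance ends up with the same view as the one it replaced, without a separate checkpoint-reading step.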

Virtual to Physical to Application Visibility

To reduce downtime and save costs, dynamic network troubleshooting and monitoring tool sets are needed. We must provide both fine-grained visibility into application performance and more global, network-wide monitoring capabilities. How can you capture, analyze and troubleshoot traffic between two virtual servers when there are literally hundreds of paths between the racks where the servers are located and the exact location of each server is unknown?

Arista Network Telemetry works in conjunction with applications so that the network is not in the way anymore. It dramatically reduces application downtime and network operational costs through improved real-time system and network performance visibility, correlation to application behavior and advanced end-to-end path monitoring tools. This saves millions of dollars and hours of downtime. Arista Tracers are enhancements to the Arista Network Telemetry application that bring deeper application level visibility by integrating with distributed applications like Big Data, Cloud, and Virtualized environments (see Figure below).

Figure: Arista EOS Tracer Technologies
Examples such as Health Tracer, Path Tracer, VM Tracer and MapReduce Tracer redefine resiliency and visibility.

The programmable foundation of Arista EOS combined with Network Tracers provides a real-world solution to the real-world problems of cloud network visibility, monitoring and troubleshooting. It enables tight linkages between the physical, virtual and application infrastructure that result in considerable savings in operational expenditures.

Welcome to the new world of software defined cloud networking, with increased visibility, lower operational costs and reduced downtime. I look forward to your comments at: feedback@arista.com