
How Robust Is Your Cloud Network?

Jayshree Ullal : Mar 1, 2009 11:14:53 PM

Jayshree Ullal 2009

The IT industry has touted networking OS advantages for many generations. Cisco was the pioneer and leader in the 1990s with Cisco IOS, rich in features, albeit highly monolithic. In the second decade, both Juniper and Cisco developed targeted service provider-class software, JUNOS and IOS-XR, with more modularity. As we enter the third decade and the era of data center and cloud computing, the scale of connecting 10,000-100,000 clusters of compute and storage elements redefines the expectation of a resilient network.

So how robust, really, are the software architectures of today's data centers? Can they cope with the demands of emerging clouds and scale? It all depends on how well they are inherently designed and architected. A new ground-up architecture that handles reboots and intrusive restarts responsively is imperative. This, in my view, simply cannot be achieved with incremental enhancements to an existing OS. Arista's Extensible OS (EOS) and Cisco's NX-OS are the only two examples of purpose-built data center networking software that I, at least, am aware of.

EOS's architectural design is perhaps best illustrated by a real customer experience recalled fondly by our VP of Software Engineering, Ken Duda. During early field trials with our 71XX products, one customer observed a rather high bit-error rate on links between our switch and a third-party 10GE NIC. We immediately rolled new settings into a patch in the form of an updated PHY driver, which our customer then installed live. EOS restarted the new agent within one second, while all other parameters in the switch remained unchanged and were unaffected by the fix. The customer was delighted with this type of live patching without scheduled downtime in a data center. "Vendors have been promising me this type of OS for years, and finally one has delivered!" he declared approvingly.

The secret sauce of Arista's EOS is a multi-process state-sharing architecture. EOS defines its state using a central database (Sysdb) that holds and validates all system state and propagates updates to the agents. The schema-specific code in Sysdb is machine generated, providing the performance of hand-written code without the errors.
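As a rough illustration of the state-sharing idea, and not a depiction of how Sysdb is actually built, the sketch below models a central store that validates writes and pushes updates to attached agents. Every name in it (`StateStore`, `Agent`, the `phy/...` and `port/...` keys, the validation ranges) is hypothetical and chosen only for this example. The point it tries to show is that an agent started for the first time and an agent restarted after a patch both build their view through the same `attach` call.

```python
from typing import Any, Callable, Dict, List

Validator = Callable[[Any], bool]
Listener = Callable[[str, Any], None]


class StateStore:
    """Hypothetical central state store: holds, validates, and propagates state."""

    def __init__(self) -> None:
        self._state: Dict[str, Any] = {}
        self._validators: Dict[str, Validator] = {}
        self._listeners: List[Listener] = []

    def register(self, key: str, validator: Validator) -> None:
        # Declare a key and the rule its values must satisfy.
        self._validators[key] = validator

    def write(self, key: str, value: Any) -> None:
        # Reject unknown keys and invalid values before they reach any agent.
        validator = self._validators.get(key)
        if validator is None or not validator(value):
            raise ValueError(f"rejected {key}={value!r}")
        self._state[key] = value
        for listener in list(self._listeners):
            listener(key, value)

    def attach(self, listener: Listener) -> None:
        # Replay the current state, then subscribe to future updates.
        for key, value in self._state.items():
            listener(key, value)
        self._listeners.append(listener)

    def detach(self, listener: Listener) -> None:
        self._listeners.remove(listener)


class Agent:
    """Hypothetical agent whose local view is derived entirely from the store."""

    def __init__(self, name: str, store: StateStore) -> None:
        self.name = name
        self.view: Dict[str, Any] = {}
        self._store = store
        store.attach(self._on_update)  # same path for first start and restart

    def _on_update(self, key: str, value: Any) -> None:
        self.view[key] = value

    def stop(self) -> None:
        self._store.detach(self._on_update)


store = StateStore()
store.register("phy/tx_preemphasis", lambda v: isinstance(v, int) and 0 <= v <= 15)
store.register("port/1/admin_up", lambda v: isinstance(v, bool))

phy = Agent("phy", store)
fwd = Agent("fwd", store)
store.write("port/1/admin_up", True)
store.write("phy/tx_preemphasis", 7)

phy.stop()                            # simulate the PHY agent going down for a patch
store.write("phy/tx_preemphasis", 9)  # state changes while the agent is down
phy = Agent("phy", store)             # restarted agent rebuilds its view from the store

print(phy.view == fwd.view)
```

In this sketch the restarted `phy` agent needs no recovery code of its own: the update it missed while stopped is already in the store, and the untouched `fwd` agent never notices the restart.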
This stateful approach of EOS is intrinsically more deterministic, borrowing heavily from the world of databases, where the state survives application failure.

Alternative data center architectures may deploy a message-passing scheme, where agents interact by sending messages back and forth to convey state, adding delays. Checkpointing services are often deployed for restart only, which can be error-prone, because agents read their checkpoints only on a restart rather than all the time. EOS is different. Initialization and restart of agents are both handled consistently through the same stateful repository, without reliance on recovery code.

Arista EOS customer Paul Mullen, Project Manager of HEAnet in Ireland, sums it up well: "It was immediately clear that Arista's 7100 is a game changer delivering the kind of performance and availability that heretofore would not have been possible."

I am excited too! Welcome to the new decade of cloud and data center networking, delivering self-healing innovations never seen before in classic networks.

I welcome your comments at feedback@arista.com