Artificial intelligence, machine learning and deep learning have to be among the most popular tech-words of the past few years, and I was hoping that I wouldn’t get swept away by them. But when I heard a panel on the state of AI networks at our customer event this month, I found it fascinating and it piqued my curiosity! Let me start with a disclaimer for readers who are expecting a deep tutorial from me. There is a vast amount of research behind the models and algorithms in this field that I will not even attempt to cover. Instead I will try to share some thoughts on the practical relevance of this promising field.
Behind the Buzz Words
AI and machine learning have been around for a long time. The difference now is that there is much more powerful compute and network infrastructure available, along with exponentially more data to analyze. Efficient data movement is critical across the whole deep learning lifecycle of ingesting, processing and inferring from data, and improvements at each stage are creating higher-layer abstractions that let data scientists quickly develop and train models.
The result is that problems previously in the realm of the impossible, such as real-time language translation, fraud detection, and autonomous vehicle control, are being addressed with neural network models that detect patterns and behaviors across huge amounts of structured and unstructured data. As an example, an AI program that learns Van Gogh paintings can match them with similar new paintings. While the human brain may be better at detecting deeper meaning and “conscious thought”, AI is radically increasing the benefits of “raw intelligence.” The continuing goal is to minimize the cycle time, both for developing new algorithms and models and for scaling AI applications to serve billions of devices in real time. Scientists can now reduce their research time from years to hours for trials and studies. Machine learning training is typically implemented in floating point, which is why NVIDIA GPUs have been so popular here, while inference is typically done in integer logic. This combination delivers the most training and inference performance at the lowest cost and power. It also allows these systems to be tuned for AI applications.
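To make the floating point versus integer distinction concrete, here is a minimal Python sketch of symmetric 8-bit quantization, one common way to run inference in integer arithmetic on weights that were trained as floats. The function names, the sample values and the choice of a single shared scale per vector are assumptions made for illustration, not a description of any particular GPU or framework.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int_dot(q_a, q_b):
    """Dot product computed entirely in integer arithmetic."""
    return sum(a * b for a, b in zip(q_a, q_b))


# Hypothetical trained weights and one input vector (illustrative values).
weights = [0.42, -1.30, 0.07, 0.88]
inputs = [1.50, 0.25, -0.60, 2.00]

q_w, s_w = quantize_int8(weights)
q_x, s_x = quantize_int8(inputs)

# Floating point reference, as used during training.
float_result = sum(w * x for w, x in zip(weights, inputs))
# Integer multiply-accumulate, rescaled to a float once at the end.
int_result = int_dot(q_w, q_x) * s_w * s_x

print(f"float: {float_result:.4f}  int8: {int_result:.4f}")
```

The integer result tracks the floating point one closely while each multiply-accumulate works on 8-bit operands, which is the source of the cost and power savings at inference time.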
The Network Relevance
Within a typical AI appliance, multiple GPUs are interconnected with very high-speed chip-to-chip interfaces. The NVIDIA DGX-1 with Volta system can interconnect 8 GPU chips with NVLink into a hybrid cube-mesh topology, which is then packaged together with general purpose CPUs. 100G Ethernet and RDMA over Converged Ethernet (RoCE) can be used to enable any GPU in the network to access any other GPU's memory. The same high-performance Ethernet network that links DGX-1 servers to each other also connects them to storage devices (such as Pure Storage FlashBlade), vastly simplifying AI system configuration and deployment. The NVIDIA DGX-1 system starts with four 100G networking ports that deliver a total of 400G, or 50 Gigabytes/sec, of throughput, which is 4 times the network bandwidth of general purpose servers in cloud networks.
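The throughput figure is simple arithmetic: four ports at 100 gigabits per second each, divided by eight bits per byte. A small Python sketch of that conversion follows; the function name is hypothetical, and the calculation reports raw line rate only, ignoring Ethernet and RoCE protocol overhead.

```python
BITS_PER_BYTE = 8


def aggregate_throughput(ports, gbps_per_port):
    """Return (total gigabits/sec, total gigabytes/sec) at raw line rate."""
    total_gbps = ports * gbps_per_port
    total_gigabytes_per_sec = total_gbps / BITS_PER_BYTE
    return total_gbps, total_gigabytes_per_sec


# Four 100G ports, as described for the DGX-1 above.
total_gbps, total_gbytes = aggregate_throughput(ports=4, gbps_per_port=100)
print(f"{total_gbps} Gb/s = {total_gbytes} GB/s")
```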
Cognitive Networking Implications
AI servers, together with an Arista leaf-and-spine network and storage appliances, can form an important AI nucleus. We have tested these solutions with NVIDIA and Pure Storage to offer the highest IO density per appliance. The common theme is that both AI storage and networking have an insatiable need for bandwidth to feed these powerful applications. The NVIDIA DGX-1 system occupies just a 3U footprint, with four 100G interfaces to ingest up to 50 Gigabytes/second.
Cloud titans can migrate easily between different kinds of AI workloads without compromising their AI applications. This improves monetization, for example by optimizing the ad or movie they recommend to drive real-time user experiences. Yet the potential goes beyond the cloud to enterprises as well. In small steps, Arista has already begun its journey through CloudVision’s® machine learning implementations. If there is an abnormal traffic rate, the anomaly is quickly pinpointed and corrected.
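As a rough illustration of what flagging an abnormal traffic rate can look like, the Python sketch below compares each new sample against a rolling baseline and marks it when it deviates by more than a chosen number of standard deviations. This is a generic rolling z-score check, not CloudVision's actual method; the function name, window size, threshold and sample data are all assumptions.

```python
from statistics import mean, stdev


def find_rate_anomalies(rates, window=5, threshold=3.0):
    """Return indexes of samples that deviate sharply from the preceding window."""
    anomalies = []
    for i in range(window, len(rates)):
        baseline = rates[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as anomalous.
            if rates[i] != mu:
                anomalies.append(i)
            continue
        if abs(rates[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-interval traffic rates in Gb/s, with one obvious spike.
samples = [40.1, 39.8, 40.3, 40.0, 39.9, 40.2, 95.0, 40.1, 39.7]
print(find_rate_anomalies(samples))
```

A production system would draw on far richer telemetry and learned models; the point here is only the basic shape of comparing live rates against an expected baseline.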
At Arista we are on the cusp of building new, transformative technologies in our Arista EOS® architecture for machine learning, telemetry and failure mitigation. I am excited by the prospects ahead in the decade of transformation and innovation. Welcome to 2018 and the age of cognitive cloud networking.