Systems Approach
Since our recent posts have been retracing the history, and recounting the lessons, of software-defined networking, it seems like a good time to do the same with the other sea-change in networking over the past decade: network functions virtualization, or NFV.
Most people trace NFV back to the call-to-action [PDF] several network operators presented at the Layer123 SDN & OpenFlow World Congress in October 2012.
My involvement happened over several months before that presentation, working with colleagues from BT, Intel, HP, Wind River, and TailF on a proof-of-concept that gave the idea enough credibility for those operators to put the NFV stake in the ground.
What is NFV?
Simply put, network functions virtualization is a way to run various network services inside a collection of virtual machines that run on host servers. That means you can build, say, a telco operation or your enterprise network backbone out of commodity boxes, and run all your routing, caching, and monitoring workloads, and whatnot, in VMs on that hardware. Your network can be scaled up as needed, you don't need dedicated expensive hardware, and you can form your own sorta cloud. Or so the dream went.
At the time, I was at CDN software startup Verivue, which contributed what we believed to be the canonical virtualized network function (VNF) — a scalable web cache. Don Clarke, who led the entire effort on behalf of BT, has an excellent 4-part history of the journey, of which my own direct experience was limited to part one.
Two things stand out about the proof-of-concept. The first is the obvious technical goodness of colocating an access network technology (such as a virtualized broadband gateway) and a cloud service (such as a scalable cache) on a rack of commodity servers. The proof-of-concept itself, as is often the case, was hamstrung by our having to cobble together existing software components that were developed for different environments. But it worked and that was an impressive technical accomplishment.
The second is that the business minds in the room were hard at work building a return-on-investment case to pair with the technical story. Lip service was given to the value of agility and improved feature velocity, but the cost savings were quantifiable, so they received most of the attention.
Despite being widely heralded at the start, NFV has not lived up to its promise. There have been plenty of analyses as to why (for example, see here and here), but my take is pretty simple.
The operators were looking at NFV as a way to run purpose-built appliances in virtual machines on commodity hardware, but that's the easy part of the problem. Simply inserting a hypervisor between the appliance software and the underlying hardware might result in modest cost savings by enabling server consolidation, but it falls far short of the zero-touch management win that modern datacenter operators enjoy when deploying cloud services.
In practice, telco operators still had to deal with N one-off VM configurations to operationalize N virtualized functions. The expectation that NFV would shift their operational challenge from caring for pets to herding cattle did not materialize; the operators were left caring for virtualized pets.
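The pets-versus-cattle distinction can be made concrete with a small sketch. What follows is a hypothetical illustration in Python, not code from Kubernetes or from any NFV platform: the function names and replica counts are invented. The point is that one generic routine compares a declared desired state against what is observed and works for any number of functions, whereas the pets model needs a separate hand-maintained procedure for each one.

```python
def reconcile(desired, observed):
    """Return the actions needed to move `observed` toward `desired`.

    Both arguments map a function name to a replica count. The same
    loop handles every function; nothing here is specific to one VNF.
    """
    actions = []
    for name in sorted(set(desired) | set(observed)):
        want = desired.get(name, 0)
        have = observed.get(name, 0)
        if want > have:
            actions.append(("start", name, want - have))
        elif want < have:
            actions.append(("stop", name, have - want))
    return actions


# Hypothetical declared and observed state, for illustration only.
desired = {"cache": 3, "firewall": 2, "gateway": 1}
observed = {"cache": 1, "firewall": 2, "legacy-probe": 1}

if __name__ == "__main__":
    for action in reconcile(desired, observed):
        print(action)
```

Adding a new function in this model means adding one entry to the declared state, not writing a new operational procedure, which is roughly the shift the operators were hoping for.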
Streamlining operational processes is hard enough under the best circumstances, but the operators approached the problem with the burden of preserving their legacy O&M practices (ie, they were actively avoiding changes that would enable streamlining). In essence, the operators set out to build a telco cloud through piecemeal adoption of cloud technologies (starting with hypervisors). As it turned out, however, NFV set in motion a second track of activity that is now resulting in a cloud-based telco. Let me explain the difference.
Looking at the NFV PoC with the benefit of hindsight, it's clear that standing up a small cluster of servers to demo a couple of VNFs side-stepped the real challenge, which is to repeat that process over time, for arbitrarily many VNFs. This is the problem of continuously integrating, continuously deploying, and continuously orchestrating cloud workloads, which has spurred the development of a rich set of cloud native tools including Kubernetes, Helm, and Terraform.
Such tools weren't generally available in 2012, although they were emerging inside hyperscalers, and so the operators behind the NFV initiative started down a path of (a) setting up an ETSI-hosted standardization effort to catalyze the development of VNFs, and (b) retrofitting their existing O&M mechanisms to support this new collection of VNFs. Without evaluating the NFV reference architecture point-by-point, it seems fair to say that wrapping a VNF in an element management system (EMS), as though it were another device-based appliance, is a perfect example of how such an approach does not scale operations.
Meanwhile, the laudable goal of running virtualized functions on commodity hardware inspired a parallel effort that existed entirely outside the ETSI standardization process: to build cloud native implementations of access network technologies, which could then run side-by-side with other cloud native workloads. This parallel track, which came to be known as central office rearchitected as a datacenter (CORD), ultimately led to Kubernetes-based implementations of multiple access technologies (eg, PON/GPON and RAN). These access networks run as microservices that can be deployed by a Helm chart on your favorite Kubernetes platform, typically running at the edge (eg, Aether).
Again, with the benefit of hindsight, it's interesting to go back to the two main arguments for NFV (lower costs and improved agility) and see how they have been overtaken by events. On the cost front, it's clear that solving the operational challenge was a prerequisite for realizing any capex savings. What the cloud native experience teaches us is that a well-defined CI/CD toolchain and the means to easily extend the management plane to incorporate new services over time are the price of admission to take advantage of cloud economics.
On the agility front, NFV's approach was to support service chaining, a mechanism that allows customers to customize their connectivity by "chaining together" a sequence of VNFs.
Since VNFs run in VMs, in theory, it seemed plausible that one could programmatically interconnect a sequence of them. In practice, providing a general-purpose service chaining mechanism proved elusive. This is because customizing functionality is a hard problem in general, but starting with the wrong abstractions (a bump-in-the-wire model based on an antiquated device-centric worldview) makes it intractable.
It simply doesn't align with the realities of building cloud services. The canonical CDN VNF is a great example. HTTP requests are not tunneled through a cache because it was (virtually or physically) wired into the end-to-end chain, but instead, a completely separate Request Redirection service sitting outside the data path dynamically directs HTTP GET messages to the nearest cache. (Ironically, this was true during the PoC since the Verivue CDN was actually container-based and built according to cloud native principles, even though it pre-dated Kubernetes.)
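To illustrate the request redirection idea, here is a minimal Python sketch. It is not the Verivue implementation: the hostnames, the coordinates standing in for network distance, and the function names are all assumptions made for the example. The redirector sits outside the data path, picks the closest healthy cache, and answers with an HTTP 302, so the client then fetches the content from that cache directly and nothing is wired into a chain.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Cache:
    host: str            # hypothetical hostname
    x: float             # illustrative coordinates, a stand-in for network distance
    y: float
    healthy: bool = True


def nearest_cache(caches, client_xy):
    """Return the closest healthy cache, or None if none is available."""
    candidates = [c for c in caches if c.healthy]
    if not candidates:
        return None
    cx, cy = client_xy
    return min(candidates, key=lambda c: hypot(c.x - cx, c.y - cy))


def redirect(caches, client_xy, path):
    """Build a (status, headers) reply for an HTTP GET of `path`."""
    cache = nearest_cache(caches, client_xy)
    if cache is None:
        return 503, {}
    return 302, {"Location": f"http://{cache.host}{path}"}


# Hypothetical cache fleet, for illustration only.
CACHES = [
    Cache("cache-east.example.net", 10.0, 0.0),
    Cache("cache-west.example.net", -10.0, 0.0),
    Cache("cache-north.example.net", 0.0, 3.0, healthy=False),
]

if __name__ == "__main__":
    print(redirect(CACHES, (8.0, 1.0), "/video/intro.mp4"))
```

Because the choice is made per request, caches can be added, removed, or marked unhealthy without touching any end-to-end path, which is hard to express in a bump-in-the-wire model.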
A firewall is another example: in a device-centric world, a firewall is a "middlebox" that might be inserted in a service chain, but in the cloud, equivalent access-control functionality is distributed across the virtual and physical switches.
When we look at the service agility problem through the lens of current technology, a service mesh provides a better conceptual model for rapidly offering customers new functionality, with connectivity-as-a-service proving to be yet another cloud service.
But the bigger systems lesson of NFV is that operations need to be treated as a first-class property of a cloud. The limited impact of NFV can be directly traced to the reluctance of its proponents to refactor their operational model from the outset. ®
Larry Peterson and Bruce Davie are the authors of Computer Networks: A Systems Approach and the related Systems Approach series of books. All their content is open source and available on GitHub. You can find them on Twitter, their writings on Substack, and past The Register columns here.