Dumlu Timuralp
May 25, 2021


This is the abstract of a book about the VMware NSX platform that I had started working on with a colleague back in 2019 and that we never managed to finish. I thought it would still be worth sharing, if only to leave a mark in history.

1. How NSX for vSphere revolutionised networking on ESX

Let’s take a step back and remember how things were a decade ago, before the NSX for vSphere platform was around. As some used to say in those days, “right-click networking is what I need”. Application developers and businesses were in desperate need of an infrastructure that could be built fast, almost in the blink of an eye, to serve the respective app. At the end of the day it is all about the business, which differentiates itself through a unique user experience, which in turn is achieved by developing the best applications in the most agile way.

The one thing anyone in the IT industry remembers about networking is that it was one of the major functions slowing down the whole service/application delivery process. Provisioning functions such as switching, routing, firewalling and load balancing just to serve that service/application was based on manual processes that took ages. On the other hand, there was right-click workload provisioning based on compute virtualisation (ESX) by VMware. Things had already become so simple that imaging, preparing and delivering a workload as a virtual machine for the developer’s consumption took only a few minutes, purely because those processes were highly automated. The user experience was clearly starting to dictate the radical changes to come. One indicator: around 2010, according to some independent research, the number of virtual switch ports in datacenters across the globe surpassed the number of physical ports, and the growth rate was almost exponential, for the simple reason that the industry had already started reaping huge financial benefits from compute virtualisation. Virtualisation itself had started fuelling some of the talk around digital transformation, agile, and companies differentiating themselves by putting IT right at the centre of their businesses.

And then NSX came out in 2013, with the promise of a whole new network services delivery experience, built precisely to address that main concern: a highly agile infrastructure to serve the app. VMware coined the term “network virtualisation”, something which is often misused in the context of “Software Defined Networking”. The approach was simple: apply the same principles of compute virtualisation to networking; abstract the networking function and deliver networks totally decoupled from the actual networking hardware itself. Just like instantiating virtual machines on any physical server infrastructure, NSX enabled organisations to create networks and security policies on any kind of network infrastructure, purely in software, in the hypervisor layer.

The technical side was not complex at all: network virtualisation was just another form of tunnelling mechanism, under the new term “overlay networking”, which provides that decoupling from the underlay. In terms of security, the hypervisor became the new enforcement layer for any virtual machine, making the security policy as portable and mobile as the virtual machine itself. “Micro-segmentation” was introduced as a new term. All these capabilities opened up unique opportunities, especially when coupled with automation. The biggest impact on the way of thinking and doing things was delivering the network and security services on the fly with the application, as part of the application blueprint, through various IaaS orchestration platforms (e.g. OpenStack, vRealize Automation), which is the foundation of the concept VMware introduced in 2012 as “SDDC”, the Software Defined Datacenter. Instantiating completely distributed and portable networking functions provided enormous flexibility. In fact, some viewed NSX as yet another application in the infrastructure, one that is highly portable and helps with workload and network/security automation as a whole. By that time, NSX truly had become that ubiquitous software layer which enabled agility in networking and security services for virtual workloads. There was a substantial change in delivering networking and security services for applications, from months or weeks down to days if not hours.
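To make the “tunnelling mechanism” point concrete, here is a minimal, purely illustrative Python sketch of the idea behind an overlay: the VM’s original Ethernet frame is wrapped behind a VXLAN header carrying a 24-bit virtual network identifier (VNI) and shipped between hypervisors inside an ordinary UDP/IP packet, so the logical segment is decoupled from the physical underlay. The frame contents and VNI value below are made up; real encapsulation happens in the hypervisor datapath, not in user code.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags word with the I bit set, then the 24-bit VNI shifted left by 8."""
    return struct.pack("!II", 0x08000000, (vni & 0xFFFFFF) << 8)

# Placeholder for the VM's original Ethernet frame.
inner_frame = b"\x00" * 64

# The overlay payload that rides inside a normal UDP/IP packet (UDP port 4789)
# between the source and destination hypervisors (the VXLAN tunnel endpoints).
overlay_payload = vxlan_header(vni=5001) + inner_frame
print(len(overlay_payload))  # 8-byte header + 64-byte inner frame = 72 bytes
```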

2. What changed from 2013 to now?

As of 2019, gone are the days of discussing whether the virtual switching layer can provide functionality equivalent to physical hardware, whether operationalising an overlay network is ready for prime time, or whether micro-segmentation is a real thing. All these technological developments are almost a commodity and accepted as normal industry-wide. However, looking back at the last 3-5 years, an infinite loop is forming. Let us explain: cloud enables mobile connectivity, mobile endpoints create more data, more data makes artificial intelligence better, and artificial intelligence drives many more use cases at the edge, which in turn drives cloud adoption to store that data and do more computing on it.

The consumption-based, easy-to-consume approach of public cloud has resonated with many organisations, making it a really attractive model for businesses. They started to move more applications and services to public cloud, which led the way to the now common hybrid cloud/IT model of having services and applications in both on-prem and public cloud environments. Each and every business is now looking for different ways to enable IT elasticity and ultimately make IT easy to consume. At the same time it is a big challenge, because the proliferation of both the number and the type of endpoints in private and public cloud environments has been staggering. Which takes us to the other big change over the course of the last couple of years.

By coincidence, around the same timeframe, containerisation took off in the enterprise, led by Docker, as the next big thing in the application development world. It was a huge shift in developing and packaging applications, one which plays extremely well with the DevOps methodology. Often called “operating system virtualisation”, and with Kubernetes as the de facto container orchestration layer, this whole movement enabled developers to build truly portable and completely distributed application architectures (microservices), completely decoupled from the underlying infrastructure.

Microservices is a software development approach in which the application is structured as a collection of loosely coupled services. The biggest impact of this on the infrastructure is that what used to be a monolithic app running on a single server, with different components of that application sending socket calls to each other, has now turned into hundreds of micro components sending HTTP/HTTPS calls to each other through the network infrastructure, which essentially led to the famous phrase “the application is the network”.
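As a toy illustration of that shift (the service name and URL below are hypothetical), here is what the same “check stock” step looks like in a monolith, where it is an in-process function call, versus in a microservices design, where it becomes an east-west HTTP call that now traverses the network and, therefore, its policy:

```python
import requests  # third-party HTTP client, used here purely for illustration

def check_stock_monolith(item_id: str) -> bool:
    """Monolith: the inventory logic lives in the same process as the caller."""
    in_stock = {"sku-1", "sku-2"}
    return item_id in in_stock

def check_stock_microservice(item_id: str) -> bool:
    """Microservices: the same check is an HTTP call to a separate inventory service."""
    resp = requests.get(f"http://inventory.shop.internal:8080/stock/{item_id}", timeout=2)
    return resp.status_code == 200
```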

With all the shifts mentioned above came the next challenge for VMware: evolving the software-based networking and security services layer from an on-prem, ESX-hypervisor-focused solution into a completely infrastructure-, cloud- and, last but not least, workload-agnostic offering. Many organisations emphasise the need for a consistent networking and policy structure across any cloud, any app and any infrastructure. Hence the focus for the NSX platform is extending its unique capabilities from the ESX hypervisor to KVM, from on-prem datacentres all the way to any public cloud environment, and from legacy application infrastructures to any cloud native implementation and to the branch/edge.

3. Challenges of a Network Administrator in a Datacenter (Optimal Routing, Multi-Tenancy, Heterogeneity, On-Demand Network Delivery)

Let’s look back and recall the developments worth mentioning from the last two decades of datacenter networking.

With the advances in the capabilities of newer switching/routing chipsets in the mid-2000s, the conventional three-tier (core-aggregation-access) datacenter network design evolved into much simpler Clos-based switching/routing architectures (leaf-spine), providing identical latency and bandwidth, in terms of physical data plane functionality, for any pair of workloads in the DC.

In the early 2010s, traditional network hardware vendors quickly realised that, with both compute virtualisation and the Infrastructure as a Service (IaaS) approach gaining so much traction in the form of private and public cloud, it had become inevitable to start developing the fundamentals of network automation. Hence newer network hardware all shipped with open Application Programming Interfaces (APIs) and Software Development Kits (SDKs). Suddenly managing and maintaining the configuration state of a datacenter network became easier and simpler.

However, the main challenges for network administrators stayed exactly the same. Let’s talk about those, once again from a historical point of view.

First and foremost is delivering the network on demand, at the same speed an application is put into service. This is almost an art, depending on the networking solution chosen. Today, composing an application in Kubernetes with fully distributed switching, routing, load balancing and firewalling functions takes no more than seconds. Although the introduction of APIs, SDKs and the like in network hardware made it easier to provision the networking layer, in the bigger picture of application and service delivery, constantly trying to integrate that hardware networking layer with the various solutions in the as-a-service orchestration layer could not deliver on its promise, nor could it keep up with the highly dynamic changes in the compute virtualisation and containerisation layers, for a variety of reasons. Let’s highlight an obvious one. Today’s reality is that in a typical datacenter there are many different types of workloads, varying from bare metal servers to virtual machines to containers. This means absolute heterogeneity, which leads to the networking layer having to integrate with various orchestration solutions (e.g. OpenStack, vRealize Automation, OpenShift, Pivotal Container Service, or an in-house platform using Ansible, Terraform or Kubernetes) all at the same time for the most efficient and feasible operational experience. Architecturally, and specific to container orchestration (e.g. Kubernetes), every compute and network function had already been built into the software layer, making the network hardware itself irrelevant to the provisioning workflow. Hence a network administrator needs a networking layer whose management plane can be integrated with as many different orchestration solutions as possible, but whose data plane can at the same time be extended to any infrastructure, in order to address the networking needs of the various workload types.
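As a minimal sketch of what “delivering the network with the application” looks like from the consumer’s side (the “shop” namespace, labels and ports below are hypothetical, and a reachable cluster with local kubeconfig credentials is assumed), a load-balancing function can be requested declaratively through the Kubernetes API, here via the official Kubernetes Python client, in the same breath as the application itself:

```python
from kubernetes import client, config

# Reuse the local kubeconfig credentials (assumes an existing, reachable cluster).
config.load_kube_config()

# Declarative intent: a load balancer in front of all pods labelled app=web.
service_manifest = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web", "labels": {"app": "web"}},
    "spec": {
        "type": "LoadBalancer",            # the load-balancing function is requested as data, not via a ticket
        "selector": {"app": "web"},        # membership follows the pods, not their IP addresses
        "ports": [{"port": 80, "targetPort": 8080, "protocol": "TCP"}],
    },
}

client.CoreV1Api().create_namespaced_service(namespace="shop", body=service_manifest)
```

The specific resource matters less than the consumption model: switching, routing, load balancing and security policy all become declarative objects that any of the orchestration platforms listed above can create, update and destroy together with the application.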

Then comes the typical isolation requirement, aka multi-tenancy. There are many aspects to designing and architecting the network of a multi-tenant environment, such as choosing between various control plane technologies (Multi-Protocol BGP, MPLS LDP etc.) for the datacenter and then having to carefully design the current and future scale of the data plane (QinQ, MPLS etc.). For instance, an increase in the number of tenants, each of which gradually but quickly becomes highly virtualised and/or containerised, may cause a huge increase in data plane entries, which may dictate a forklift upgrade or replacement of the networking hardware. In these types of environments the scale of the virtualised/containerised compute layer has a direct impact on the hardware networking layer. Considering service insertion in the data path, such as load balancing and next-generation firewalling, which are mostly provided as centralised functions attached to the spine switches, the impact on the design of the physical network topology and the specific network hardware is even bigger. Conversely, an unexpected capacity (TCAM) limitation in a leaf or spine hardware switch may force a change in the implementation model of the compute layer (e.g. NATed container networks versus routed container networks).

As the single monolithic application has continuously been disaggregated into smaller pieces (first as virtual machines and now as containers), east-west traffic in a datacenter is forecast to exceed 71%. The conventional network design, in which a network function such as routing or firewalling is realised at a centralised component (e.g. the spine switches or a physical firewall), not only yields a suboptimal path with added latency and wasted bandwidth capacity, but also makes troubleshooting much more complex, since the path a packet takes can be really long (e.g. workload to leaf switch to spine switch and then all the way back). A software-based networking solution, on the other hand, can provide fully distributed networking functions (switching, routing, firewalling), which enables fully distributed packet processing for any kind of network function. Routing or firewalling happens right at the source workload, saving processing cycles on the rest of the network devices. Another benefit is being able to scale network and security functions out along with the compute capacity. The last, and often disregarded, value is the decoupling/abstraction of the entire control and data plane of the network functions from the actual physical network topology, which provides huge flexibility and segregation between the different layers that make up the whole datacenter design.

4. Challenges of a Security Administrator in a Datacenter (The Perimeter Firewall Is Not Enough, Firewall Rule Sprawl, Security Around the Application, Inconsistent Security Policies Across Heterogeneous Environments, Bare Metal Security, Zero Trust Model)

A bit of history again. In the 1990s, researchers came up with the idea of the perimeter firewall, which was the obvious thing to do: provide legitimate access to internal users and draw a clear boundary between the corporate network and the internet. Back then, organisations ran their own web, email and other services. Hence the typical attack was for a hacker to port scan, identify which well-known common service ports were allowed by rules on the perimeter firewall, and use those to get in.

In the early 2000s, the increase in mobile workers and the expansion of VPN connections to trusted partners sparked an irreversible transformation of corporate networks. The corporate applications and desktop operating systems in a datacenter began to interact much more with each other; however, since the common services were behind the network perimeter, psychologically everyone felt comfortable enough with that level of control and, surprisingly, no one pushed back against the chatty interaction between the various services in the datacenter. Ironically, using the existing perimeter firewall rules, it was possible to give business partners and vendors ever more access to the invaluable internal network, essentially creating larger holes. Hacking methods evolved along with this: if attackers could figure out who the business partner or vendor was, they could compromise them and then come in through the large hole that had been created. It became pretty obvious that once attackers set foot in the internal network, it was pretty easy for them to move around. Hence the perimeter firewall lost its value and, to many attackers, became much less relevant. Most security practitioners already found themselves hardening each system individually, which in a way eliminated the concept of the perimeter. Sure, the perimeter firewall/security enforcement point was always going to be there, with generic policies.

Then, in the early 2010s, came cloud and mobile connectivity, which only expanded the attack surface for hackers and was the real death sentence for the perimeter. With cloud, content delivery networks became hugely popular and web applications became highly disaggregated. With mobile connectivity, the proliferation of mobile devices and access types reached beyond anyone’s imagination. All these revolutionary changes only made it much easier to access the services in the on-premise datacenter, and made it easy for hackers to move laterally inside the corporate network once they were in. Lateral movement describes the different techniques attackers use to move through a network in search of valuable assets and data. Imagine a building with hundreds of main entrance doors, each secured by guards; if you manage to sneak in, you can freely move around inside the building without further authentication or authorisation.

Forrester Research introduced the term “Zero Trust” to the industry, which is built on the principle of “never trust, always verify”. It is designed to address lateral threat movement within the network by leveraging micro-segmentation and the enforcement of granular perimeters. It is a phenomenal approach that could leapfrog the whole security practice, but the biggest challenge is not the principle itself, it is getting it implemented everywhere.

Because it is so hard to realise the zero trust approach in IT today. A business application is comprised of tens if not hundreds of components, e.g. a bunch of web servers (containers) which need to communicate with a middleware layer made up of virtual machines, which in turn communicates with a physical database server. There are two main issues here. One is enforcing the security controls (data plane) and the other is aligning the security policy to the application (orchestration). Let us explain.

First of all, how will you implement security controls in the data plane? Is it going to be a physical firewall acting as the default gateway of each and every network subnet, and then having to configure IP address/subnet based firewall rules for each and every individual application? If so, how will you enforce security for communication that takes place between two workloads on the same subnet? Will you implement a security agent on each and every type of workload? If so, what happens if someone tampers with that agent, since it shares the same attack surface, namely the guest operating system? What happens if some workloads need to be moved to a different subnet? Will you have to identify all the related firewall rules for that application and change the IP address/subnet definitions in those rules? Can you imagine how hard it is to operationalise such a process/workflow?

Secondly, how will you make sure that the security controls get updated when someone removes a component of that business application, or the whole application, or modifies one of its tiers? How will you adapt your infrastructure to this change? What if you have more than a few firewalls between the web tier and the database tier of that business application? Overall, how can you make sure that the security policy is always fully aligned with the application?

It is a bit of a chicken-and-egg situation, because the root cause of the two main issues mentioned above is the very thing that had to evolve and then disappear after two decades: the “perimeter”. Let us explain.

The traditional, conventional security approach in IT has always been to create separate zones by placing perimeter firewalls around them and aligning the network layer to those zones, so that eventually the zones become the sources and destinations in the security policies (aka firewall rules). In today’s IT, the security policy is aligned solely to the infrastructure, not to the application. So when someone decommissions an application, the security policy unfortunately stays there forever, which basically means that the security policy lifecycle and the application lifecycle are completely separate from each other. This creates a huge firewall rule sprawl in perimeter firewalls.
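A toy Python sketch (not any vendor’s actual API; the tags, subnets and workload names are made up) of why this happens: a rule anchored to IP subnets is tied to the infrastructure and goes stale the moment the application moves or disappears, whereas a rule anchored to application tags is evaluated against the workloads themselves and therefore shares their lifecycle.

```python
from dataclasses import dataclass, field

@dataclass
class Workload:
    name: str
    ip: str
    tags: set[str] = field(default_factory=set)

# Infrastructure-anchored rule: valid only while the app lives in these subnets,
# and it lingers in the firewall long after the app is decommissioned.
ip_rule = {"src": "10.1.1.0/24", "dst": "10.1.2.0/24", "port": 3306, "action": "allow"}

# Application-anchored rule: source and destination are membership criteria, not addresses.
tag_rule = {"src_tags": {"app=shop", "tier=web"},
            "dst_tags": {"app=shop", "tier=db"},
            "port": 3306, "action": "allow"}

def matches(workload: Workload, required_tags: set[str]) -> bool:
    """A workload is in scope if it carries every required tag."""
    return required_tags <= workload.tags

# The web tier was redeployed into a different subnet; its tags did not change.
web = Workload("web-01", "10.9.8.7", {"app=shop", "tier=web"})
db = Workload("db-01", "10.1.2.10", {"app=shop", "tier=db"})

# The tag-based rule still applies after the move; the IP-based rule silently broke.
print(matches(web, tag_rule["src_tags"]), matches(db, tag_rule["dst_tags"]))  # True True
```

Because membership is evaluated against the workloads themselves, decommissioning the application empties the rule’s scope along with it, which is exactly the lifecycle alignment the IP-anchored model lacks.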

At the same time, to be fair to security teams, whenever an application gets provisioned they have always been the last to know what that application is all about, what kind of access it needs, and so on. Ironically, it is the business application that has to be secured, not the zone that the conventional IT security approach has created. Historically, the holes in the perimeter firewalls have only been created to make applications accessible from anywhere. However, once the application is accessible, achieving zero trust requires wrapping the proper security policy around the components of that application, so that if the application is compromised the attacker cannot move laterally through the network. How can an individual first define, and then align, the security policy to the business application? Could application automation, infrastructure automation or the overall orchestration layer help shape the next decade for security? Or the automation of security itself?

Today’s IT infrastructure is so heterogeneous and diverse: there are on-premise datacentres, public clouds and edge environments, and there are legacy physical servers, virtual machines, containers and public cloud instances. In such a landscape, the challenges to tackle remain the same. One is enforcement in the data plane, and the other is orchestration and ongoing automation of the security policy.

Can a unified data plane be extended from VMs to containers to bare metal servers? Is it wise to design the enforcement layer only in the form of agents? Are there other existing layers in the datacentre that can be leveraged?

How can a single, consistent security policy span all these different types of environments and workload types? What if you have multiple perimeter firewalls from different vendors along the path between source and destination workloads of different types? Hang on, isn’t the most efficient way of implementing security to enforce the controls right at the source? Then again, if you have different firewall solutions in place, does that mean having to run and maintain a unicorn-like overarching orchestration layer to update the policies in each firewall layer?
