VMware HCX with Out of Band Management Network

Dumlu Timuralp
Oct 27, 2020

This disclaimer informs readers that the views, thoughts, and opinions expressed in this article belong solely to the author, and not necessarily to the author’s employer, organisation, committee or other group or individual.

VMware HCX is a platform used for workload migration across various clouds.

There are different migration types that can be used depending on the desired level of continuity and resiliency. VMware HCX ports and protocols outlines the specific ports used by each of these migration types.

This article explains an interesting behavior observed during a recent proof of concept.

Proof of Concept Topology

HCX IX (Interconnect) appliance VMs essentially proxy all the migration traffic, which benefits the network engineering team hugely since they no longer have to route each and every subnet between sites. Only the VM Management subnets (red) and the Uplink subnets (blue) are routed to each other between these sites, and that is all.

HCX management and vSphere replication functions are collapsed onto a single subnet (green), and the vMotion function uses its own dedicated subnet (turquoise). Within a site, the VM Management subnet (red) is routed to/from the HCX IX management/replication subnet (green), since that is a requirement for the HCX management plane.

Now, the topic of this article stems from the fact that, for various reasons, this customer uses an out-of-band management subnet for the ESX management vmkernel (vmk0).

The issue manifested itself as bulk migrations working perfectly, but all other migration types (i.e. Cold Migration, vMotion, RAV) failing miserably.

Some of the error messages seen on vCenter and HCX are shown below.

After spending time identifying the behaviour, the first clue regarding the observed vMotion and RAV migration failures came from this KB article. The KB article basically highlights two things:

  • In a vMotion event, the vMotion protocol is used for copying hot data such as the memory pages of the VM, but the NFC protocol is also used for copying the cold data (vmx logs, snapshots),
  • NFC traffic is sent on the management vmkernel (usually vmk0) by default, unless a dedicated vmkernel is explicitly configured with the “Provisioning” service.

The first bullet makes it quite clear that even the vMotion and RAV migration types use the NFC protocol to copy certain files from source to destination.

The second bullet has a bigger impact on the given topology, though, because the ESX management subnet (yellow) is not routed to the HCX IX management/replication subnet (green) in the customer’s environment.
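
A quick way to check where NFC will land on a given ESX Host is to list the service tags on each vmkernel interface from the ESXi shell. The sketch below is generic and uses the interface names from the topology in this article; it is not output from the customer’s environment.

    # List the services tagged on the management vmkernel (vmk0)
    esxcli network ip interface tag get -i vmk0

    # List the services tagged on the HCX management/replication vmkernel (vmk2)
    esxcli network ip interface tag get -i vmk2

    # List all vmkernel interfaces and their IPv4 configuration for reference
    esxcli network ip interface ipv4 get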

The workaround discussed with the customer was to put the NFC traffic onto the vmk2 interface of the ESX Hosts. That is done by configuring the “Provisioning” service as an additional service on the vmk2 interface, alongside the “vSphere Replication” and “vSphere Replication NFC” services.
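
From the ESXi shell, that change would look roughly like the sketch below (the same can also be done in the vSphere Client by editing the VMkernel adapter and enabling the “Provisioning” service). The exact esxcli tag name for the Provisioning service is an assumption here; verify it against the tag names your ESXi build accepts before running it.

    # Enable the "Provisioning" service on vmk2 (tag name assumed; check the
    # accepted tag names on your build first)
    esxcli network ip interface tag add -i vmk2 -t Provisioning

    # Confirm vmk2 now carries Provisioning alongside the replication services
    esxcli network ip interface tag get -i vmk2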

At this stage, since both the HCX IX VM’s vNIC_0 and the ESX Hosts’ vmk2 are connected to the same subnet, the NFC traffic should flow successfully between the HCX IX VM vNIC_0 and the ESX Host vmk2. Even better, not only the vMotion and RAV migration types but also Cold Migration should work with this configuration, since it relies on NFC as well.

However, the tests proved otherwise. How come?

Looking at the packet captures on the ESX Hosts (taken with pktcap-uw), it turns out the HCX IX VM still sends the NFC traffic to the management vmkernel interface (vmk0) of the ESX Hosts, not to vmk2, which is explicitly configured with the “Provisioning” service that should own the NFC traffic.
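
For reference, the captures looked roughly like the sketch below, run on an ESX Host. NFC rides on TCP 902; the output file paths are placeholders and the exact options may vary by ESXi release.

    # Capture NFC traffic (TCP 902) arriving on the management vmkernel (vmk0)
    pktcap-uw --vmk vmk0 --tcpport 902 -o /tmp/vmk0-nfc.pcap

    # In a second session, capture on vmk2 to confirm that no NFC arrives there
    pktcap-uw --vmk vmk2 --tcpport 902 -o /tmp/vmk2-nfc.pcap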

Apparently, the “Provisioning” network/vmkernel (vmk2) on an ESX Host is used only if the other side is also configured with it. In this case the “other side” is the HCX IX appliance VM, and it is not configured with the “Provisioning” service, since that is not exposed as an option in the HCX platform. Hence the NFC traffic falls back to using the management network interface on both the HCX IX VM (vNIC_0) and the ESX Host (vmk0).

Special thanks to @Nathan Prziborowski for demystifying the above point. This detail does not seem to be mentioned in the public documentation. Documentation feedback has already been submitted through the VMware HCX docs page.

The above finding made routing between the HCX IX management/replication subnet (green) and the ESX Host management subnet (yellow) a must to configure and put in place.
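
Once that routing is in place, a basic reachability check from the host’s management vmkernel towards the HCX IX management/replication IP confirms the path. The address below is a placeholder, and ICMP needs to be permitted for this test to be meaningful.

    # Verify that vmk0 (out-of-band management, yellow) can reach the HCX IX
    # management/replication interface (green); 192.0.2.10 is a placeholder IP
    vmkping -I vmk0 192.0.2.10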

Last but not least, the NFC traffic owned by the “Provisioning” service and the NFC traffic owned by the “vSphere Replication NFC” service are used in different ways. A couple of details are highlighted below.

  • NFC traffic owned by the “Provisioning” service is used by the Cold Migration, vMotion and Replication Assisted vMotion migration types. This NFC traffic is always sent on the management vNIC (vNIC_0) of the HCX IX VM, and it is always sent to the management vmkernel interface (usually vmk0) of the ESX Hosts.
  • NFC traffic owned by the “vSphere Replication NFC” service is used by the Bulk Migration and Replication Assisted vMotion migration types. This traffic is also always sent on the management vNIC (vNIC_0) of the HCX IX VM. However, it is capable of reaching the correct vmkernel on the ESX Host, the one configured with “vSphere Replication NFC” (which is vmk2 in this customer’s environment).
  • In addition to the above bullet, even when the HCX IX VM is configured with an additional/dedicated vNIC for vSphere Replication (for instance vNIC_3, as an additional vNIC in the topology explained in this article), the NFC traffic owned by “vSphere Replication NFC” is still sent on the management vNIC (vNIC_0) of the HCX IX VM, as mentioned in the HCX User Guide here.
