Background

In this post I want to share the options for manually removing problematic ESXi based Transport Nodes running NSX-T 2.4+ from NSX Manager, including the node's NSX configuration and the accompanying VIBs. To be clear, this should be your last resort if the regular procedures to remove ESXi based Transport Nodes from NSX Manager no longer work. In production environments I would even suggest performing these actions only when asked to by VMware Tech Support.

I started this post because of issues I stumbled upon when upgrading NSX-T in my homelab. It builds on articles written by fellow VMware community members that really helped me troubleshoot the issues. So credit goes to the original authors: Wesley Geelhoed for his post “Uninstall sequence NSX-T VIBs (2.2)” and Manny Sidhu for his post “NSX-T Error – Failed to uninstall the software on host”, including the useful reply by Ruurd Bakker. Lastly, the rather old NSX-T 2.2 documentation page “Remove a host from NSX-T” provided some good background info.

The post by Wesley and the NSX-T documentation page are both based on version 2.2. Since then, some of the VIB modules have changed and others have been added in NSX-T 2.4 and 2.5. So, from my perspective, a new post that puts it all together is needed.

The issue

In my homelab I have a 4-node nested ESXi cluster that ran NSX-T 2.4.2 without problems before the upgrade. Some time after 2.5 was released, it was time to upgrade. After upgrading to the 2.5 GA version, one host had issues. The culprit host no longer joined the Transport Zone and could not connect to the NSX Manager either. This was probably the result of me not checking the warnings in the Pre-Check stage during the Transport Node upgrade, because it’s “just” a home lab :-).

In hindsight, the root volume and the /tmp mountpoint of the ESXi host probably did not have sufficient free space. During troubleshooting I could not get NSX to function properly on that host again. Even the “Remove NSX” option in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu, which should reset the configuration, failed.

My last resort was to remove all the NSX related configuration and modules from the culprit host. While searching for how to do this, I stumbled upon the articles mentioned above. I will describe all options in this post. All of them apply only to ESXi based Transport Nodes running NSX-T 2.4+.

Solutions

To solve the issue, a couple of solutions are available. If possible, put the node to be removed in “Maintenance Mode” and move the VMs off it first, then reboot the node after performing one of the solutions.

The best order in which to try manually removing Transport Nodes is shown below. Start with the simplest solution and work down to the most complex and rigorous one, which uses ESXi built-in tools.

  • Using NSX Manager
  • Using NSXCLI
  • Using ESXi native tools
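
The order above can be read as a small decision helper. The Python sketch below is only an illustration: the function name and the three yes/no inputs are my own, and the conditions are my reading of the sections in this post (NSXCLI is preferred while all NSX-T modules are still present, and the ESXi native tools need an intact bootbank).

```python
from typing import Optional


def choose_removal_method(remove_nsx_works: bool,
                          nsx_modules_present: bool,
                          bootbank_ok: bool) -> Optional[str]:
    """Return the first method from the list that is still usable.

    Hypothetical helper; the order follows the list in this post.
    """
    if remove_nsx_works:
        return "Using NSX Manager"
    if nsx_modules_present:
        return "Using NSXCLI"
    if bootbank_ok:
        return "Using ESXi native tools"
    # This case is not covered by the post.
    return None


# "Remove NSX" fails, but the NSX-T modules are still on the node.
print(choose_removal_method(False, True, True))
```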

Pay special attention if you have nodes with only 2 NICs. In that case there is only one N-VDS, with all VMkernel ports connected to it. When you perform one of the solutions below on such a node, all network connectivity will stop functioning. Connect to the console of the node first. In the node's DCUI you can reset the network config to default and restore management connectivity to the node.

Using NSX Manager

If issues arise on an ESXi based Transport Node, general troubleshooting is the first step towards a solution. The NSX-T v2.3 Troubleshooting Guide can be helpful.

The regular way to remove NSX from specific Transport Nodes (TNs) is the “Remove NSX” option in NSX Manager. It can be found in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu.

The “Remove NSX” option is only available for specific nodes if no “Transport Node Profile” is attached to the cluster the node belongs to. More on that in the next section. When selecting the “Remove NSX” option, the “Delete Transport Node” screen appears with 2 options.

  1. Uninstall NSX Components
  2. Force Delete

When nothing is checked, it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node's local config, but it will not uninstall the VIBs. An error is thrown if VMs are still attached to the N-VDS.

When selecting “Uninstall NSX Components” only, it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node's local config and uninstalls the VIBs. An error is thrown if VMs are still attached to the N-VDS.

When selecting “Force Delete” only, it will forcefully remove the TN configuration in NSX Manager. This option is very useful when the node cannot be recovered anymore and/or is not reachable over the management interface. If the node gets into the “Orphaned” state, the action needs to be run twice before the TN config is fully cleaned up in NSX Manager.

When selecting both, it will forcefully remove the TN configuration in NSX Manager. If the node is reachable over the management interface it also clears the node's local config and uninstalls the VIBs, even if VMs are still attached to the N-VDS. In most cases you do not need both options selected to remove a faulty TN.
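
The four combinations can be summarised as a small truth table. The Python sketch below models only what is described above and is not an NSX API; the class and function names are hypothetical. Where the text does not say what happens (VMs attached with “Force Delete” only), the value is left as None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RemoveNsxOutcome:
    manager_cleanup: str                    # "safe" or "forced"
    clears_node_config: bool                # node local config is cleared
    uninstalls_vibs: bool                   # NSX VIBs are removed from the node
    fails_if_vms_attached: Optional[bool]   # None: not stated in this post


def remove_nsx_outcome(uninstall_components: bool,
                       force_delete: bool,
                       node_reachable: bool = True) -> RemoveNsxOutcome:
    """Model of the "Delete Transport Node" checkboxes as described above."""
    manager_cleanup = "forced" if force_delete else "safe"
    if force_delete and not uninstall_components:
        # "Force Delete" only: the post describes cleanup in NSX Manager only.
        return RemoveNsxOutcome(manager_cleanup, False, False, None)
    return RemoveNsxOutcome(
        manager_cleanup=manager_cleanup,
        # The node itself is only touched when it is reachable.
        clears_node_config=node_reachable,
        uninstalls_vibs=node_reachable and uninstall_components,
        # Attached VMs block the safe variants, not the forced one.
        fails_if_vms_attached=not force_delete,
    )


print(remove_nsx_outcome(uninstall_components=True, force_delete=False))
```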

Transport Node Profiles

When a “Transport Node Profile” is attached to the cluster the faulty node resides in, the “Remove NSX” option is not available for that specific node, but only for the whole cluster. In this case detach the TN profile from the cluster in the “Actions” menu.

When detaching the “Transport Node Profile” from a cluster it should have no impact on data plane traffic within the cluster. Do not confuse the Detach option with the “Remove NSX” option!

Using NSXCLI

To remove a faulty TN, follow the general steps in the NSX-T 2.2 Install Guide, which mostly still apply to version 2.5. If all the NSX-T modules are still present on the node, prefer this method over the ESXi native tools described in the next section.

On the NSX Manager get the thumbprint.

manager> get certificate api thumbprint

Detach the Transport Node from NSX Manager if possible.

[root@esxi-01a:~] nsxcli
host> detach management-plane <MANAGER> username <ADMIN-USER> password <ADMIN-PASSWORD> thumbprint <MANAGER-THUMBPRINT>

Remove NSX-T filters. If the command is run without the -Override switch an error is displayed first.

[root@esxi-01a:~] vsipioctl clearallfilters
ERROR: Command clearallfilters is dangerous and can cause unintended consequence!
ERROR: Please supply first option -Override in order to override the safety guard and actually run the command.
[root@esxi-01a:~] vsipioctl clearallfilters -Override 
Removing all vmware-sfw filters...
Cleared dvfilter include table
Updated all VMs to remove filters.
Destroyed all filters (please ignore 'Function not implemented' error if there is).

Stop the netcpa service.

[root@esxi-01a:~] /etc/init.d/netcpad stop

Remove all NSX-T VIBs and the relevant config, and reboot the host afterwards. In NSX-T 2.4 the “del nsx” command is executed without warning, in contrast to 2.5, which displays the warning below.

[root@esxi-01a:~] nsxcli
esxi-01a> del nsx
 WARNING: Use this command as last resort ONLY when deleting through NSXT UI or API is not working!
 Please read documentation for 'Remove a Host from NSX-T Data Center or Uninstall NSX-T Data Center Completely' before executing this command.
 Are you sure you want to delete NSX environment on host? (yes/no) 

If the faulty node is still in a configured state in NSX Manager, perform the “Remove NSX” option (described in the previous section) with the “Force Delete” option selected. This will clean up its configuration in NSX Manager.

Using ESXCLI

When the above options are not possible for whatever reason, a manual cleanup may be needed. A manual cleanup can be performed using standard ESXi based tools if the ESXi bootbank is still okay.

The config that is normally removed by NSX Manager or NSXCLI (besides the VIBs) is:

  • The NSX VMkernel ports (vxlan and hyperbus)
  • The NSX network IO filters
  • The NSX IP stacks (vxlan and hyperbus)

So, how do you perform a manual cleanup? First check whether the network IO filters can be removed. See the previous section for how to do this.

Secondly, delete the NSX VMkernel ports. Normally vmk10 is the vxlan kernel port (despite the name, it actually carries the GENEVE overlay) and vmk50 is the hyperbus kernel port.

You might wonder what the hyperbus is. The hyperbus is the component that performs the actual network auto-plumbing. In the image below this is the Intra T0/T1 routing (#4 in the image) between the Distributed Router (DR) and Service Router (SR) components (169.254.0.0 subnet) and the Inter T0/T1 routing (#3 in the image) between the T1 SR and the T0 DR (100.64.0.0 subnet).

NSX-T Hyperbus overview

List the kernel ports.

[root@esxi-01a:~] esxcli network ip interface list
 …
 vmk10
    Name: vmk10
    MAC Address: 00:50:56:ab:cd:ef
    Enabled: true
    Portset: DvsPortset-1
    Portgroup: N/A
    Netstack Instance: vxlan
    VDS Name: ndvSwitch0
    VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
    VDS Port: 10
    VDS Connection: 1543783261
    Opaque Network ID: N/A
    Opaque Network Type: N/A
    External ID: N/A
    MTU: 1600
    TSO MSS: 65535
    RXDispQueue Size: 1
    Port ID: 67108870
 vmk50
    Name: vmk50
    MAC Address: 00:50:56:ab:cd:ef
    Enabled: true
    Portset: DvsPortset-1
    Portgroup: N/A
    Netstack Instance: hyperbus
    VDS Name: ndvSwitch0
    VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
    VDS Port: 0d02d084-213c-491e-9bb5-1ab895e98b5d
    VDS Connection: 1543783261
    Opaque Network ID: N/A
    Opaque Network Type: N/A
    External ID: N/A
    MTU: 1500
    TSO MSS: 65535
    RXDispQueue Size: 1
    Port ID: 67108871

Remove the NSX VMkernel ports

[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk10
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk50

Now list the netstacks

 [root@esxi-01a:~] esxcli network ip netstack list
 defaultTcpipStack
    Key: defaultTcpipStack
    Name: defaultTcpipStack
    State: 4660
 vxlan
    Key: vxlan
    Name: vxlan
    State: 4660
 hyperbus
    Key: hyperbus
    Name: hyperbus
    State: 4660

Remove the NSX related netstacks

[root@esxi-01a:~] esxcli network ip netstack remove --netstack=vxlan
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=hyperbus
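
Instead of assuming the ports are always vmk10 and vmk50, the NSX VMkernel ports can also be picked by their “Netstack Instance”. The Python sketch below parses the text output of the two list commands above and prints the matching remove commands, ports first and netstacks second. It is only an illustration: the function names are my own, the sample listings are abbreviated (the vmk0 entry is a made-up example of a non-NSX port), and it only prints the commands rather than running them.

```python
from typing import List

NSX_NETSTACKS = ("vxlan", "hyperbus")


def nsx_vmkernel_ports(interface_listing: str) -> List[str]:
    """Names of VMkernel ports whose 'Netstack Instance' is an NSX netstack."""
    ports, name = [], None
    for raw in interface_listing.splitlines():
        line = raw.strip()
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Netstack Instance:"):
            if name and line.split(":", 1)[1].strip() in NSX_NETSTACKS:
                ports.append(name)
    return ports


def nsx_netstacks(netstack_listing: str) -> List[str]:
    """NSX netstacks found in the 'Key:' lines of the netstack listing."""
    keys = [line.split(":", 1)[1].strip()
            for line in netstack_listing.splitlines()
            if line.strip().startswith("Key:")]
    return [key for key in keys if key in NSX_NETSTACKS]


def cleanup_commands(interface_listing: str, netstack_listing: str) -> List[str]:
    """VMkernel ports first, then the netstacks, as in the steps above."""
    commands = [f"esxcli network ip interface remove --interface-name={port}"
                for port in nsx_vmkernel_ports(interface_listing)]
    commands += [f"esxcli network ip netstack remove --netstack={stack}"
                 for stack in nsx_netstacks(netstack_listing)]
    return commands


# Abbreviated sample listings; vmk0 is an assumed non-NSX management port.
interfaces = """\
vmk0
   Name: vmk0
   Netstack Instance: defaultTcpipStack
vmk10
   Name: vmk10
   Netstack Instance: vxlan
   VDS Name: ndvSwitch0
vmk50
   Name: vmk50
   Netstack Instance: hyperbus
   VDS Name: ndvSwitch0
"""
netstacks = """\
defaultTcpipStack
   Key: defaultTcpipStack
   Name: defaultTcpipStack
vxlan
   Key: vxlan
   Name: vxlan
hyperbus
   Key: hyperbus
   Name: hyperbus
"""
for command in cleanup_commands(interfaces, netstacks):
    print(command)
```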

Now check whether the N-VDS still exists. If so, it affects the removal of the actual VIBs later in this section.

[root@esxi-01a:~] esxcfg-vswitch -l

The last step is to remove all the NSX related VIBs that still exist on the node. The VIBs cannot be removed in an arbitrary order, since they depend on each other.

First list the VIBs to be removed from the node. The full list of VIBs normally present on an ESXi based node is shown below.

[root@host:~] esxcli software vib list | grep nsx
nsx-adf                        2.5.1.0.0-6.7.15314402                VMware  VMwareCertified   2020-01-03
 nsx-aggservice                 2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-cli-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-common-libs                2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-context-mux                2.5.1.0.0esx67-15314456               VMware  VMwareCertified   2020-01-03
 nsx-esx-datapath               2.5.1.0.0-6.7.15314311                VMware  VMwareCertified   2020-01-03
 nsx-exporter                   2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-host                       2.5.1.0.0-6.7.15314289                VMware  VMwareCertified   2020-01-03
 nsx-metrics-libs               2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-mpa                        2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-nestdb-libs                2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-nestdb                     2.5.1.0.0-6.7.15314393                VMware  VMwareCertified   2020-01-03
 nsx-netcpa                     2.5.1.0.0-6.7.15314440                VMware  VMwareCertified   2020-01-03
 nsx-netopa                     2.5.1.0.0-6.7.15314363                VMware  VMwareCertified   2020-01-03
 nsx-opsagent                   2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-platform-client            2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-profiling-libs             2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-proxy                      2.5.1.0.0-6.7.15314435                VMware  VMwareCertified   2020-01-03
 nsx-python-gevent              1.1.0-9273114                         VMware  VMwareCertified   2018-12-02
 nsx-python-greenlet            0.4.9-12819723                        VMware  VMwareCertified   2019-09-20
 nsx-python-logging             2.5.1.0.0-6.7.15314402                VMware  VMwareCertified   2020-01-03
 nsx-python-protobuf            2.6.1-12818951                        VMware  VMwareCertified   2019-09-20
 nsx-rpc-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-sfhc                       2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-shared-libs                2.5.1.0.0-6.7.15036308                VMware  VMwareCertified   2020-01-03
 nsx-upm-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-vdpi                       2.5.1.0.0-6.7.15314422                VMware  VMwareCertified   2020-01-03
 nsxcli                         2.5.1.0.0-6.7.15314296                VMware  VMwareCertified   2020-01-03
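
If you want to work with this listing in a script, the name and version are simply the first two whitespace separated columns. The Python sketch below is a minimal parser under that assumption. The function name is hypothetical, and the sample includes the column header that esxcli prints when the output is not piped through grep.

```python
from typing import List, Tuple


def parse_vib_list(listing: str, prefix: str = "nsx") -> List[Tuple[str, str]]:
    """Return (name, version) pairs for VIBs whose name starts with prefix."""
    vibs = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the column header and its dashed separator.
        if len(fields) < 2 or fields[0] == "Name" or set(fields[0]) == {"-"}:
            continue
        if fields[0].startswith(prefix):
            vibs.append((fields[0], fields[1]))
    return vibs


# Three rows taken from the listing above.
sample = """\
Name                           Version                               Vendor  Acceptance Level  Install Date
-----------------------------  ------------------------------------  ------  ----------------  ------------
nsx-adf                        2.5.1.0.0-6.7.15314402                VMware  VMwareCertified   2020-01-03
nsx-python-gevent              1.1.0-9273114                         VMware  VMwareCertified   2018-12-02
nsxcli                         2.5.1.0.0-6.7.15314296                VMware  VMwareCertified   2020-01-03
"""
for name, version in parse_vib_list(sample):
    print(name, version)
```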

Now remove the VIBs from the node in the correct order. The last VIB to be removed, “nsx-esx-datapath”, cannot be removed with the plain command if the N-VDS is still present on the node. Only for that VIB the “--no-live-install” switch must be added. Run the command for every VIB to be removed.

[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
 nsx-host
 nsx-adf    
 nsx-exporter
 nsx-aggservice
 nsx-platform-client
 nsx-python-logging
 nsx-opsagent
 nsx-proxy
 nsx-nestdb
 nsx-sfhc
 nsx-context-mux
 nsx-python-protobuf
 nsx-python-greenlet
 nsx-python-gevent
 nsxcli
 nsx-netopa (2.5 only)
 nsx-netcpa
 nsx-profiling-libs
 nsx-mpa
 nsx-vdpi
 nsx-nestdb-libs
 nsx-rpc-libs
 nsx-metrics-libs
 nsx-upm-libs
 nsx-common-libs
 nsx-shared-libs
 nsx-cli-libs
 nsx-esx-datapath --no-live-install

Without the “--no-live-install” switch, an error is thrown if an “opaque” portgroup connected to an N-VDS is present on the node.

NSX-T Remove datapath VIB
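
Since not every node will have every VIB (nsx-netopa, for example, exists in 2.5 only), the removal order can be applied to whatever is actually installed. The Python sketch below does that, assuming the order listed above is correct for your version. The function name is my own and “nsx-future-vib” is a made-up name that shows how VIBs missing from the order list are reported. It only prints the commands and does not run them.

```python
from typing import Iterable, List, Tuple

# Removal order as listed in this post (nsx-netopa exists in 2.5 only).
REMOVAL_ORDER = [
    "nsx-host", "nsx-adf", "nsx-exporter", "nsx-aggservice",
    "nsx-platform-client", "nsx-python-logging", "nsx-opsagent",
    "nsx-proxy", "nsx-nestdb", "nsx-sfhc", "nsx-context-mux",
    "nsx-python-protobuf", "nsx-python-greenlet", "nsx-python-gevent",
    "nsxcli", "nsx-netopa", "nsx-netcpa", "nsx-profiling-libs", "nsx-mpa",
    "nsx-vdpi", "nsx-nestdb-libs", "nsx-rpc-libs", "nsx-metrics-libs",
    "nsx-upm-libs", "nsx-common-libs", "nsx-shared-libs", "nsx-cli-libs",
    "nsx-esx-datapath",
]
# Only this VIB gets the extra switch, as described above.
NO_LIVE_INSTALL = {"nsx-esx-datapath"}


def removal_plan(installed: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (commands, unknown).

    commands: ordered remove commands for the installed VIBs.
    unknown:  installed VIBs that the order list does not cover.
    """
    present = set(installed)
    commands = []
    for name in REMOVAL_ORDER:
        if name not in present:
            continue
        command = f"esxcli software vib remove -n {name}"
        if name in NO_LIVE_INSTALL:
            command += " --no-live-install"
        commands.append(command)
    unknown = sorted(present - set(REMOVAL_ORDER))
    return commands, unknown


commands, unknown = removal_plan(
    ["nsx-esx-datapath", "nsxcli", "nsx-host", "nsx-future-vib"])
for command in commands:
    print(command)
print("not in the order list:", unknown)
```

Anything reported as unknown needs a manual look at its dependencies before it is removed.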

To conclude

Hopefully this post shows which options are available to get ESXi based Transport Nodes back to a working state when things go bad. This post is not intended to troubleshoot or fix TN related issues, but rather to show how to clean up a node and start with a fresh NSX-T config and VIBs without re-installing it.

To wrap it all up, I would like to thank Manny and Wesley, the writers of the articles I based mine upon. Lastly, beware of the consequences of performing one of the above options in a live environment. Contact VMware support if you are not sure what the impact may be.

Useful links

Uninstall sequence NSX-T VIBs (2.2)

NSX-T Error – Failed to uninstall the software on host. MPA not working. Host is disconnected (2.4)

NSX-T Data Center Troubleshooting Guide (2.3)

Remove a Host From NSX-T or Uninstall NSX-T Completely. Part of the NSX-T Install Guide (2.2)


