Background

In this post I want to share the options for manually removing problematic ESXi based Transport Nodes running NSX-T 2.4+ from NSX Manager, including the node’s NSX configuration and the accompanying VIBs. To be clear, this should be your last resort if the regular procedures to remove ESXi based Transport Nodes from NSX Manager no longer work. In production environments I would even suggest performing these actions only when asked to by VMware Tech Support.

I started this post because of issues I stumbled upon when upgrading NSX-T in my homelab. It builds on articles written by fellow VMware community members that really helped me troubleshoot the issues. So credits go to the original authors: Wesley Geelhoed for his post “Uninstall sequence NSX-T VIBs (2.2)” and Manny Sidhu for his post “NSX-T Error – Failed to uninstall the software on host”, including the useful reply by Ruurd Bakker. Lastly, the rather old NSX-T 2.2 documentation page “Remove a host from NSX-T” provided some good background info.

The post by Wesley and the NSX-T documentation page are both based on version 2.2. Since then, some of the VIB modules have changed in NSX-T 2.4 and 2.5 and others have been added. So, from my perspective, a new post that puts it all together is needed.

Update 19-4-2021

  • Post updated for NSX-T 3.x which has other UI options, CLI output and VIB list / de-install order

The issue

In my homelab I have a 4-node nested ESXi cluster running NSX-T 2.4.2, which worked fine before the upgrade. Some time after 2.5 was released, it was time to upgrade. After upgrading to the 2.5 GA version, one host had issues. The culprit host no longer joined the Transport Zone and could not connect to the NSX Manager either. This was probably the result of me not checking the warnings in the Pre-Check stage of the Transport Node upgrade, because it’s “just” a home lab :-).

In hindsight, the root volume and /tmp mountpoint of the ESXi host probably did not have sufficient free space. During troubleshooting I could not get NSX to function properly on that host anymore. Even resetting the configuration with the “Remove NSX” option in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu failed.

My last resort was to remove all the NSX related configuration and modules from the culprit host. While searching for how to do that, I stumbled upon the articles mentioned above. I will describe all options in this post. All of them only apply to ESXi based Transport Nodes running NSX-T 2.4, 2.5 and 3.x.

Solutions

To solve the issue, a couple of solutions are available. If possible, put the node to be removed in “Maintenance Mode” and move the VMs off it first, then reboot the node after performing one of the solutions.
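The maintenance mode state can also be checked from the ESXi shell before touching anything. The snippet below is a minimal sketch, assuming that “esxcli system maintenanceMode get” prints either Enabled or Disabled; the two function names are made up for this example.

```shell
# Succeeds when the given state string means maintenance mode is on.
is_maintenance_enabled() {
    [ "$1" = "Enabled" ]
}

# Guard for the cleanup steps: fails on a host that is still in service.
require_maintenance_mode() {
    # Assumption: this command prints exactly "Enabled" or "Disabled".
    state=$(esxcli system maintenanceMode get)
    if ! is_maintenance_enabled "$state"; then
        echo "Host is not in maintenance mode (state: $state)" >&2
        return 1
    fi
}

# Usage on the host:
#   esxcli system maintenanceMode set --enable true
#   require_maintenance_mode && echo "OK to continue"
```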

The best order to manually remove Transport Nodes is shown below. Start with the simplest solution and work down to the more complex and rigorous one using ESXi built-in tools.

  • Using NSX Manager
  • Using NSXCLI
  • Using ESXi native tools

Pay special attention if you have nodes with only 2 NICs. In that case there is only one N-VDS, with all VMkernel ports connected to it. When you perform one of the solutions below on such a node, all network connectivity will stop functioning. Connect to the console of the node first. In the node’s DCUI you can reset the network config to default and restore management connectivity to the node.

Using NSX Manager

If issues arise on an ESXi based Transport Node (TN), general troubleshooting is the first step towards a solution. The NSX-T v2.3 Troubleshooting Guide or NSX-T 3.x Installation Guide can be helpful.

The install guide also mentions:

If NSX Intelligence is also deployed on the host, uninstallation of NSX-T Data Center will fail because all transport nodes become part of a default network security group. To successfully uninstall NSX-T Data Center, you also need to select the Force Delete option before proceeding with uninstallation.

The regular solution to remove NSX from a specific TN is to use the “Remove NSX” option in NSX Manager. The “Remove NSX” option can be found in NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu.

The “Remove NSX” option is only available if no “Transport Node Profile” is attached to the cluster the node belongs to. More on that in the next section, “Transport Node Profiles”. When selecting the “Remove NSX” option, the “Delete Transport Node” screen appears with 2 options.

  1. Uninstall NSX Components (NSX-T 2.x only)
  2. Force Delete
NSX-T 2.x – Delete Transport Node
NSX-T 3.x – Delete Transport Node

When nothing is checked, NSX Manager tries to remove the TN configuration in a safe way. If the node is reachable over the management interface, it also clears the node’s local config, but it will not de-install the VIBs. An error is thrown if VMs are still attached to the N-VDS or VDS.

Validation Error shown when VMs are still attached to the VDS

When selecting “Uninstall NSX Components” only (NSX-T 2.x), NSX Manager tries to remove the TN configuration in a safe way. If the node is reachable over the management interface, it also clears the node’s local config and de-installs the VIBs. An error is thrown if VMs are still attached to the N-VDS.

When selecting “Force Delete” only, the TN configuration is forcefully removed from NSX Manager. This option is very useful when the node cannot be recovered anymore and/or is not reachable over the management interface. If the node gets into the “Orphaned” state, the action needs to be run twice before the TN config is fully cleaned up in NSX Manager.

When selecting both (NSX-T 2.x), the TN configuration is forcefully removed from NSX Manager. If the node is reachable over the management interface, it also clears the node’s local config and de-installs the VIBs, even if VMs are still attached to the N-VDS. In most cases you would not need to select both options to remove a faulty TN.
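For scripted cleanups, the two checkboxes appear to have counterparts in the NSX-T REST API. The sketch below only builds the request URL. It assumes that DELETE /api/v1/transport-nodes/<node-id> accepts the query parameters force and unprepare_host, and that these map to “Force Delete” and “Uninstall NSX Components” respectively. I did not test this for this post, so verify it against the API guide of your NSX-T version. The function name, manager address and node ID are placeholders.

```shell
# tn_delete_url MANAGER NODE_ID FORCE UNPREPARE_HOST
# Prints the URL for a transport node delete request.
# Assumed mapping: FORCE          -> "Force Delete" checkbox
#                  UNPREPARE_HOST -> "Uninstall NSX Components" checkbox
tn_delete_url() {
    printf 'https://%s/api/v1/transport-nodes/%s?force=%s&unprepare_host=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Usage (hypothetical manager name and node ID), "Force Delete" only:
#   curl -k -u admin -X DELETE "$(tn_delete_url nsxmgr.lab.local <node-id> true false)"
```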

Transport Node Profiles

When a “Transport Node Profile” is attached to the cluster the faulty node resides in, the “Remove NSX” option is not available for that specific node, but only for the whole cluster. In this case use the “Detach Transport Node Profile” option in the “Actions” menu.

Detaching the “Transport Node Profile” from a cluster has no impact on data plane traffic within the cluster. Do not confuse the Detach option with the “Remove NSX” option, which will remove the configuration and VIBs from all the nodes in the cluster!

Using NSXCLI

To remove a faulty TN, follow the general steps in the NSX-T 2.2 Install guide or NSX-T 3.x Installation Guide, which mostly still apply to the 2.5+ versions. If all the NSX-T modules are still present on the node, this method is preferred over using the ESXi native tools in the next section.

On the NSX Manager get the thumbprint.

manager> get certificate api thumbprint

Detach the Transport Node from NSX Manager if possible.

[root@esxi-01a:~] nsxcli
host> detach management-plane <MANAGER> username <ADMIN-USER> password <ADMIN-PASSWORD> thumbprint <MANAGER-THUMBPRINT>

Remove NSX-T filters. If the command is run without the -Override switch an error is displayed first.

[root@esxi-01a:~] vsipioctl clearallfilters
ERROR: Command clearallfilters is dangerous and can cause unintended consequence!
ERROR: Please supply first option -Override in order to override the safety guard and actually run the command.
[root@esxi-01a:~] vsipioctl clearallfilters -Override 
Removing all vmware-sfw filters...
Cleared dvfilter include table.
Updated all VMs to remove filters.
Destroyed all disconnected filters (please ignore 'Function not implemented' error if there is).

Stop the netcpa service (NSX-T 2.x).

[root@esxi-01a:~] /etc/init.d/netcpad stop

Remove all NSX-T VIBs and the relevant config, and reboot the host afterwards. In NSX-T 2.4 the “del nsx” command is executed without warning, in contrast to 2.5 and higher, which display a warning like the one below.

[root@esxi-01a:~] nsxcli
esxi-01a> del nsx
**** STOP STOP STOP STOP STOP ****

Carefully read the requirements and limitations of this command:

1. Read NSX-T documentation for 'Remove a Host from NSX-T Data Center or Uninstall NSX-T Data Center Completely'.

2. Deletion of this Transport Node from the NSX-T UI or API failed, and this is the last resort.

3. If this is an ESXi host:
   a. The host must be in maintenance mode.
   b. All resources attached to NSXPGs must be moved out. 

If the above conditions for ESXi hosts are not met, the command WILL fail.

4. If this is a Linux host:
   a. If KVM is managing VM tenants then shut them down before running this command.
   b. This command should be run from the host console and may fail if run from an SSH client
    or any other network based shell client.
   c. The 'nsxcli -c del nsx' form of this command is not supported
5. For command progress check /scratch/log/nsxcli.log on ESXi host or /var/log/nsxcli.log on non-ESXi host. 

Are you sure you want to remove NSX-T on this host? (yes/no) 

If the faulty node is still in a configured state in NSX Manager, perform the “Remove NSX” option (described in the previous section) with the “Force Delete” option selected. This will clean up its configuration in NSX Manager.

Using ESXCLI

When none of the above options is possible for whatever reason, a manual cleanup may be needed. It can be performed using standard ESXi based tools, provided the ESXi bootbank is still okay.

The config that is normally removed by NSX Manager or NSXCLI (besides the VIBs) is:

  • The NSX VMkernel ports (vxlan and hyperbus)
  • The NSX network IO filters
  • The NSX IP stacks (vxlan and hyperbus)

So, how do you perform a manual cleanup? First try to remove the network IO filters. See the previous section for how to perform this action.

Secondly, delete the NSX VMkernel ports. Normally vmk10 is the vxlan kernel port (despite the name, it’s actually the GENEVE overlay) and vmk50 is the hyperbus kernel port.

You might wonder what the hyperbus is. The hyperbus is the component that performs the actual network auto-plumbing. In the image below this is the Intra T0/T1 routing (#4 in the image) between the Distributed Router (DR) and Service Router (SR) components (169.254.0.0 subnet) and the Inter T0/T1 routing (#3 in the image) between T1 SR and T0 DR (100.64.0.0 subnet).

NSX-T Hyperbus overview

List the kernel ports.

[root@esxi-01a:~] esxcli network ip interface list
 …
 vmk10 
    Name: vmk10
    MAC Address: 00:50:56:ab:cd:ef
    Enabled: true
    Portset: DvsPortset-1
    Portgroup: N/A
    Netstack Instance: vxlan
    VDS Name: ndvSwitch0
    VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
    VDS Port: 10
    VDS Connection: 1543783261
    Opaque Network ID: N/A
    Opaque Network Type: N/A
    External ID: N/A
    MTU: 1600
    TSO MSS: 65535
    RXDispQueue Size: 1
    Port ID: 67108870
 vmk50
    Name: vmk50
    MAC Address: 00:50:56:ab:cd:ef
    Enabled: true
    Portset: DvsPortset-1
    Portgroup: N/A
    Netstack Instance: hyperbus
    VDS Name: ndvSwitch0
    VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
    VDS Port: 0d02d084-213c-491e-9bb5-1ab895e98b5d
    VDS Connection: 1543783261
    Opaque Network ID: N/A
    Opaque Network Type: N/A
    External ID: N/A
    MTU: 1500
    TSO MSS: 65535
    RXDispQueue Size: 1
    Port ID: 67108871
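
On a host with many VMkernel ports, the NSX ones can be picked out of this listing by their netstack. A minimal sketch, assuming the output format shown above and that the NSX ports always sit on the vxlan or hyperbus netstack. The function name nsx_vmk_ports is made up for this example.

```shell
# Print the names of VMkernel ports bound to an NSX netstack.
# Reads "esxcli network ip interface list" output on stdin.
nsx_vmk_ports() {
    awk '
        # "Name: vmk10" carries the port name ("VDS Name:" has a different first field).
        $1 == "Name:" { name = $2 }
        # "Netstack Instance: vxlan" tells which IP stack the port uses.
        $1 == "Netstack" && $2 == "Instance:" && ($3 == "vxlan" || $3 == "hyperbus") { print name }
    '
}

# Usage on the host:
#   esxcli network ip interface list | nsx_vmk_ports
```

For the listing above this prints vmk10 and vmk50, each on its own line.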

Remove the NSX VMkernel ports (this could include vmk11).

[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk10
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk50

Now list the netstacks

[root@esxi-01a:~] esxcli network ip netstack list
 defaultTcpipStack
    Key: defaultTcpipStack
    Name: defaultTcpipStack
    State: 4660
 vxlan
    Key: vxlan
    Name: vxlan
    State: 4660
 hyperbus
    Key: hyperbus
    Name: hyperbus
    State: 4660

Remove the NSX related netstacks

[root@esxi-01a:~] esxcli network ip netstack remove --netstack=vxlan
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=hyperbus
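
The list and remove steps can be combined so that only the NSX netstacks that actually exist on the host are targeted. This is a rough sketch, assuming the netstack list format shown above. The helper name nsx_netstack_remove_cmds is hypothetical, and it only prints the commands instead of running them.

```shell
# Print a remove command for every NSX netstack found in the
# "esxcli network ip netstack list" output read from stdin.
nsx_netstack_remove_cmds() {
    awk '$1 == "Key:" && ($2 == "vxlan" || $2 == "hyperbus") {
        print "esxcli network ip netstack remove --netstack=" $2
    }'
}

# Usage on the host (review the output before executing it):
#   esxcli network ip netstack list | nsx_netstack_remove_cmds
```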

Now check if the N-VDS still exists. If so, it has an impact on the removal of the actual VIBs later in this section.

[root@esxi-01a:~] esxcfg-vswitch -l

The last step is to remove all the NSX related VIBs that still exist on the node. You cannot remove the VIBs in an arbitrary order, since they depend on each other. First list the VIBs to be removed from the node.

The full list of NSX-T 2.x VIBs normally present on an ESXi based node is:

[root@host:~] esxcli software vib list | grep -E 'nsx|vsipfwlib'
nsx-adf                        2.5.1.0.0-6.7.15314402                VMware  VMwareCertified   2020-01-03
 nsx-aggservice                 2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-cli-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-common-libs                2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-context-mux                2.5.1.0.0esx67-15314456               VMware  VMwareCertified   2020-01-03
 nsx-esx-datapath               2.5.1.0.0-6.7.15314311                VMware  VMwareCertified   2020-01-03
 nsx-exporter                   2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-host                       2.5.1.0.0-6.7.15314289                VMware  VMwareCertified   2020-01-03
 nsx-metrics-libs               2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-mpa                        2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-nestdb-libs                2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-nestdb                     2.5.1.0.0-6.7.15314393                VMware  VMwareCertified   2020-01-03
 nsx-netcpa                     2.5.1.0.0-6.7.15314440                VMware  VMwareCertified   2020-01-03
 nsx-netopa                     2.5.1.0.0-6.7.15314363                VMware  VMwareCertified   2020-01-03
 nsx-opsagent                   2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-platform-client            2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-profiling-libs             2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-proxy                      2.5.1.0.0-6.7.15314435                VMware  VMwareCertified   2020-01-03
 nsx-python-gevent              1.1.0-9273114                         VMware  VMwareCertified   2018-12-02
 nsx-python-greenlet            0.4.9-12819723                        VMware  VMwareCertified   2019-09-20
 nsx-python-logging             2.5.1.0.0-6.7.15314402                VMware  VMwareCertified   2020-01-03
 nsx-python-protobuf            2.6.1-12818951                        VMware  VMwareCertified   2019-09-20
 nsx-rpc-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-sfhc                       2.5.1.0.0-6.7.15314423                VMware  VMwareCertified   2020-01-03
 nsx-shared-libs                2.5.1.0.0-6.7.15036308                VMware  VMwareCertified   2020-01-03
 nsx-upm-libs                   2.5.1.0.0-6.7.15314375                VMware  VMwareCertified   2020-01-03
 nsx-vdpi                       2.5.1.0.0-6.7.15314422                VMware  VMwareCertified   2020-01-03
 nsxcli                         2.5.1.0.0-6.7.15314296                VMware  VMwareCertified   2020-01-03

NSX-T 3.x has a different set of VIBs. On an ESXi host, they are:

[root@host:~] esxcli software vib list | grep -E 'nsx|vsipfwlib'
nsx-adf                        3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-cfgagent                   3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-context-mux                3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-cpp-libs                   3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-esx-datapath               3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-exporter                   3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-host                       3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-ids                        3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-monitoring                 3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-mpa                        3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-nestdb                     3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-netopa                     3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-opsagent                   3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-platform-client            3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-proto2-libs                3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-proxy                      3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-python-gevent              1.1.0-15366959                       VMware  VMwareCertified   2021-02-16
 nsx-python-greenlet            0.4.14-16723199                      VMware  VMwareCertified   2021-02-16
 nsx-python-logging             3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-python-protobuf            2.6.1-16723197                       VMware  VMwareCertified   2021-02-16
 nsx-python-utils               3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-sfhc                       3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-shared-libs                3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsx-vdpi                       3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 nsxcli                         3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19
 vsipfwlib                      3.1.2.0.0-7.0.17883598               VMware  VMwareCertified   2021-04-19  
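
Since the de-install order differs between 2.x and 3.x, it helps to know which major version is on the host. As a small sketch, the version can be read from the nsx-host line of the listings above. It assumes the version column always starts with the NSX-T version, as it does in both listings. nsx_major_version is a made-up helper name.

```shell
# Print the NSX-T major version (for example 2 or 3) based on the nsx-host VIB.
# Reads "esxcli software vib list" output on stdin; prints nothing if
# the nsx-host VIB is not installed.
nsx_major_version() {
    awk '$1 == "nsx-host" { split($2, v, "."); print v[1]; exit }'
}

# Usage on the host:
#   esxcli software vib list | nsx_major_version
```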

Now remove the VIBs from the node in the correct order. The “nsx-esx-datapath” VIB, the last one in the 2.x order, cannot be removed the regular way if the N-VDS is still present on the node. Only for that VIB the “--no-live-install” switch must be added. Run the command for every VIB to be removed.

For NSX-T 2.4 and 2.5 the de-install list and order is:

[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
 nsx-host
 nsx-adf    
 nsx-exporter
 nsx-aggservice
 nsx-platform-client
 nsx-python-logging
 nsx-opsagent
 nsx-proxy
 nsx-nestdb
 nsx-sfhc
 nsx-context-mux
 nsx-python-protobuf
 nsx-python-greenlet
 nsx-python-gevent
 nsxcli
 nsx-netopa (2.5 only)
 nsx-netcpa
 nsx-profiling-libs
 nsx-mpa
 nsx-vdpi
 nsx-nestdb-libs
 nsx-rpc-libs
 nsx-metrics-libs
 nsx-upm-libs
 nsx-common-libs
 nsx-shared-libs
 nsx-cli-libs
 nsx-esx-datapath --no-live-install

For NSX-T 3.x the de-install list and order is different:

[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
nsx-host
nsx-adf
nsx-exporter
nsx-context-mux
nsx-platform-client
nsx-opsagent
nsx-proxy
nsx-sfhc
nsx-netopa
nsxcli
nsx-nestdb
nsx-cfgagent
nsx-mpa
nsx-vdpi
nsx-ids (nsx-idps in 3.0)
nsx-monitoring
nsx-python-logging
nsx-python-protobuf
nsx-python-greenlet
nsx-python-gevent
nsx-python-utils
nsx-cpp-libs
nsx-esx-datapath --no-live-install
vsipfwlib
nsx-proto2-libs
nsx-shared-libs

Without the “--no-live-install” switch, an error is thrown if an “opaque” portgroup connected to an N-VDS is present on the node.

NSX-T Remove datapath VIB
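
Typing the remove command once per VIB is error-prone, so the order can be put in a small script. The sketch below uses the NSX-T 3.x order from the list above and only prints the commands for VIBs that are still installed. It does not execute anything. The names NSX3_ORDER and plan_vib_removal are made up for this example. For NSX-T 3.0 replace nsx-ids with nsx-idps, and for 2.4 or 2.5 pass the 2.x order instead.

```shell
# De-install order for NSX-T 3.x, copied from the list in this post.
NSX3_ORDER="nsx-host nsx-adf nsx-exporter nsx-context-mux nsx-platform-client
nsx-opsagent nsx-proxy nsx-sfhc nsx-netopa nsxcli nsx-nestdb nsx-cfgagent
nsx-mpa nsx-vdpi nsx-ids nsx-monitoring nsx-python-logging nsx-python-protobuf
nsx-python-greenlet nsx-python-gevent nsx-python-utils nsx-cpp-libs
nsx-esx-datapath vsipfwlib nsx-proto2-libs nsx-shared-libs"

# plan_vib_removal ORDER
# Reads "esxcli software vib list" output on stdin and prints one remove
# command per installed VIB, following ORDER. Nothing is executed.
plan_vib_removal() {
    order=$1
    # The first column of the VIB list is the VIB name.
    installed=$(awk '{ print $1 }')
    for vib in $order; do
        # Skip VIBs that are not (or no longer) installed on this host.
        echo "$installed" | grep -qxF -- "$vib" || continue
        if [ "$vib" = "nsx-esx-datapath" ]; then
            echo "esxcli software vib remove -n $vib --no-live-install"
        else
            echo "esxcli software vib remove -n $vib"
        fi
    done
}

# Usage on the host (review the printed commands before running them):
#   esxcli software vib list | grep -E 'nsx|vsipfwlib' | plan_vib_removal "$NSX3_ORDER"
```

Printing instead of executing is a deliberate choice here: on a broken node you want to see what will be removed, and in which order, before committing to it.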

To conclude

Hopefully this post shows you which options are available to get ESXi based Transport Nodes back to a working state when things go bad. This post is not intended to troubleshoot or fix TN related issues, but rather to show how to clean up a node and start with a fresh NSX-T config and VIBs without re-installing it.

To wrap it all up, I would like to thank Manny and Wesley, the writers of the articles I based mine upon. Lastly, beware of the consequences of performing one of the above options in a live environment. Contact VMware support if you are not sure what the impact may be.

Useful links

Uninstall sequence NSX-T VIBs (2.2)

NSX-T Error – Failed to uninstall the software on host. MPA not working. Host is disconnected (2.4)

NSX-T Data Center Troubleshooting Guide (2.3)

NSX-T 3.x Installation Guide

Remove a Host From NSX-T or Uninstall NSX-T Completely. Part of the NSX-T Install Guide (2.2)


4 Comments

Jurgen Mutzberg · October 16, 2020 at 10:55

Hi Daniel,
this is an EXCELLENT post!
It saved me hours to try to remove a faulty ESXi node from my NSX-T setup.
Really very well written post.
Thanks,
Jurgen

    Daniël Zuthof · October 22, 2020 at 22:24

    Hi Jurgen. It makes me happy to read my post is very useful to you. Thanks for your kind reaction.

Calvin Wong · July 20, 2021 at 05:30

Good one !! the buggy NSX-T needs more nice doc like what you’ve written!

😀
