In this post I want to share the options for manually removing problematic ESXi based Transport Nodes running NSX-T 2.4+ from NSX Manager, including the node's NSX configuration and accompanying VIBs. To be clear, this should be your last resort if the regular procedures to remove ESXi based Transport Nodes from NSX Manager no longer work. In production environments I would even suggest performing these actions only if asked by VMware Tech Support.
I started this post because of issues I stumbled upon when upgrading NSX-T in my homelab. It builds on some articles written by fellow VMware community members that really helped in troubleshooting the issues. So credits to the original authors: Wesley Geelhoed for his post about “Uninstall sequence NSX-T VIBs (2.2)” and Manny Sidhu for his post about “NSX-T Error – Failed to uninstall the software on host”, including the useful reply by Ruurd Bakker. Lastly, the rather old NSX-T 2.2 documentation page about “Remove a host from NSX-T” provided some good background info.
The post by Wesley and the NSX-T documentation page are both based on version 2.2. Some of the VIB modules in NSX-T 2.4 and 2.5 have changed, while others have been added since version 2.2. So, from my perspective, a new post that puts it all together is needed.
- Post updated for NSX-T 3.x which has other UI options, CLI output and VIB list / de-install order
In my homelab I have a 4-node nested ESXi cluster that ran NSX-T 2.4.2, which worked fine before the upgrade. After 2.5 was released a while ago, it was time to upgrade. After upgrading to the 2.5 GA version, one host had issues. The culprit host no longer joined the Transport Zone and also could not connect to the NSX Manager. Probably a result of me not checking the warnings in the Pre-Check stage during the Transport Node upgrade because it’s “just” a home lab :-).
In hindsight probably the root volume and /tmp mountpoint of the ESXi host did not have sufficient free space. During troubleshooting I did not get NSX to function properly anymore on that host. To reset the configuration, even the option “Remove NSX” in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu failed.
My last resort was to remove all the NSX related configuration and modules from the culprit host. While searching for how to do this, I stumbled upon the articles mentioned above. I will describe all options in this post. All the options only apply to ESXi based Transport Nodes running NSX-T 2.4, 2.5 and 3.x.
To solve the issue, a couple of solutions are available. If possible, put the node to be removed in “Maintenance Mode”, move the VMs off it and reboot after performing one of the solutions.
The best order to manually remove Transport Nodes is shown below. Start with the simplest solution, down to the more complex and rigorous one using ESXi built-in tools.
- Using NSX Manager
- Using NSXCLI
- Using ESXi native tools
Pay special attention if you have nodes with only 2 NICs. In this case you only have one N-VDS with all VMkernel ports connected to it. When you perform one of the solutions below on such nodes, all network connectivity will stop functioning. Connect to the console of the node first. In the node's DCUI you can reset the network config to default and restore management connectivity to the node.
Using NSX Manager
The install guide also mentions:
If NSX Intelligence is also deployed on the host, uninstallation of NSX-T Data Center will fail because all transport nodes become part of a default network security group. To successfully uninstall NSX-T Data Center, you also need to select the Force Delete option before proceeding with uninstallation.
The regular solution to remove NSX from a specific TN is to use the “Remove NSX” option in NSX Manager. The “Remove NSX” option can be found in NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu.
The “Remove NSX” option is only available if a “Transport Node Profile” is not attached to the cluster the node belongs to. More on that in the next section “Transport Node Profiles”. When selecting the “Remove NSX” option, the “Delete Transport Node” screen appears and has 2 options.
- Uninstall NSX Components (NSX-T 2.x only)
- Force Delete
When nothing is checked, it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node local config, but will not de-install the VIBs. An error is thrown if VMs are still attached to the N-VDS or VDS.
When selecting “Uninstall NSX Components” only (NSX-T 2.x), it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node local config and will de-install the VIBs. An error is thrown if VMs are still attached to the N-VDS.
When selecting “Force Delete” only, it will forcefully remove the TN configuration in NSX Manager. This option is very useful when the node cannot be recovered anymore and / or is not reachable over the management interface. If the node gets into the “Orphaned” state, the command needs to run twice before the TN config is fully cleaned up in NSX Manager.
When selecting both (NSX-T 2.x), it will forcefully remove the TN configuration in NSX Manager. If the node is reachable over the management interface it also clears the node local config and will de-install the VIBs, even if VMs are still attached to the N-VDS. In most cases you would not need both options to be selected before removing a faulty TN.
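If you prefer to script this step, the same removal can, as far as I know, also be requested through the NSX Manager REST API. The snippet below is only a sketch: it assumes that the `DELETE /api/v1/transport-nodes/<node-id>` call accepts the `unprepare_host` and `force` query parameters and that these correspond to the “Uninstall NSX Components” and “Force Delete” checkboxes. The manager name and node ID are placeholders, and the helper name `tn_delete_url` is made up for this example. Check the API guide of your NSX-T version before relying on it.

```shell
#!/bin/sh
# Build the URL for deleting a Transport Node via the NSX Manager API.
# ASSUMPTION: unprepare_host ~ "Uninstall NSX Components", force ~ "Force Delete".
# Arguments: <manager> <node-id> <unprepare_host: true|false> <force: true|false>
tn_delete_url() {
    printf 'https://%s/api/v1/transport-nodes/%s?unprepare_host=%s&force=%s\n' \
        "$1" "$2" "$3" "$4"
}

# Usage (prints the curl call for review instead of executing it):
#   echo curl -k -u admin -X DELETE "$(tn_delete_url nsxmgr.lab.local NODE-ID false true)"
```

Printing the call first instead of running it keeps the sketch harmless; a forced delete is hard to undo.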
Transport Node Profiles
When a “Transport Node Profile” is attached to the cluster the faulty node resides in, the “Remove NSX” option is not available for that specific node, but only for the whole cluster. In this case use the “Detach Transport Node Profile” option in the “Actions” menu.
When detaching the “Transport Node Profile” from a cluster it has no impact on data plane traffic within the cluster. Do not confuse the Detach option with the “Remove NSX” option, which will remove the configuration and VIBs from all the nodes in the cluster!
Using NSXCLI
To remove a faulty TN, follow the general steps in the NSX-T 2.2 Install guide or NSX-T 3.x Installation Guide, which mostly still apply to the 2.5+ versions. If all the NSX-T modules are still present on the node, this method should be your preferred one over using the ESXi native tools in the next section.
On the NSX Manager get the thumbprint.
manager> get certificate api thumbprint
Detach the Transport Node from NSX Manager if possible.
[root@esxi-01a:~] nsxcli
host> detach management-plane <MANAGER> username <ADMIN-USER> password <ADMIN-PASSWORD> thumbprint <MANAGER-THUMBPRINT>
Remove NSX-T filters. If the command is run without the -Override switch an error is displayed first.
[root@esxi-01a:~] vsipioctl clearallfilters
ERROR: Command clearallfilters is dangerous and can cause unintended consequence!
ERROR: Please supply first option -Override in order to override the safety guard and actually run the command.
[root@esxi-01a:~] vsipioctl clearallfilters -Override
Removing all vmware-sfw filters...
Cleared dvfilter include table.
Updated all VMs to remove filters.
Destroyed all disconnected filters (please ignore 'Function not implemented' error if there is).
Stop the netcpa service (NSX-T 2.x).
[root@esxi-01a:~] /etc/init.d/netcpad stop
Remove all NSX-T VIBs and relevant config, and reboot the host afterwards. In NSX-T 2.4 the “del nsx” command is executed without warning, in contrast to 2.5 and higher, which display a warning like the one below.
[root@esxi-01a:~] nsxcli
esxi-01a> del nsx
**** STOP STOP STOP STOP STOP ****
Carefully read the requirements and limitations of this command:
1. Read NSX-T documentation for 'Remove a Host from NSX-T Data Center or Uninstall NSX-T Data Center Completely'.
2. Deletion of this Transport Node from the NSX-T UI or API failed, and this is the last resort.
3. If this is an ESXi host:
   a. The host must be in maintenance mode.
   b. All resources attached to NSXPGs must be moved out.
   If the above conditions for ESXi hosts are not met, the command WILL fail.
4. If this is a Linux host:
   a. If KVM is managing VM tenants then shut them down before running this command.
   b. This command should be run from the host console and may fail if run from an SSH client or any other network based shell client.
   c. The 'nsxcli -c del nsx' form of this command is not supported
5. For command progress check /scratch/log/nsxcli.log on ESXi host or /var/log/nsxcli.log on non-ESXi host.
Are you sure you want to remove NSX-T on this host? (yes/no)
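Since “del nsx” fails on ESXi when the host is not in maintenance mode, it can be worth checking that first from the ESXi shell. A minimal sketch, assuming `esxcli system maintenanceMode get` prints `Enabled` or `Disabled`; the function names are my own and not part of any NSX tooling:

```shell
#!/bin/sh
# Returns 0 when the host reports that maintenance mode is enabled.
in_maintenance_mode() {
    [ "$(esxcli system maintenanceMode get 2>/dev/null)" = "Enabled" ]
}

# Prints a go / no-go message before 'del nsx' is started by hand in nsxcli.
del_nsx_precheck() {
    if in_maintenance_mode; then
        echo "OK: host is in maintenance mode, 'del nsx' can be attempted from nsxcli"
    else
        echo "STOP: put the host in maintenance mode before running 'del nsx'"
        return 1
    fi
}

# Usage: del_nsx_precheck && nsxcli
```

The check only covers requirement 3a from the warning. Moving resources off the NSX port groups (3b) still has to be verified by hand.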
If the faulty node is still in a configured state in NSX Manager, perform the “Remove NSX” option (described in the previous section) with the “Force Delete” option selected. This will clean up its configuration in NSX Manager.
Using ESXi native tools
When none of the above options is possible for whatever reason, a manual cleanup may be needed. A manual cleanup can be performed using standard ESXi based tools if the ESXi bootbank is still okay.
The config that is normally removed by NSX Manager or NSXCLI (besides the VIBs) is:
- The NSX VMkernel ports (vxlan and hyperbus)
- The NSX network IO filters
- The NSX IP stacks (vxlan and hyperbus)
So, how do you perform a manual cleanup? First check whether the network IO filters can be removed. See the previous section for how to perform this action.
Secondly, delete the NSX VMkernel ports. Normally vmk10 is the vxlan kernel port (despite the name, it's actually the GENEVE overlay) and vmk50 is the hyperbus kernel port.
You might think, what is the hyperbus? The hyperbus is the component that performs the actual network auto-plumbing. In the image below this is the Intra T0/1 routing (#4 in the image) between the Distributed Router (DR) and Service Router (SR) components (169.254.0.0 subnet) and the Inter T0 /T1 routing (#3 in the image) between T1 SR and T0 DR (100.64.0.0 subnet).
List the kernel ports.
[root@esxi-01a:~] esxcli network ip interface list
…
vmk10
   Name: vmk10
   MAC Address: 00:50:56:ab:cd:ef
   Enabled: true
   Portset: DvsPortset-1
   Portgroup: N/A
   Netstack Instance: vxlan
   VDS Name: ndvSwitch0
   VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
   VDS Port: 10
   VDS Connection: 1543783261
   Opaque Network ID: N/A
   Opaque Network Type: N/A
   External ID: N/A
   MTU: 1600
   TSO MSS: 65535
   RXDispQueue Size: 1
   Port ID: 67108870
vmk50
   Name: vmk50
   MAC Address: 00:50:56:ab:cd:ef
   Enabled: true
   Portset: DvsPortset-1
   Portgroup: N/A
   Netstack Instance: hyperbus
   VDS Name: ndvSwitch0
   VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
   VDS Port: 0d02d084-213c-491e-9bb5-1ab895e98b5d
   VDS Connection: 1543783261
   Opaque Network ID: N/A
   Opaque Network Type: N/A
   External ID: N/A
   MTU: 1500
   TSO MSS: 65535
   RXDispQueue Size: 1
   Port ID: 67108871
Remove the NSX VMkernel ports (could include vmk11)
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk10
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk50
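Because the vmk numbers can differ per host (vmk11 may exist as well), it can help to derive the ports from the netstack they are bound to instead of hard-coding them. The sketch below assumes the `esxcli network ip interface list` layout shown above; the function names are made up for this example. It only prints the removal commands so you can review them before running anything.

```shell
#!/bin/sh
# Print the names of VMkernel ports bound to the NSX netstacks (vxlan, hyperbus).
# Reads `esxcli network ip interface list` output on stdin.
nsx_vmk_ports() {
    awk '
        $1 == "Name:" { name = $2 }
        $1 == "Netstack" && $2 == "Instance:" && ($3 == "vxlan" || $3 == "hyperbus") { print name }
    '
}

# Print (not run) the removal command for every NSX VMkernel port found.
print_vmk_removal() {
    esxcli network ip interface list | nsx_vmk_ports | while read -r vmk; do
        echo "esxcli network ip interface remove --interface-name=$vmk"
    done
}

# Usage: print_vmk_removal
```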
Now list the netstacks
[root@esxi-01a:~] esxcli network ip netstack list
defaultTcpipStack
   Key: defaultTcpipStack
   Name: defaultTcpipStack
   State: 4660
vxlan
   Key: vxlan
   Name: vxlan
   State: 4660
hyperbus
   Key: hyperbus
   Name: hyperbus
   State: 4660
Remove the NSX related netstacks
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=vxlan
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=hyperbus
Now check if the N-VDS still exists. If so, it affects the removal of the actual VIBs later in this section.
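The netstack cleanup can be sketched the same way: filter the `esxcli network ip netstack list` output for the two NSX stacks and print the matching removal commands. This assumes the `Key:` lines shown above; the function names are hypothetical. If a stack is already gone, it simply produces no command for it.

```shell
#!/bin/sh
# Print the NSX related netstack keys (vxlan, hyperbus) found on stdin.
nsx_netstacks() {
    awk '$1 == "Key:" && ($2 == "vxlan" || $2 == "hyperbus") { print $2 }'
}

# Print (not run) the removal command for every NSX netstack still present.
print_netstack_removal() {
    esxcli network ip netstack list | nsx_netstacks | while read -r stack; do
        echo "esxcli network ip netstack remove --netstack=$stack"
    done
}

# Usage: print_netstack_removal
```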
[root@esxi-01a:~] esxcfg-vswitch -l
The last step is to remove all the NSX related VIBs that still exist on the node. The VIBs cannot be removed in an arbitrary order, since they depend on each other. First list the VIBs to be removed from the node.
The full list of NSX-T 2.x VIBs normally present on an ESXi based node is:
[root@host:~] esxcli software vib list | grep -E 'nsx|vsipfwlib'
nsx-adf              184.108.40.206.0-6.7.15314402    VMware  VMwareCertified  2020-01-03
nsx-aggservice       220.127.116.11.0-6.7.15314423    VMware  VMwareCertified  2020-01-03
nsx-cli-libs         18.104.22.168.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-common-libs      22.214.171.124.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-context-mux      126.96.36.199.0esx67-15314456    VMware  VMwareCertified  2020-01-03
nsx-esx-datapath     188.8.131.52.0-6.7.15314311      VMware  VMwareCertified  2020-01-03
nsx-exporter         184.108.40.206.0-6.7.15314423    VMware  VMwareCertified  2020-01-03
nsx-host             220.127.116.11.0-6.7.15314289    VMware  VMwareCertified  2020-01-03
nsx-metrics-libs     18.104.22.168.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-mpa              22.214.171.124.0-6.7.15314423    VMware  VMwareCertified  2020-01-03
nsx-nestdb-libs      126.96.36.199.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-nestdb           188.8.131.52.0-6.7.15314393      VMware  VMwareCertified  2020-01-03
nsx-netcpa           184.108.40.206.0-6.7.15314440    VMware  VMwareCertified  2020-01-03
nsx-netopa           220.127.116.11.0-6.7.15314363    VMware  VMwareCertified  2020-01-03
nsx-opsagent         18.104.22.168.0-6.7.15314423     VMware  VMwareCertified  2020-01-03
nsx-platform-client  22.214.171.124.0-6.7.15314423    VMware  VMwareCertified  2020-01-03
nsx-profiling-libs   126.96.36.199.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-proxy            188.8.131.52.0-6.7.15314435      VMware  VMwareCertified  2020-01-03
nsx-python-gevent    1.1.0-9273114                    VMware  VMwareCertified  2018-12-02
nsx-python-greenlet  0.4.9-12819723                   VMware  VMwareCertified  2019-09-20
nsx-python-logging   184.108.40.206.0-6.7.15314402    VMware  VMwareCertified  2020-01-03
nsx-python-protobuf  2.6.1-12818951                   VMware  VMwareCertified  2019-09-20
nsx-rpc-libs         220.127.116.11.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-sfhc             18.104.22.168.0-6.7.15314423     VMware  VMwareCertified  2020-01-03
nsx-shared-libs      22.214.171.124.0-6.7.15036308    VMware  VMwareCertified  2020-01-03
nsx-upm-libs         126.96.36.199.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-vdpi             188.8.131.52.0-6.7.15314422      VMware  VMwareCertified  2020-01-03
nsxcli               184.108.40.206.0-6.7.15314296    VMware  VMwareCertified  2020-01-03
NSX-T 3.x has a different set of VIBs. On an ESXi host, they are:
[root@host:~] esxcli software vib list | grep -E 'nsx|vsipfwlib'
nsx-adf              220.127.116.11.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-cfgagent         18.104.22.168.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-context-mux      22.214.171.124.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-cpp-libs         126.96.36.199.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-esx-datapath     188.8.131.52.0-7.0.17883598      VMware  VMwareCertified  2021-04-19
nsx-exporter         184.108.40.206.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-host             220.127.116.11.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-ids              18.104.22.168.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-monitoring       22.214.171.124.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-mpa              126.96.36.199.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-nestdb           188.8.131.52.0-7.0.17883598      VMware  VMwareCertified  2021-04-19
nsx-netopa           184.108.40.206.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-opsagent         220.127.116.11.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-platform-client  18.104.22.168.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-proto2-libs      22.214.171.124.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-proxy            126.96.36.199.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-python-gevent    1.1.0-15366959                   VMware  VMwareCertified  2021-02-16
nsx-python-greenlet  0.4.14-16723199                  VMware  VMwareCertified  2021-02-16
nsx-python-logging   188.8.131.52.0-7.0.17883598      VMware  VMwareCertified  2021-04-19
nsx-python-protobuf  2.6.1-16723197                   VMware  VMwareCertified  2021-02-16
nsx-python-utils     184.108.40.206.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-sfhc             220.127.116.11.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsx-shared-libs      18.104.22.168.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
nsx-vdpi             22.214.171.124.0-7.0.17883598    VMware  VMwareCertified  2021-04-19
nsxcli               126.96.36.199.0-7.0.17883598     VMware  VMwareCertified  2021-04-19
vsipfwlib            188.8.131.52.0-7.0.17883598      VMware  VMwareCertified  2021-04-19
Now remove the VIBs from the node in the correct order. The “nsx-esx-datapath” VIB (the last one in the 2.x order) cannot be removed live if the N-VDS is still present on the node. Only for that VIB the “--no-live-install” switch must be added. Run the command once for every VIB to be removed.
For NSX-T 2.4 and 2.5 the de-install list and order is:
[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
nsx-host
nsx-adf
nsx-exporter
nsx-aggservice
nsx-platform-client
nsx-python-logging
nsx-opsagent
nsx-proxy
nsx-nestdb
nsx-sfhc
nsx-context-mux
nsx-python-protobuf
nsx-python-greenlet
nsx-python-gevent
nsxcli
nsx-netopa (2.5 only)
nsx-netcpa
nsx-profiling-libs
nsx-mpa
nsx-vdpi
nsx-nestdb-libs
nsx-rpc-libs
nsx-metrics-libs
nsx-upm-libs
nsx-common-libs
nsx-shared-libs
nsx-cli-libs
nsx-esx-datapath --no-live-install
For NSX-T 3.x the de-install list and order is different:
[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
nsx-host
nsx-adf
nsx-exporter
nsx-context-mux
nsx-platform-client
nsx-opsagent
nsx-proxy
nsx-sfhc
nsx-netopa
nsxcli
nsx-nestdb
nsx-cfgagent
nsx-mpa
nsx-vdpi
nsx-ids (nsx-idps in 3.0)
nsx-monitoring
nsx-python-logging
nsx-python-protobuf
nsx-python-greenlet
nsx-python-gevent
nsx-python-utils
nsx-cpp-libs
nsx-esx-datapath --no-live-install
vsipfwlib
nsx-proto2-libs
nsx-shared-libs
Without the “--no-live-install” switch an error is thrown if an “opaque” portgroup connected to an N-VDS is present on the node.
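Typing the remove command for every VIB gets tedious, so the NSX-T 3.x order above can be turned into a small helper. This is a sketch, not a tested removal tool: the variable and function names are my own, it skips VIBs that are not installed (so both nsx-ids and nsx-idps can stay in the list), and it only prints the commands so the plan can be checked before anything is removed.

```shell
#!/bin/sh
# De-install order for NSX-T 3.x, copied from the list above.
# nsx-idps (3.0) and nsx-ids (later 3.x) share a slot; missing VIBs are skipped.
NSX3_VIBS="nsx-host nsx-adf nsx-exporter nsx-context-mux nsx-platform-client
nsx-opsagent nsx-proxy nsx-sfhc nsx-netopa nsxcli nsx-nestdb nsx-cfgagent
nsx-mpa nsx-vdpi nsx-ids nsx-idps nsx-monitoring nsx-python-logging
nsx-python-protobuf nsx-python-greenlet nsx-python-gevent nsx-python-utils
nsx-cpp-libs nsx-esx-datapath vsipfwlib nsx-proto2-libs nsx-shared-libs"

# Print the removal command for each given VIB, in the order given,
# but only for VIBs that `esxcli software vib list` reports as installed.
vib_removal_plan() {
    installed="$(esxcli software vib list | awk '{ print $1 }')"
    for vib in "$@"; do
        echo "$installed" | grep -qxF -- "$vib" || continue
        if [ "$vib" = "nsx-esx-datapath" ]; then
            # The datapath VIB cannot be removed live while the N-VDS exists.
            echo "esxcli software vib remove -n $vib --no-live-install"
        else
            echo "esxcli software vib remove -n $vib"
        fi
    done
}

# Usage (word splitting of the list is intentional):
#   vib_removal_plan $NSX3_VIBS
```

The 2.4/2.5 order can be passed in the same way by putting that list in its own variable.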
Hopefully this post shows you which options are available to remediate ESXi based Transport Nodes back to a working state when things go bad. This post is not intended to troubleshoot or fix TN related issues, but more how to be able to clean it up and start with a fresh NSX-T config and VIBs without re-installing the node.
To wrap it all up, I would like to thank Manny and Wesley, who wrote the articles I based mine upon. Lastly, beware of the consequences of performing one of the above options in a live environment. Contact VMware support if you are not sure what the impact may be.