In this post I want to share the options for manually removing problematic ESXi based Transport Nodes running NSX-T 2.4+ from NSX Manager, including the node's NSX configuration and accompanying VIBs. To be clear, this should be your last resort if the regular procedures to remove ESXi based Transport Nodes from NSX Manager no longer work. In production environments I would even suggest performing these actions only when asked to by VMware Tech Support.
I started this post because of issues I stumbled upon when upgrading NSX-T in my homelab. It builds on some articles written by fellow VMware community members that really helped me troubleshoot the issues. So credits to the original authors: Wesley Geelhoed for his post “Uninstall sequence NSX-T VIBs (2.2)” and Manny Sidhu for his post “NSX-T Error – Failed to uninstall the software on host”, including the useful reply by Ruurd Bakker. Lastly, the rather old NSX-T 2.2 documentation page “Remove a host from NSX-T” provided some good background info.
The post by Wesley and the NSX-T documentation page are both based on version 2.2. Since then, some of the VIB modules have changed in NSX-T 2.4 and 2.5, and others have been added. So, from my perspective, a new post that puts it all together is needed.
In my homelab I have a 4-node nested ESXi cluster that ran NSX-T 2.4.2, which worked fine before the upgrade. Some time after 2.5 was released, it was time to upgrade. After upgrading to the 2.5 GA version, one host had issues. The culprit host no longer joined the Transport Zone and could not connect to the NSX Manager either. This was probably the result of me not checking the warnings in the Pre-Check stage of the Transport Node upgrade because it’s “just” a home lab :-).
In hindsight, the root volume and /tmp mountpoint of the ESXi host probably did not have sufficient free space. During troubleshooting I could not get NSX to function properly on that host again. Even resetting the configuration with the “Remove NSX” option in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu failed.
My last resort was to remove all the NSX related configuration and modules from the culprit host. While searching for how to do this, I stumbled upon the articles mentioned above. I will describe all options in this post. They only apply to ESXi based Transport Nodes running NSX-T 2.4+.
To solve the issue, a couple of solutions are available. If possible, put the node to be removed in “Maintenance Mode” and move the VMs off it first, then reboot after performing one of the solutions.
The best order in which to try them is shown below. Start with the simplest solution and work down to the more complex and rigorous one using ESXi built-in tools.
- Using NSX Manager
- Using NSXCLI
- Using ESXi native tools
Pay special attention if you have nodes with only 2 NICs. In that case you only have one N-VDS with all VMkernel ports connected to it. When you perform one of the solutions below on such a node, all network connectivity will stop functioning. Connect to the console of the node first. In the node's DCUI you can reset the network config to default and restore management connectivity to the node.
Using NSX Manager
If issues arise on an ESXi based Transport Node, general troubleshooting is the first step towards a solution. The NSX-T v2.3 Troubleshooting Guide can be helpful.
The regular solution to remove NSX from specific TNs is to use the “Remove NSX” option in NSX Manager. The “Remove NSX” option can be found in the NSX Manager > System > Fabric > Nodes > “Host Transport Node” menu.
The “Remove NSX” option is only available for specific nodes if no “Transport Node Profile” is attached to the cluster the node belongs to. More on that in the next section. When selecting the “Remove NSX” option, the “Delete Transport Node” screen appears and offers 2 options.
- Uninstall NSX Components
- Force Delete
When nothing is checked, it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node's local config, but will not de-install the VIBs. An error is thrown if VMs are still attached to the N-VDS.
When selecting “Uninstall NSX Components” only, it will try to remove the TN configuration in a safe way in NSX Manager. If the node is reachable over the management interface it also clears the node's local config and will de-install the VIBs. An error is thrown if VMs are still attached to the N-VDS.
When selecting “Force Delete” only, it will forcefully remove the TN configuration in NSX Manager. This option is very useful when the node cannot be recovered anymore and/or is not reachable over the management interface. If the node gets into the “Orphaned” state, the action needs to be run twice before the TN config is fully cleaned up in NSX Manager.
When selecting both, it will forcefully remove the TN configuration in NSX Manager. If the node is reachable over the management interface it also clears the node's local config and will de-install the VIBs, even if VMs are still attached to the N-VDS. In most cases you do not need to select both options to remove a faulty TN.
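The four checkbox combinations described above can be condensed into a small lookup. The following is only a sketch of my reading of that description, not NSX Manager's actual logic; the names `RemoveNsxOutcome` and `remove_nsx_outcome` are made up for illustration.

```python
from typing import NamedTuple


class RemoveNsxOutcome(NamedTuple):
    """Hypothetical summary of what one "Remove NSX" run is described to do."""
    forced: bool                 # TN config is removed from NSX Manager forcefully
    uninstalls_vibs: bool        # VIBs are de-installed (only if the node is reachable)
    fails_on_attached_vms: bool  # an error is thrown while VMs still use the N-VDS


def remove_nsx_outcome(uninstall_components: bool, force_delete: bool) -> RemoveNsxOutcome:
    """Map the two checkboxes to the behaviour described in the text above."""
    return RemoveNsxOutcome(
        forced=force_delete,
        uninstalls_vibs=uninstall_components,
        # Per the description, only the non-forced variants stop on attached VMs.
        fails_on_attached_vms=not force_delete,
    )


if __name__ == "__main__":
    for uninstall in (False, True):
        for force in (False, True):
            print(f"uninstall={uninstall!s:5} force={force!s:5} ->",
                  remove_nsx_outcome(uninstall, force))
```

Read this way, “Force Delete” decides how NSX Manager treats its own copy of the TN configuration, while “Uninstall NSX Components” decides whether the VIBs on a reachable node are touched.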
Transport Node Profiles
When a “Transport Node Profile” is attached to the cluster the faulty node resides in, the “Remove NSX” option is not available for that specific node, but only for the whole cluster. In this case detach the TN profile from the cluster in the “Actions” menu.
When detaching the “Transport Node Profile” from a cluster it should have no impact on data plane traffic within the cluster. Do not confuse the Detach option with the “Remove NSX” option!
Using NSXCLI
To remove a faulty TN, follow the general steps in the NSX-T 2.2 Install guide, which mostly still apply to version 2.5. If all the NSX-T modules are still present on the node, this method should be preferred over using the ESXi native tools in the next section.
On the NSX Manager get the thumbprint.
manager> get certificate api thumbprint
Detach the Transport Node from NSX Manager if possible.
[root@esxi-01a:~] nsxcli
host> detach management-plane <MANAGER> username <ADMIN-USER> password <ADMIN-PASSWORD> thumbprint <MANAGER-THUMBPRINT>
Remove NSX-T filters. If the command is run without the -Override switch an error is displayed first.
[root@esxi-01a:~] vsipioctl clearallfilters
ERROR: Command clearallfilters is dangerous and can cause unintended consequence!
ERROR: Please supply first option -Override in order to override the safety guard and actually run the command.
[root@esxi-01a:~] vsipioctl clearallfilters -Override
Removing all vmware-sfw filters...
Cleared dvfilter include table
Updated all VMs to remove filters.
Destroyed all filters (please ignore 'Function not implemented' error if there is).
Stop the netcpa service.
[root@esxi-01a:~] /etc/init.d/netcpad stop
Remove all NSX-T VIBs and the relevant config, and reboot the host afterwards. In NSX-T 2.4 the “del nsx” command is executed without warning, in contrast to 2.5, which displays the warning below.
[root@esxi-01a:~] nsxcli
esxi-01a> del nsx
WARNING: Use this command as last resort ONLY when deleting through NSXT UI or API is not working! Please read documentation for 'Remove a Host from NSX-T Data Center or Uninstall NSX-T Data Center Completely' before executing this command.
Are you sure you want to delete NSX environment on host? (yes/no)
If the faulty node is still in a configured state in NSX Manager, perform the “Remove NSX” option (described in the previous section) with the “Force Delete” option selected. This will clean up its configuration in NSX Manager.
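The NSXCLI sequence above can be kept at hand as an ordered checklist. The sketch below only builds the command strings taken from this section and prints them; it does not execute anything. The function name `nsxcli_cleanup_steps` and the example manager address are hypothetical.

```python
from typing import List, Tuple


def nsxcli_cleanup_steps(manager: str, admin_user: str, thumbprint: str) -> List[Tuple[str, str]]:
    """Return (where to run it, command) pairs in the order used in this post.

    Nothing is executed here. The password stays a placeholder on purpose so it
    never ends up in a script or in shell history.
    """
    return [
        ("NSX Manager CLI", "get certificate api thumbprint"),
        ("nsxcli on the host",
         f"detach management-plane {manager} username {admin_user} "
         f"password <ADMIN-PASSWORD> thumbprint {thumbprint}"),
        ("ESXi shell", "vsipioctl clearallfilters -Override"),
        ("ESXi shell", "/etc/init.d/netcpad stop"),
        ("nsxcli on the host", "del nsx"),
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address, not a real NSX Manager.
    steps = nsxcli_cleanup_steps("192.0.2.10", "admin", "<MANAGER-THUMBPRINT>")
    for number, (where, command) in enumerate(steps, start=1):
        print(f"{number}. [{where}] {command}")
```

After the last step, reboot the host as described above.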
Using ESXi native tools
When none of the above options is possible for whatever reason, a manual cleanup may be needed. A manual cleanup can be performed using standard ESXi based tools if the ESXi bootbank is still okay.
The config that is normally removed by NSX Manager or NSXCLI (besides the VIBs) is:
- The NSX VMkernel ports (vxlan and hyperbus)
- The NSX network IO filters
- The NSX IP stacks (vxlan and hyperbus)
So, how do you perform a manual cleanup? First check whether the network IO filters can be removed. See the previous section for how to perform this action.
Secondly, delete the NSX VMkernel ports. Normally vmk10 is the vxlan kernel port (despite the name it's actually the GENEVE overlay) and vmk50 is the hyperbus kernel port.
You might wonder what the hyperbus is. The hyperbus is the component that performs the actual network auto-plumbing. In the image below this is the Intra T0/T1 routing (#4 in the image) between the Distributed Router (DR) and Service Router (SR) components (169.254.0.0 subnet) and the Inter T0/T1 routing (#3 in the image) between T1 SR and T0 DR (100.64.0.0 subnet).
List the kernel ports.
[root@esxi-01a:~] esxcli network ip interface list
…
vmk10
   Name: vmk10
   MAC Address: 00:50:56:ab:cd:ef
   Enabled: true
   Portset: DvsPortset-1
   Portgroup: N/A
   Netstack Instance: vxlan
   VDS Name: ndvSwitch0
   VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
   VDS Port: 10
   VDS Connection: 1543783261
   Opaque Network ID: N/A
   Opaque Network Type: N/A
   External ID: N/A
   MTU: 1600
   TSO MSS: 65535
   RXDispQueue Size: 1
   Port ID: 67108870
vmk50
   Name: vmk50
   MAC Address: 00:50:56:ab:cd:ef
   Enabled: true
   Portset: DvsPortset-1
   Portgroup: N/A
   Netstack Instance: hyperbus
   VDS Name: ndvSwitch0
   VDS UUID: 19 f4 3b 76 1a 2b 4f 88-b2 bd 67 b3 86 11 49 2d
   VDS Port: 0d02d084-213c-491e-9bb5-1ab895e98b5d
   VDS Connection: 1543783261
   Opaque Network ID: N/A
   Opaque Network Type: N/A
   External ID: N/A
   MTU: 1500
   TSO MSS: 65535
   RXDispQueue Size: 1
   Port ID: 67108871
Remove the NSX VMkernel ports
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk10
[root@esxi-01a:~] esxcli network ip interface remove --interface-name=vmk50
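The vmk numbers are "normally" 10 and 50, so it can be worth deriving them from the interface listing rather than assuming them. Below is a minimal sketch that reads saved output of `esxcli network ip interface list` and picks the ports whose netstack is vxlan or hyperbus. The helper names and the shortened sample (including the vmk0 entry) are illustrative assumptions; it only prints the remove commands.

```python
from typing import List

# Netstack names used by NSX-T on the host, as listed in this post.
NSX_NETSTACKS = {"vxlan", "hyperbus"}


def nsx_vmkernel_ports(interface_list_output: str) -> List[str]:
    """Return VMkernel port names whose 'Netstack Instance' belongs to NSX."""
    ports = []
    current_name = None
    for raw_line in interface_list_output.splitlines():
        key, separator, value = raw_line.strip().partition(":")
        if not separator:
            continue  # lines such as "vmk10" or "…" carry no key/value pair
        key, value = key.strip(), value.strip()
        if key == "Name":
            current_name = value
        elif key == "Netstack Instance" and current_name and value in NSX_NETSTACKS:
            ports.append(current_name)
    return ports


def interface_remove_commands(ports: List[str]) -> List[str]:
    """Build (but do not run) the esxcli remove command for each port."""
    return [f"esxcli network ip interface remove --interface-name={port}" for port in ports]


# Shortened, hypothetical sample with the same layout as the listing above.
SAMPLE_OUTPUT = """\
vmk0
   Name: vmk0
   MAC Address: 00:50:56:ab:cd:ef
   Netstack Instance: defaultTcpipStack
vmk10
   Name: vmk10
   Netstack Instance: vxlan
vmk50
   Name: vmk50
   Netstack Instance: hyperbus
"""

if __name__ == "__main__":
    for command in interface_remove_commands(nsx_vmkernel_ports(SAMPLE_OUTPUT)):
        print(command)
```

Matching on the netstack name instead of the port number keeps the management port (vmk0 on the default stack in the sample) out of the result.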
Now list the netstacks
[root@esxi-01a:~] esxcli network ip netstack list
defaultTcpipStack
   Key: defaultTcpipStack
   Name: defaultTcpipStack
   State: 4660
vxlan
   Key: vxlan
   Name: vxlan
   State: 4660
hyperbus
   Key: hyperbus
   Name: hyperbus
   State: 4660
Remove the NSX related netstacks
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=vxlan
[root@esxi-01a:~] esxcli network ip netstack remove --netstack=hyperbus
Now check if the N-VDS still exists. If so, it affects the removal of the actual VIBs later in this section.
[root@esxi-01a:~] esxcfg-vswitch -l
The last step is to remove all the NSX related VIBs that still exist on the node. You cannot remove the VIBs in an arbitrary order, since they depend on each other.
First list the VIBs to be removed from the node. The full list of VIBs normally present on an ESXi based node is shown below.
[root@host:~] esxcli software vib list | grep nsx
nsx-adf              18.104.22.168.0-6.7.15314402     VMware  VMwareCertified  2020-01-03
nsx-aggservice       22.214.171.124.0-6.7.15314423    VMware  VMwareCertified  2020-01-03
nsx-cli-libs         126.96.36.199.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-common-libs      188.8.131.52.0-6.7.15314375      VMware  VMwareCertified  2020-01-03
nsx-context-mux      184.108.40.206.0esx67-15314456   VMware  VMwareCertified  2020-01-03
nsx-esx-datapath     220.127.116.11.0-6.7.15314311    VMware  VMwareCertified  2020-01-03
nsx-exporter         18.104.22.168.0-6.7.15314423     VMware  VMwareCertified  2020-01-03
nsx-host             22.214.171.124.0-6.7.15314289    VMware  VMwareCertified  2020-01-03
nsx-metrics-libs     126.96.36.199.0-6.7.15314375     VMware  VMwareCertified  2020-01-03
nsx-mpa              188.8.131.52.0-6.7.15314423      VMware  VMwareCertified  2020-01-03
nsx-nestdb-libs      184.108.40.206.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-nestdb           220.127.116.11.0-6.7.15314393    VMware  VMwareCertified  2020-01-03
nsx-netcpa           18.104.22.168.0-6.7.15314440     VMware  VMwareCertified  2020-01-03
nsx-netopa           22.214.171.124.0-6.7.15314363    VMware  VMwareCertified  2020-01-03
nsx-opsagent         126.96.36.199.0-6.7.15314423     VMware  VMwareCertified  2020-01-03
nsx-platform-client  188.8.131.52.0-6.7.15314423      VMware  VMwareCertified  2020-01-03
nsx-profiling-libs   184.108.40.206.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-proxy            220.127.116.11.0-6.7.15314435    VMware  VMwareCertified  2020-01-03
nsx-python-gevent    1.1.0-9273114                    VMware  VMwareCertified  2018-12-02
nsx-python-greenlet  0.4.9-12819723                   VMware  VMwareCertified  2019-09-20
nsx-python-logging   18.104.22.168.0-6.7.15314402     VMware  VMwareCertified  2020-01-03
nsx-python-protobuf  2.6.1-12818951                   VMware  VMwareCertified  2019-09-20
nsx-rpc-libs         22.214.171.124.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-sfhc             126.96.36.199.0-6.7.15314423     VMware  VMwareCertified  2020-01-03
nsx-shared-libs      188.8.131.52.0-6.7.15036308      VMware  VMwareCertified  2020-01-03
nsx-upm-libs         184.108.40.206.0-6.7.15314375    VMware  VMwareCertified  2020-01-03
nsx-vdpi             220.127.116.11.0-6.7.15314422    VMware  VMwareCertified  2020-01-03
nsxcli               18.104.22.168.0-6.7.15314296     VMware  VMwareCertified  2020-01-03
Now remove the VIBs from the node in the correct order. Run the command once for every VIB to be removed. The last VIB, “nsx-esx-datapath”, cannot be removed live if the N-VDS is still present on the node, so for that VIB only the “--no-live-install” switch must be added.
[root@esxi-01a:~] esxcli software vib remove -n <vib name below>
nsx-host
nsx-adf
nsx-exporter
nsx-aggservice
nsx-platform-client
nsx-python-logging
nsx-opsagent
nsx-proxy
nsx-nestdb
nsx-sfhc
nsx-context-mux
nsx-python-protobuf
nsx-python-greenlet
nsx-python-gevent
nsxcli
nsx-netopa (2.5 only)
nsx-netcpa
nsx-profiling-libs
nsx-mpa
nsx-vdpi
nsx-nestdb-libs
nsx-rpc-libs
nsx-metrics-libs
nsx-upm-libs
nsx-common-libs
nsx-shared-libs
nsx-cli-libs
nsx-esx-datapath --no-live-install
Without the “--no-live-install” switch an error is thrown if an “opaque” portgroup connected to an N-VDS is present on the node.
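Typing 28 removal commands by hand invites mistakes in the order. As a rough sketch, the removal order from this post can be encoded once and filtered against what is actually installed, for example the saved output of `esxcli software vib list | grep nsx`. The helper names are hypothetical, the order is simply the one listed above (not an official dependency graph), and the script only prints commands.

```python
from typing import Iterable, List, Set

# Removal order exactly as listed in this post (nsx-netopa exists on 2.5 only).
REMOVAL_ORDER = [
    "nsx-host", "nsx-adf", "nsx-exporter", "nsx-aggservice",
    "nsx-platform-client", "nsx-python-logging", "nsx-opsagent", "nsx-proxy",
    "nsx-nestdb", "nsx-sfhc", "nsx-context-mux", "nsx-python-protobuf",
    "nsx-python-greenlet", "nsx-python-gevent", "nsxcli", "nsx-netopa",
    "nsx-netcpa", "nsx-profiling-libs", "nsx-mpa", "nsx-vdpi",
    "nsx-nestdb-libs", "nsx-rpc-libs", "nsx-metrics-libs", "nsx-upm-libs",
    "nsx-common-libs", "nsx-shared-libs", "nsx-cli-libs", "nsx-esx-datapath",
]

# VIBs that get the --no-live-install switch, per the text above.
NO_LIVE_INSTALL = {"nsx-esx-datapath"}


def parse_installed_vibs(vib_list_output: str) -> Set[str]:
    """Take the first column (the VIB name) of every non-empty line."""
    return {line.split()[0] for line in vib_list_output.splitlines() if line.strip()}


def vib_removal_commands(installed: Iterable[str]) -> List[str]:
    """Build removal commands, in order, for the VIBs that are installed."""
    present = set(installed)
    commands = []
    for name in REMOVAL_ORDER:
        if name not in present:
            continue  # e.g. nsx-netopa is absent on a 2.4 node
        command = f"esxcli software vib remove -n {name}"
        if name in NO_LIVE_INSTALL:
            command += " --no-live-install"
        commands.append(command)
    return commands


def unlisted_vibs(installed: Iterable[str]) -> List[str]:
    """Installed VIBs missing from REMOVAL_ORDER; check these by hand."""
    return sorted(set(installed) - set(REMOVAL_ORDER))


if __name__ == "__main__":
    sample = "nsx-host 1 VMware\nnsx-esx-datapath 1 VMware\nnsxcli 1 VMware\n"
    for line in vib_removal_commands(parse_installed_vibs(sample)):
        print(line)
```

Anything reported by `unlisted_vibs` is a sign that the node runs a version with different modules, in which case the order above should not be trusted blindly.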
Hopefully this post shows you which options are available to get ESXi based Transport Nodes back to a working state when things go bad. This post is not intended to troubleshoot or fix TN related issues, but rather to show how to clean a node up and start with a fresh NSX-T config and VIBs without re-installing the node.
To wrap it all up, I would like to thank Manny and Wesley, the writers of the articles I based mine upon. Lastly, beware of the consequences of performing one of the above options in a live environment. Contact VMware support if you are not sure what the impact may be.