Migration from Collapsed to Dedicated NSX Edge Cluster

During a recent VMware PSO NSX engagement, one of our customers received new hardware and decided to migrate their ESG edges and DLR control VMs to a dedicated cluster. In this post, I want to share my experience finding the easiest and most straightforward way to perform that migration seamlessly and with minimal downtime.

Current Situation

Management Cluster: vCenter, NSX Manager, NSX Controllers, DLR Control VMs.
Compute/Edge Collapsed Cluster: Compute VMs, ESG edges.

ESG edges are not deployed in the management cluster since this cluster is not prepared for NSX.

New Situation

Management Cluster: vCenter, NSX Manager, NSX Controllers.
Compute Cluster: Compute VMs.
Edge Cluster: ESG edges, DLR Control VMs.

The goals of the new design are:

Dedicate compute resources to edge workloads, which are CPU-intensive, so that they do not affect business applications running on the Compute cluster.
Provide a scalable edge architecture and hence more bandwidth for North-South traffic.

In this post, I will focus on the actual migration of the ESGs and DLR control VMs, assuming that the new edge cluster is already prepared at the vSphere layer.

In the current topology, two ESG appliances run in ECMP mode. Each ESG has two uplink interfaces connected to two VLAN-backed portgroups, so each ESG peers with two upstream neighbors for redundancy.

Below are the different approaches that can be used to accomplish this migration:

Approach 1

Because the ESG edges run in Active-Active (ECMP) mode, we could delete one ESG from the collapsed cluster, then deploy and configure a replacement on the new dedicated edge cluster while all traffic flows through the remaining ESG. However, the customer did not want to redo any configuration; they wanted to migrate the ESG workloads as they are.
Simply vMotioning the VMs to the new cluster is not enough either: the appliance configuration still points at the collapsed cluster, so a later “Redeploy” would put the ESG edges and DLR control VMs back in their current location.

Approach 2

Here we take advantage of the fact that the Edges are effectively stateless VMs that are highly transportable and have their configuration stored centrally for easy redeployment and recovery. The following approach was used to migrate the DLR control VM and the ESG edges without recreating or reconfiguring any VM:

Disconnect one ESG uplink interface and remove the “EXTERNAL VLAN B” portgroup from its “connected to” field. Do this on both ESG edges.
Disconnect the second uplink interface of ESG1 and remove the “EXTERNAL VLAN A” portgroup from its “connected to” field. Do this only on the first edge to be migrated.
Remove the “EXTERNAL VLAN B” portgroup from the collapsed cluster VDS. ESG2 is now left with a single BGP peer to the outside, via the “EXTERNAL VLAN A” portgroup.
Create “EXTERNAL VLAN B” portgroup on the new dedicated edge cluster VDS switch.
Migrate one ESG at a time. Relocate ESG1 to the dedicated cluster by changing the cluster and datastore in the NSX edge appliance configuration. This triggers an edge redeployment to the dedicated cluster. Importantly, any future “Redeploy” operation will also place the ESG in the dedicated cluster. Once ESG1 is redeployed, connect its uplink interface to the “EXTERNAL VLAN B” portgroup.
Disconnect the second uplink interface of ESG2 and remove the “EXTERNAL VLAN A” portgroup from its “connected to” field.
Remove the “EXTERNAL VLAN A” portgroup from the collapsed cluster VDS and create it on the new dedicated edge cluster VDS.
Following the same concept, migrate ESG2 by changing the cluster and datastore in the NSX edge appliance configuration.
Connect the ESG2 uplink interfaces to both the “EXTERNAL VLAN A” and “EXTERNAL VLAN B” portgroups. Connect the second uplink interface of ESG1 to the “EXTERNAL VLAN A” portgroup.
Make sure each ESG has two BGP peerings established with the upstream routers.
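
To sanity-check the ordering of the steps above, here is a minimal Python sketch that replays them against a toy model and records how many usable uplinks each ESG has after every step. It is only an illustration: the `Esg` class, the `simulate_migration` function, and the short names "VLAN A" and "VLAN B" are hypothetical, and a usable uplink is treated as a stand-in for an established BGP peering. It does not model the time an edge spends redeploying.

```python
from dataclasses import dataclass, field


@dataclass
class Esg:
    cluster: str                      # "collapsed" or "edge"
    uplinks: set = field(default_factory=set)  # portgroups the uplinks are connected to


def simulate_migration():
    """Replay the migration steps and record usable uplinks per ESG after each one."""
    esgs = {
        "ESG1": Esg("collapsed", {"VLAN A", "VLAN B"}),
        "ESG2": Esg("collapsed", {"VLAN A", "VLAN B"}),
    }
    # Portgroups that currently exist on each cluster's VDS.
    portgroups = {"collapsed": {"VLAN A", "VLAN B"}, "edge": set()}
    history = []

    def record(step):
        # An uplink is usable only if its portgroup exists on the ESG's current cluster.
        live = {n: len(e.uplinks & portgroups[e.cluster]) for n, e in esgs.items()}
        history.append((step, live))

    record("initial state")
    esgs["ESG1"].uplinks.discard("VLAN B")
    esgs["ESG2"].uplinks.discard("VLAN B")
    record("disconnect VLAN B uplink on both ESGs")
    esgs["ESG1"].uplinks.discard("VLAN A")
    record("disconnect VLAN A uplink on ESG1")
    portgroups["collapsed"].discard("VLAN B")
    portgroups["edge"].add("VLAN B")
    record("move VLAN B portgroup to the edge cluster VDS")
    esgs["ESG1"].cluster = "edge"
    esgs["ESG1"].uplinks.add("VLAN B")
    record("relocate ESG1 and connect it to VLAN B")
    esgs["ESG2"].uplinks.discard("VLAN A")
    record("disconnect VLAN A uplink on ESG2")
    portgroups["collapsed"].discard("VLAN A")
    portgroups["edge"].add("VLAN A")
    record("move VLAN A portgroup to the edge cluster VDS")
    esgs["ESG2"].cluster = "edge"
    record("relocate ESG2")
    esgs["ESG2"].uplinks.update({"VLAN A", "VLAN B"})
    esgs["ESG1"].uplinks.add("VLAN A")
    record("reconnect the remaining uplinks")
    return history


if __name__ == "__main__":
    for step, live in simulate_migration():
        print(f"{step}: {live}")
```

In this model, at least one ESG keeps a usable uplink after every step, which is the property that keeps North-South traffic flowing while each edge is moved.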

For the DLR control VM, follow the same approach: modify the cluster and datastore in its configuration to trigger a redeploy of the VM to the new dedicated edge cluster.
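
The cluster and datastore change can also be scripted. The sketch below only builds a request body and does not send anything. It assumes the NSX for vSphere REST API exposes the appliance placement at `PUT /api/4.0/edges/{edgeId}/appliances` and that the body uses `resourcePoolId` and `datastoreId` elements; that is my recollection of the API, so please verify it against the API guide for your NSX version. The function name and the managed object IDs (`domain-c26`, `datastore-29`) are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_appliance_payload(resource_pool_id, datastore_id, size):
    """Build an XML body describing where a single edge appliance should run.

    Element names are an assumption based on the NSX-v API guide; check them
    against your version before using this against a real NSX Manager.
    """
    root = ET.Element("appliances")
    ET.SubElement(root, "applianceSize").text = size
    appliance = ET.SubElement(root, "appliance")
    # Cluster (or resource pool) and datastore of the new dedicated edge cluster.
    ET.SubElement(appliance, "resourcePoolId").text = resource_pool_id
    ET.SubElement(appliance, "datastoreId").text = datastore_id
    return ET.tostring(root, encoding="unicode")


# Hypothetical IDs; look up the real ones in your vCenter inventory.
payload = build_appliance_payload("domain-c26", "datastore-29", "large")
print(payload)
```

In practice it is safer to GET the current appliance configuration first and change only the placement fields, so that settings this sketch does not cover are preserved.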

As a final note, it is highly recommended to create DRS anti-affinity rules so that the active ESG edges and the active DLR control VM do not run on the same host. This limits the impact if one host in the edge cluster goes down.
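
As a small illustration of what such a rule enforces, the following Python sketch checks a VM-to-host placement for anti-affinity violations. It does not talk to vCenter; the function name, VM names, and host names are hypothetical, and in a real environment the placement would come from the vSphere inventory.

```python
from collections import defaultdict


def anti_affinity_violations(placement, rule_members):
    """Return {host: [vms]} for every host running more than one VM from the rule."""
    by_host = defaultdict(list)
    for vm in rule_members:
        by_host[placement[vm]].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical placement in the new edge cluster.
placement = {
    "ESG1": "edge-host-1",
    "ESG2": "edge-host-2",
    "DLR-control-VM": "edge-host-2",
}
print(anti_affinity_violations(placement, ["ESG1", "ESG2", "DLR-control-VM"]))
```

An empty result means every VM covered by the rule sits on its own host.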

Hope this post is informative,

Mohamad Alhussein