VCF / NSX-T Edge Node Resize
This blog post covers how to resize your NSX-T edge nodes when they are used in a VCF environment.
You may find that your existing edge nodes are reaching their throughput limits, or you may need larger nodes to enable more CPU-heavy features such as L7 load balancing or Tanzu support.
Edge Sizing Guide
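As a rough aid when choosing a target size, the sketch below totals the vCPU and memory that a set of edge nodes would need. The per-node figures are the commonly documented NSX-T 3.x edge VM sizes and are an assumption here, so check the sizing guide for your release. `EDGE_SIZES` and `cluster_footprint` are hypothetical names, not part of any VMware tool.

```python
from typing import NamedTuple


class EdgeSize(NamedTuple):
    vcpu: int
    memory_mb: int


# Assumed per-node figures (NSX-T 3.x edge VM form factors); verify them
# against the sizing guide for your release before relying on them.
EDGE_SIZES = {
    "SMALL": EdgeSize(vcpu=2, memory_mb=4 * 1024),
    "MEDIUM": EdgeSize(vcpu=4, memory_mb=8 * 1024),
    "LARGE": EdgeSize(vcpu=8, memory_mb=32 * 1024),
    "XLARGE": EdgeSize(vcpu=16, memory_mb=64 * 1024),
}


def cluster_footprint(form_factor: str, node_count: int) -> EdgeSize:
    """Return the total vCPU and memory (MB) for node_count edge nodes."""
    size = EDGE_SIZES[form_factor.upper()]
    return EdgeSize(size.vcpu * node_count, size.memory_mb * node_count)


if __name__ == "__main__":
    for name in EDGE_SIZES:
        total = cluster_footprint(name, node_count=2)
        print(f"2 x {name}: {total.vcpu} vCPU, {total.memory_mb} MB RAM")
```

For two SMALL nodes this gives 8192 MB of memory, which matches the RAM figure the resizer tool reports in its resource pool check in the logs further down.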
Simply adding CPU or memory to an existing edge node is not supported; you must redeploy the edge nodes at the new size. Luckily, a tool is provided to do this.
Edge Node Resizing Tool
In this example, I accidentally deployed my edge nodes with the “LARGE” form factor. You’ll probably want that in production environments, but since this lab is low on available resources, I’m going to resize them to “SMALL”.
vcf@sddc-manager [ ~/resizer ]$ ./resize.sh --edge-cluster EC-01 --user administrator@vsphere.local --password VMware123! --form-factor SMALL
VCF Edge node resizer tool, version 0.7
Logging to /home/vcf/resizer/resizer/edge_node_resizer_2023-06-30T01:15:43.log
Resizing Edge nodes in Edge cluster EC-01 to form factor SMALL
The Edge Node Resizer tool takes Edge nodes offline one at a time in order
to recreate them with a specified form factor. This means that
* Each Edge node's Tier-0 interfaces temporarily go offline during resizing
* Tier-1 router services relocate from one Edge node to another
This may lead to temporary network traffic interruptions during the resize.
The full resize operation may take as long as it did to originally create
and (if requested) expand the Edge cluster.
Do you wish to proceed (y/n)? y
Run confirmation accepted by user.
Getting credentials from SDDC Manager..
Getting Bearer token
count of WLDs supported by our NSX-T cluster: 1
workload_name = mgmt-domain
Credential retrieval completed.
Connection established to vCenter at 10.0.0.12
edge cluster EC-01: 5d169fc3-474e-42a6-b8e1-75fd265c73d3
Refreshing NSX view of Edge node id fa7f3afe-b3b9-4b8e-83f3-91ce36608322
Found vSphere rules for VM edge1-mgmt:
VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Edge node VM edge1-mgmt is in 0 VM groups for cluster mgmt-cluster-01
Refreshing NSX view of Edge node id 64749ef7-d464-48b6-9fdd-f3e8efffc9bf
Found vSphere rules for VM edge2-mgmt:
VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Edge node VM edge2-mgmt is in 0 VM groups for cluster mgmt-cluster-01
For Edge cluster EC-01,
Edge node edge1-mgmt (10.0.0.23) has form factor LARGE
Edge node edge2-mgmt (10.0.0.24) has form factor LARGE
Loading Edge cluster config info from /home/vcf/.vcf-edge-redeploy/EC-01.json
Marking Edge cluster EC-01 cache with operation-in-progress = True
Loading Edge cluster config info from /home/vcf/.vcf-edge-redeploy/EC-01.json
Check that 2 x SMALL Edge node VMs fit in cluster mgmt-cluster-01's resource pool EC-01
Resource pool has 0 CPU and 0 RAM.
After resize, pool's Edge nodes need 4000 CPU and 8192 RAM
Resizing resource pool EC-01
posting to url: https://10.0.0.20/api/v1/transport-nodes/fa7f3afe-b3b9-4b8e-83f3-91ce36608322?action=redeploy
resp.status_code = 200
EN VM moid: start=vm-37, cur=vm-37
tnState: ndsState=NODE_READY, outerState=in_progress
EN VM moid: start=vm-37, cur=None
tnState: ndsState=VM_DEPLOYMENT_RESTARTED, outerState=pending
EN VM moid: start=vm-37, cur=None
tnState: ndsState=REGISTRATION_PENDING, outerState=pending
EN VM moid: start=vm-37, cur=None
tnState: ndsState=NODE_NOT_READY, outerState=pending
tnState: ndsState=NODE_READY, outerState=failed
EN VM moid: start=vm-37, cur=vm-6081
tnState: ndsState=NODE_READY, outerState=failed
EN VM moid: start=vm-37, cur=vm-6081
tnState: ndsState=NODE_READY, outerState=in_progress
EN VM moid: start=vm-37, cur=vm-6081
………
Redeployment successful for Edge node edge1-mgmt
Waited 1254 seconds, or 21 minutes, for redeploy of Edge node edge1-mgmt
AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e still exists, updating it..
Re-added edge1-mgmt to AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Updating known_hosts entry for edge1-mgmt.vcf.sddc.lab (fa7f3afe-b3b9-4b8e-83f3-91ce36608322)
Freshen VCF known_hosts key for edge1-mgmt.vcf.sddc.lab
Temporarily enabling ssh to edge1-mgmt.vcf.sddc.lab
* posting to url https://10.0.0.20/api/v1/transport-nodes/fa7f3afe-b3b9-4b8e-83f3-91ce36608322/node/services/ssh
Get https://10.0.0.20/api/v1/transport-nodes/fa7f3afe-b3b9-4b8e-83f3-91ce36608322/node/services/ssh/status
ssh runtime_state: running
Freshen known_hosts key for edge1-mgmt.vcf.sddc.lab
# edge1-mgmt.vcf.sddc.lab:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
Current ssh key obtained for edge1-mgmt.vcf.sddc.lab
dropping old edge1-mgmt.vcf.sddc.lab, key type ssh-rsa
Ran post, result = {}
Re-disabling ssh to edge1-mgmt.vcf.sddc.lab
Traceback (most recent call last):
  File "./resize.py", line 2233, in <module>
    redeployer.process()
  File "./resize.py", line 1955, in process
    self.resize_edge_nodes()
  File "./resize.py", line 533, in resize_edge_nodes
    self._do_requested_resize()
  File "./resize.py", line 558, in _do_requested_resize
    if not self._resize_edge_node(enInfo, doRollback=False):
  File "./resize.py", line 647, in _resize_edge_node
    self._update_edge_node_host_ssh_key(enInfo)
  File "./resize.py", line 831, in _update_edge_node_host_ssh_key
    dryrun=self.isDryRun())
  File "/home/vcf/resizer/vcf_utils/edge_node_vcf_known_hosts_util.py", line 198, in freshenEdgeNodeInVcfKnownHosts
    absUrl = self._setTnSshState(edgeNodeNsxId, False)
  File "/home/vcf/resizer/vcf_utils/edge_node_vcf_known_hosts_util.py", line 108, in _setTnSshState
    state = self._getTnSshStatus(edgeNodeNsxId)
  File "/home/vcf/resizer/vcf_utils/edge_node_vcf_known_hosts_util.py", line 81, in _getTnSshStatus
    resp.raise_for_status()
  File "/usr/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: for url: https://10.0.0.20/api/v1/transport-nodes/fa7f3afe-b3b9-4b8e-83f3-91ce36608322/node/services/ssh/status
vcf@sddc-manager [ ~/resizer ]$
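The first node was replaced successfully, but the run then failed: the NSX API returned an HTTP 500 when the tool checked the SSH service status while re-disabling SSH on edge1-mgmt. It’s likely just a timeout because my lab is a little slow. Let’s try again.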
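If transient errors like this are common in a slow lab, one option is to retry the failing call a few times before giving up. The sketch below is a generic retry-with-backoff helper, not code from the resizer tool; `retry` and the flaky `check_ssh_status` stand-in are hypothetical names used only for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 4, base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on an exception wait base_delay * 2**n, then try again.

    The last exception is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    # Hypothetical stand-in for a status request that fails twice with a
    # server error and then succeeds.
    responses = iter([RuntimeError("500"), RuntimeError("500"), "stopped"])

    def check_ssh_status() -> str:
        item = next(responses)
        if isinstance(item, Exception):
            raise item
        return item

    print(retry(check_ssh_status, base_delay=0.1))
```

In this case no extra retry logic was needed, because the tool can simply be run again with the same arguments: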
vcf@sddc-manager [ ~/resizer ]$ ./resize.sh --edge-cluster EC-01 --user administrator@vsphere.local --password VMware123! --form-factor SMALL
VCF Edge node resizer tool, version 0.7
Logging to /home/vcf/resizer/resizer/edge_node_resizer_2023-06-30T01:40:41.log
Resizing Edge nodes in Edge cluster EC-01 to form factor SMALL
The Edge Node Resizer tool takes Edge nodes offline one at a time in order
to recreate them with a specified form factor. This means that
* Each Edge node's Tier-0 interfaces temporarily go offline during resizing
* Tier-1 router services relocate from one Edge node to another
This may lead to temporary network traffic interruptions during the resize.
The full resize operation may take as long as it did to originally create
and (if requested) expand the Edge cluster.
Do you wish to proceed (y/n)? y
Run confirmation accepted by user.
Getting credentials from SDDC Manager..
Getting Bearer token
count of WLDs supported by our NSX-T cluster: 1
workload_name = mgmt-domain
Credential retrieval completed.
Connection established to vCenter at 10.0.0.12
edge cluster EC-01: 5d169fc3-474e-42a6-b8e1-75fd265c73d3
Refreshing NSX view of Edge node id fa7f3afe-b3b9-4b8e-83f3-91ce36608322
Found vSphere rules for VM edge1-mgmt:
VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Edge node VM edge1-mgmt is in 0 VM groups for cluster mgmt-cluster-01
Refreshing NSX view of Edge node id 64749ef7-d464-48b6-9fdd-f3e8efffc9bf
Found vSphere rules for VM edge2-mgmt:
VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Edge node VM edge2-mgmt is in 0 VM groups for cluster mgmt-cluster-01
For Edge cluster EC-01,
Edge node edge1-mgmt (10.0.0.23) has form factor SMALL
Edge node edge2-mgmt (10.0.0.24) has form factor LARGE
Loading Edge cluster config info from /home/vcf/.vcf-edge-redeploy/EC-01.json
Existing cache for EC-01 shows a redeploy operation still in progress, so not refreshing cache from live configuration now.
Marking Edge cluster EC-01 cache with operation-in-progress = True
Loading Edge cluster config info from /home/vcf/.vcf-edge-redeploy/EC-01.json
Check that 2 x SMALL Edge node VMs fit in cluster mgmt-cluster-01's resource pool EC-01
Resource pool has 4000 CPU and 8192 RAM.
After resize, pool's Edge nodes need 4000 CPU and 8192 RAM
Resource pool is large enough: no resize needed for EC-01
Edge node edge1-mgmt already has desired form-factor of small, not resizing it.
Updating known_hosts entry for edge1-mgmt.vcf.sddc.lab (fa7f3afe-b3b9-4b8e-83f3-91ce36608322)
Freshen VCF known_hosts key for edge1-mgmt.vcf.sddc.lab
Freshen known_hosts key for edge1-mgmt.vcf.sddc.lab
# edge1-mgmt.vcf.sddc.lab:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
Current ssh key obtained for edge1-mgmt.vcf.sddc.lab
dropping old edge1-mgmt.vcf.sddc.lab, key type ssh-rsa
Ran post, result = {}
AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e still exists, updating it..
Re-added edge1-mgmt to AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Updating known_hosts entry for edge1-mgmt.vcf.sddc.lab (fa7f3afe-b3b9-4b8e-83f3-91ce36608322)
Freshen VCF known_hosts key for edge1-mgmt.vcf.sddc.lab
Freshen known_hosts key for edge1-mgmt.vcf.sddc.lab
# edge1-mgmt.vcf.sddc.lab:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
Current ssh key obtained for edge1-mgmt.vcf.sddc.lab
dropping old edge1-mgmt.vcf.sddc.lab, key type ssh-rsa
Ran post, result = {}
posting to url: https://10.0.0.20/api/v1/transport-nodes/64749ef7-d464-48b6-9fdd-f3e8efffc9bf?action=redeploy
resp.status_code = 200
EN VM moid: start=vm-39, cur=vm-39
tnState: ndsState=NODE_READY, outerState=in_progress
EN VM moid: start=vm-39, cur=None
tnState: ndsState=VM_DEPLOYMENT_RESTARTED, outerState=pending
EN VM moid: start=vm-39, cur=None
tnState: ndsState=VM_DEPLOYMENT_IN_PROGRESS, outerState=pending
tnState: ndsState=REGISTRATION_PENDING, outerState=pending
EN VM moid: start=vm-39, cur=None
tnState: ndsState=NODE_NOT_READY, outerState=pending
EN VM moid: start=vm-39, cur=vm-6082
tnState: ndsState=NODE_READY, outerState=in_progress
EN VM moid: start=vm-39, cur=vm-6082
………………
Redeployment successful for Edge node edge2-mgmt
Waited 1317 seconds, or 22 minutes, for redeploy of Edge node edge2-mgmt
AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e still exists, updating it..
Re-added edge2-mgmt to AA rule VCF-edge_EC-01_antiAffinity_b38f42b9beb851202facc2bcc7cd6d7e
Updating known_hosts entry for edge2-mgmt.vcf.sddc.lab (64749ef7-d464-48b6-9fdd-f3e8efffc9bf)
Freshen VCF known_hosts key for edge2-mgmt.vcf.sddc.lab
Temporarily enabling ssh to edge2-mgmt.vcf.sddc.lab
* posting to url https://10.0.0.20/api/v1/transport-nodes/64749ef7-d464-48b6-9fdd-f3e8efffc9bf/node/services/ssh
Get https://10.0.0.20/api/v1/transport-nodes/64749ef7-d464-48b6-9fdd-f3e8efffc9bf/node/services/ssh/status
ssh runtime_state: running
Freshen known_hosts key for edge2-mgmt.vcf.sddc.lab
# edge2-mgmt.vcf.sddc.lab:22 SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
Current ssh key obtained for edge2-mgmt.vcf.sddc.lab
dropping old edge2-mgmt.vcf.sddc.lab, key type ssh-rsa
Ran post, result = {}
Re-disabling ssh to edge2-mgmt.vcf.sddc.lab
* posting to url https://10.0.0.20/api/v1/transport-nodes/64749ef7-d464-48b6-9fdd-f3e8efffc9bf/node/services/ssh
Get https://10.0.0.20/api/v1/transport-nodes/64749ef7-d464-48b6-9fdd-f3e8efffc9bf/node/services/ssh/status
ssh runtime_state: stopped
Resize of Edge cluster EC-01 nodes completed.
Marking Edge cluster EC-01 cache with operation-in-progress = False
Loading Edge cluster config info from /home/vcf/.vcf-edge-redeploy/EC-01.json
Total run time: 0:25:03, or 1503 seconds
Log written to /home/vcf/resizer/resizer/edge_node_resizer_2023-06-30T01:40:41.log
vcf@sddc-manager [ ~/resizer ]$
When we ran the script again, it detected that a resize operation was still in progress and picked up where it left off: it skipped edge1-mgmt, which already had the SMALL form factor, and redeployed edge2-mgmt. This time it completed successfully, and we now have a fully resized edge cluster!
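The resume behaviour seen in the second run can be pictured roughly as follows. This is a simplified sketch inferred from the log output, not the tool's actual code; `plan_resize`, the node dictionaries, and the `operation_in_progress` key are all hypothetical.

```python
from typing import Dict, List


def plan_resize(nodes: List[Dict[str, str]], target: str,
                cache: Dict[str, bool]) -> List[str]:
    """Return the names of nodes that still need a redeploy to reach target.

    cache stands in for the tool's per-cluster JSON file. If it is already
    flagged as in progress, the run is treated as a resumed one.
    """
    resumed = cache.get("operation_in_progress", False)
    cache["operation_in_progress"] = True
    print("resuming earlier run" if resumed else "starting new run")
    todo = []
    for node in nodes:
        if node["form_factor"].upper() == target.upper():
            print(f"{node['name']} already {target}, skipping")
        else:
            todo.append(node["name"])
    return todo


if __name__ == "__main__":
    # Cluster state after the first, failed run in this post.
    nodes = [
        {"name": "edge1-mgmt", "form_factor": "SMALL"},
        {"name": "edge2-mgmt", "form_factor": "LARGE"},
    ]
    cache = {"operation_in_progress": True}
    print(plan_resize(nodes, "SMALL", cache))
```

Because nodes that already match the target are skipped, running the same command twice is safe in this model.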
Note: It’s recommended to perform this task during a maintenance window, as there will be momentary traffic interruptions.
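The `tnState` lines in the logs show the tool polling each transport node until its redeploy settles, which is where most of the 20 or so minutes per node goes. A minimal sketch of that kind of loop is below. It assumes a caller-supplied `fetch_state` function that returns the node's overall state as a string; `wait_for_redeploy` and its parameters are hypothetical. Treating `success` as the terminal value is also an assumption, since the logs only show `pending`, `in_progress` and `failed`. A short run of `failed` readings is tolerated because the first log shows `failed` briefly before the node recovered.

```python
import time
from typing import Callable


def wait_for_redeploy(fetch_state: Callable[[], str], timeout_s: float = 1800,
                      interval_s: float = 30, failed_grace: int = 3,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the state is 'success'; return False on timeout or failure.

    Up to failed_grace consecutive 'failed' readings are tolerated, since a
    node may report failure briefly while it is still registering.
    """
    deadline = clock() + timeout_s
    failed_seen = 0
    while clock() < deadline:
        state = fetch_state()
        if state == "success":
            return True
        if state == "failed":
            failed_seen += 1
            if failed_seen > failed_grace:
                return False
        else:
            failed_seen = 0
        sleep(interval_s)
    return False
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without waiting in real time.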