vCSA Automated Backup Failure



Recently we have gone through the process of upgrading our Windows vCenter Server 6.0 with external SQL to vCSA 6.5. I must say how good the entire process was from start to finish; VMware have really done themselves proud with that tool. Our environment isn't huge, but it is big enough that we thought we might see problems – but no!

Part of the migration work was to get backups up and running as they were with our Windows vCenter (if not slightly different/better). My understanding is that the supported method for backup is to use the VAMI interface and run a full "file dump" backup of the vCSA, which you can then restore into any blank, freshly deployed vCSA and be back in the game. We have a Rubrik for snapshotting, but using the VMware method is of course supported and preferred.

The Issue

Upon using the VMware-provided Bash script, we encountered the following error in the backup.log file that is produced:

{"type":"com.vmware.vapi.std.errors.unauthenticated","value":{"messages":[{"args":[],"default_message":"Unable to authenticate user","id":"vapi.security.authentication.invalid"}]}}


Further investigation showed additional errors in the VAPI endpoint log.

We could run a manual backup from the VAMI interface as the root user, but not using the Bash script, which essentially uses the VAMI API by curling a request to run a backup. The error above seems related to "authentication_sso.py" being unable to validate the signing chain signature. Without further help, there was no way I was going to modify or poke around in that script on my own on a now-production vCSA.
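For context, the script boils down to a single POST against the appliance recovery API. Here is a rough sketch of the kind of request it makes; the endpoint path and JSON field names are my understanding of the 6.5 appliance management API, and all hostnames, credentials and backup targets below are purely illustrative:

#!/bin/bash
# Illustrative sketch only - endpoint and body are my understanding of the
# 6.5 appliance API; replace hostnames, credentials and paths for your setup.
VCSA="vcsa.example.local"
SSO_USER="administrator@vsphere.local"
SSO_PASS="changeme"
BACKUP_TARGET="ftp://backup.example.local/vcsa/$(date +%Y-%m-%d)"

curl -k -u "${SSO_USER}:${SSO_PASS}" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -X POST "https://${VCSA}/rest/appliance/recovery/backup/job" \
  -d "{
        \"piece\": {
          \"location_type\": \"FTP\",
          \"location\": \"${BACKUP_TARGET}\",
          \"location_user\": \"ftpuser\",
          \"location_password\": \"ftppassword\",
          \"parts\": [\"seat\"],
          \"comment\": \"Scheduled file-level backup\"
        }
      }"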

I also created a separate master user in the @vsphere.local domain to test running the backups, but still had no luck.

I ran the script manually and the problem occurred at the start of the POST to the appliance's REST API.

The Fix

After speaking with several smart people in the vExpert Slack channel, I raised a case with VMware support. I eventually received a response telling me to edit the following file:

There is a value that needed changing from:

To the following:

Be careful with the amendment: the code uses space indentation, and the new line must be indented by exactly 8 spaces.

Then a simple stop and start of the applmgmt service to apply the fix:
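On the appliance Bash shell that is something along these lines (applmgmt being the appliance management / VAMI service name as listed by service-control):

# Stop and start the appliance management (VAMI) service to pick up the change
service-control --stop applmgmt
service-control --start applmgmt
# Confirm it has come back up
service-control --status applmgmt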

Now the script runs perfectly every day to our backup repository. I believe this might become defunct in vSphere 6.7, as I think there is now a GUI way of scheduling backups!

vCSA 6.5 High Availability Configuration Error

Recently I have been experimenting with configuring the built-in vCSA 6.5 HA functionality. Upon reading the documentation found here, I set about the task of configuring a basic HA deployment.

The error I saw upon completing the wizard was:

"A general system error occurred: Failed to run pre-setup".


Unfortunately, there wasn't much to go on in the vCenter logs via the web GUI, so it was time to SSH into the vCSA and go digging around for some logs with a little more information. After a brief meander, I found the following log:

The interesting contents of the log were spat out as follows:

Looking at the log, it seemed that insufficient privileges were given to the user trying to create the vcha user (root!). I then remembered the recent issues that VMware have had with Photon and root passwords expiring after 365 days. I logged into the VAMI for the vCSA and tried to reset the password, but I was given an error.

The fix, in this case, was simply to reset the root user's password via the Bash shell.
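If you only have the appliance console or an SSH session, it goes something like this (assuming you land in the default appliancesh first):

# From the appliancesh, enable and drop into the Bash shell if needed
shell.set --enabled True
shell
# Then reset the expired root password
passwd root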

At this point I was able to log in with the new password, then log in to the VAMI and set the root password to never expire. You can also do it via the command line using the "chage" command on the root user.
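For example (check the current policy first):

# Show the current password ageing settings for root
chage -l root
# Set the maximum password age to -1, i.e. never expire
chage -M -1 root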

After restarting the deployment, the pre-checks ran successfully and the configuration continued!

Hopefully this might help someone who is trying to do something similar!

Migrating ESXi Management VMkernel

I have been doing a fair amount of work with NSX recently. In order to start this work, we had some environment changes to go through first. One of the changes we had to make was to the network that contains the VMkernel for host management traffic. The overall aim was to migrate the interfaces to a new management VLAN (new subnet, gateway, etc.).

Here is how I managed to do it without disruption to any existing management or services running.

1) The first step was to create a port group on my vDS for the new management VLAN that had been trunked to the hosts.




I would advise configuring the port group further for your environment based on VMware network best practices for things like traffic shaping, teaming/failover, etc.

2) Now the port group exists, add a new VMkernel adapter for management traffic on each of your hosts. For me, I ended up with three vmks: old management (vmk0), vMotion (vmk1) and new management (vmk2).
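I did this through the Web Client, but from the host shell you can confirm the new interface and its Management tag with something like the following (vmk2 matches my layout; adjust for yours):

# List all VMkernel interfaces and the port groups they sit on
esxcli network ip interface list
# Check the traffic tags on the new interface and add Management if it is missing
esxcli network ip interface tag get -i vmk2
esxcli network ip interface tag add -i vmk2 -t Management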


3) From here, I put hosts into maintenance mode that I was going to reconfigure, just to be on the safe side.

4) At this point, it isn't possible to remove the existing vmk0 because it is in use. The reason for this is that the host's TCP/IP stack configuration still has the old VMkernel gateway configured. This should be changed to the new management network gateway address on each host:
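This can be done in the Web Client (Host > Configure > TCP/IP configuration) or from the host shell with something like the following (gateway addresses here are illustrative):

# Check the current default gateway on the default TCP/IP stack
esxcli network ip route ipv4 list
# Swap the old default route for the new management gateway
esxcli network ip route ipv4 remove --gateway 10.10.0.1 --network default
esxcli network ip route ipv4 add --gateway 10.20.0.1 --network default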


5) From here, I disconnected the hosts from vCenter.


6) I then changed the DNS host records of my ESXi servers to the new management IP address and allowed some time for propagation (in fact, I checked from the vCSA appliance that it had picked up the newest record from my DNS servers).

7) Reconnect the host(s) back into vCenter.


8) It is now possible to remove the old management VMkernel adapter (vmk0 in my case).
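Either through the Web Client or from the host shell:

# Remove the old management VMkernel interface (vmk0 in my case)
esxcli network ip interface remove --interface-name=vmk0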


9) I did follow through the process of rebooting my hosts before exiting maintenance mode, but I do not actually think it matters too much.

There we go! A fairly straightforward process, and one that I can't imagine many people needing to do. I did have a look to see if anyone else had performed a similar process, but those I found hadn't moved subnet and gateway. Hopefully this might help someone out there who wants to do this!

Rubrik – PowerShell/API SLA Backup & Restores

Having been lucky enough to procure a Rubrik Cloud Data Management appliance at my work recently, we have had the pleasure of experiencing a fantastic technical solution which has assisted us in improving our backup/recovery and business continuity planning. The solution, for us, is still in its infancy, but we hope to scale and grow as the business realises the full potential of the service. Until then, we have had fun preparing it for our own production use, as it is such a joy to work with!

One thing we questioned was how we get a list of our SLA Domains (as we’ve made a fair few) and their contents. This could be useful in the scenario of someone accidentally deleting policies or machines out of policies. Another potential use case could be if we needed to ‘rebuild’ our Brik SLA configuration in the event of a major failure – highly unlikely but better to be prepared and have committed some brain cycles to it, right?

With that in mind, my esteemed colleague @LordMiddlewick has written some PowerShell scripts with the help of @joshuastenhouse's previous blog posts about using the Rubrik REST APIs.


Backup Script

This script can be scheduled to run at your own convenience. Ensure that you fill in the variables in the top section for your own environment. It is possible to encrypt the password within the file itself; this can be achieved using a methodology described here. For simplicity, in the case below we have only encrypted it for transmission to the Rubrik service.

The key takeaways from this script, in whatever fashion you run it, are:

* You receive a bunch of .txt files, one for each SLA you have defined, in JSON format. Useful for restoring SLAs. Here is an example:

* The other output is the file "VM-SLA.csv", which contains a list of all your VMs that are backed up and which policy each belongs to. This is really useful for restoring VMs into SLAs or bulk-importing VMs into SLAs.
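To give a flavour of the underlying REST calls the script wraps, here is a rough curl equivalent; the endpoints are the Rubrik v1 API as I understand it, and the hostname and credentials are purely illustrative:

# Rough sketch only - adjust hostname, credentials and output paths for your Brik
RUBRIK="rubrik.example.local"
AUTH="admin:SuperSecret"   # illustrative; don't hard-code credentials in production

# Dump every SLA Domain definition as JSON
curl -k -u "${AUTH}" -H "Accept: application/json" \
  "https://${RUBRIK}/api/v1/sla_domain" -o sla_domains.json

# Dump the VM inventory, which includes each VM's configured SLA Domain
curl -k -u "${AUTH}" -H "Accept: application/json" \
  "https://${RUBRIK}/api/v1/vmware/vm?limit=9999" -o vm_inventory.json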

Restore SLA Domain Policies

To reverse the backup process and restore one SLA, or all of your SLAs, into the Rubrik, use the following script:

This script will take any .txt files (SLA backups) in the designated $path and try to create each one back on your Rubrik.
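Under the hood that amounts to POSTing each saved JSON definition back to the SLA Domain endpoint, roughly like this (the endpoint and folder path are assumptions and purely illustrative):

# Rough sketch - re-POST each exported SLA definition back to the Rubrik
RUBRIK="rubrik.example.local"
AUTH="admin:SuperSecret"            # illustrative only
BACKUP_PATH="/backups/rubrik-sla"   # hypothetical folder of exported .txt files

for sla_file in "${BACKUP_PATH}"/*.txt; do
  curl -k -u "${AUTH}" -X POST "https://${RUBRIK}/api/v1/sla_domain" \
    -H "Content-Type: application/json" -d @"${sla_file}"
done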

Restore/Import VMs to SLAs

The final part of this exercise is to be able to restore a list of VMs that have been pulled out, against the SLA Domain policies that you have. The following script does this by using the above "VM-SLA.csv" to import a list of objects and assign them as per the CSV.

The format for the VM-SLA.csv file is as follows:

In theory, if you have lots of machines you want to bulk-assign to any given policies, you can create your own CSV and run this script to import your VM estate into your predefined policies. We used this several times when assigning 100+ objects to a given policy and it worked a treat!
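In REST terms, assigning a VM to an SLA Domain is a PATCH of the VM object with the target SLA's ID, along these lines (the field name is my reading of the v1 API, and the IDs and names below are made up):

# Rough sketch of a single VM-to-SLA assignment
RUBRIK="rubrik.example.local"
AUTH="admin:SuperSecret"                      # illustrative only
VM_ID="VirtualMachine:::1234-abcd"            # hypothetical Rubrik VM ID
SLA_ID="12345678-aaaa-bbbb-cccc-1234567890ab" # hypothetical SLA Domain ID

curl -k -u "${AUTH}" -X PATCH "https://${RUBRIK}/api/v1/vmware/vm/${VM_ID}" \
  -H "Content-Type: application/json" \
  -d "{\"configuredSlaDomainId\": \"${SLA_ID}\"}"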

Disclaimer: Please fully read and understand the above scripts before implementing them. You should test them fully in a development environment before implementing them in any production sense. I/we do not take any responsibility for rogue administrators' stupidity.

I'm sure Rubrik will continue to steam ahead with excellent releases; in fact, they might even build in some of this functionality, making these scripts redundant. In the meantime, hopefully someone finds these scripts useful – I know we have. Once again, a big shout out to @LordMiddlewick for writing this and giving me permission to post it, and also to @joshuastenhouse for his blog https://virtuallysober.com .

VMworld USA 2017 – Wednesday Breakdown

Day three at VMworld was a bit of a slow start for me; the Rubrik party was a late one and there was no keynote, so I decided to rest up a little and try to save my energy.

Hanging out in the community areas, which is the best part of the event, was high on the agenda. Early on in the day we swung by to see our favourite CloudCred lady, Noell Grier. I gave her a bit of a hand doing some "booth babe" duty whilst Rob Bishop collected the GoPro 5 that he won for completing a CloudCred challenge! Noell is an awesome lady, and if you aren't familiar with CloudCred then you should go to the site, sign up, follow her on Twitter and get on it!

The main highlight of Wednesday for me was heading to the customer party. Thanks to #LonVMUGWolfPack shenanigans, Gareth Edwards, Rob Bishop and I ended up wearing some very jazzy VMware code t-shirts. The concert was a blast and we had a great time; I really enjoyed Blink-182 despite not being allowed on the main floor. Here are some pics of the event:


(Credit to Gareth for some of these pictures, thanks dude!)

NSX Performance
Speaker: Samuel Kommu
#NET1343BU

Samuel starts with a show of hands, and it seems that most of the audience are on dual 10GbE for their ESXi host networking.

NSX Overview of overlays
There is not much difference between VXLAN encapsulation and original Ethernet frames; only the VXLAN header is extra.
With the Geneve frame format there is an additional options field (and length) that specifies how much more data can be packed into the header. This can be interesting, as you can pack extra information within it, which then helps capture information on particular flows or packets.

Performance tuning
Parameters that matter – an MTU mismatch is a pain to try and figure out. There are two places you can set it: at the ESXi host level and at the VM level. From a performance perspective, the MTU on the host doesn't matter unless you change it at the VM level too.

There is a good chance that if you change the MTU you will change the performance of your systems. The advice is to change the MTU to the recommended values; the reasoning is that the ratio of header overhead to payload goes down, so you are getting more for your money.

The vDS MTU sets the host MTU as that is what the host is connected to. The underlying physical network needs the same MTU setting too. Fairly standard stuff but important to check and consider.
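A quick way to validate the MTU end to end from a host is a don't-fragment vmkping at (just under) the configured size, for example (target addresses are illustrative):

# Test a 9000-byte MTU path end to end (8972 = 9000 minus IP/ICMP headers)
vmkping -d -s 8972 10.20.0.25
# For the VXLAN/VTEP netstack, test against a remote VTEP at the NSX 1600 MTU
vmkping ++netstack=vxlan -d -s 1572 10.30.0.25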
Optimizations on TCP/IP: sending a large payload without spending CPU cycles is TSO (TCP Segmentation Offload). Segmenting a 1 MB file, for example, doesn't happen within the system; it happens on the pNIC, which chops it up.

With ESXi 6.5 they have brought in a software implementation of LRO, rather than relying on the physical hardware alone to provide it. It is now possible to leverage LRO on NSX 6.5 without the physical capability.
When RSS is enabled
– Network adapter has multiple queues to handle receive traffic
– 5 tuple based hash for optimal distribution to queues
– Kernel thread per receive queue helps.

Rx/Tx filters
– Use inner packet headers to queue traffic

Native driver – with a vmklinux driver, data gets translated into the VMkernel data structures; the native driver removes that translation, meaning fewer CPU cycles are used.

Each vNIC now has its own queue down to the pNIC, rather than sharing the same queue. This scales throughput accurately through to the pNIC. It is also now possible to have multiple queues per single vNIC to the pNIC.

Compatibility guide

The HCL is an obvious place to start when checking versions, to ensure they are all correct and in support. From there it is possible to select the right options so that you can download the latest and correct drivers to install onto your hosts.
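From the host shell you can pull the NIC driver and firmware versions to check against the HCL, for example:

# List the physical NICs and the driver each one is using
esxcli network nic list
# Show driver and firmware version detail for a given uplink (vmnic0 here)
esxcli network nic get -n vmnic0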

Traffic Processes
Traffic flows: east/west (E/W) and north/south (N/S). E/W means logical switch communication within the same switch to other VMs. This is usually the bulk of the traffic; smaller amounts go out as N/S traffic, which also goes through the NSX Edge.

Long flows:
– Designed to push bandwidth to its maximum
– Logs
– Backups
– FTP
Short flows:
– Databases, specifically in-memory ones or cache layers.

Small packets:
– DNS
– DHCP
– TCP ACKs
– Keep alive messages

Not all tools are able to test the latest optimizations; make sure the tools are right for the job. Testing at the application level is often best, but be aware.
Fast Path
When packets come in as a new flow, different actions are taken depending on the header. This happens throughout the entire stack, regardless of E/W or N/S traffic.

When new flows of a similar type are seen, fast path disregards the per-flow actions and fast-tracks them to the destination with no hash table lookup. This applies to a cluster of packets that arrive together: the flow is hashed and then sent via fast path, using around 75% fewer CPU cycles.

The session got quite deep at times and went way further than my limited NSX experience could take me. I'm also not a network admin by day, so if there are any mistakes in my notes I'll correct them as I go.