Snapshots and recovery points

If you are familiar with VMware vSphere, you’ll appreciate the ease of creating a snapshot of a virtual machine and being able to ‘roll back’ to that point in time if something bad happens soon after. Microsoft Hyper-V has a similar concept with checkpoints.

Nutanix AHV supports the same type of functionality at the cluster level (Prism Element): take a snapshot, do a bad thing to the virtual machine, roll it back to the snapshot from when it was working, and try something different.

However…

Once you moved to multi-cluster management with Prism Central, Nutanix did not provide the same ‘simple’ snapshot mechanism from that central management pane. Sure, you could dig down into cluster x of y, create the snapshot and roll it back there, but that’s not great when you want to orchestrate and manage from a central place.

A customer scenario required a self-service request system (developed in-house with their ServiceNow portal and the Nutanix native API) to allow end users to request ad hoc snapshots of virtual machines, and roll them back, without involving IT.

As the customer has a few hundred clusters in multiple locations, it is not feasible to expect end users to know where their virtual machines are located; they just know the machines are important to them. The numbers involved are around 120k users, each with one or two virtual machines (virtual desktops, in fact) that they use each day to carry out their role.
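
One way to spare the user from knowing the cluster is to ask Prism Central for the virtual machine by name and read the owning cluster from the response. The sketch below is only an illustration: it assumes the v3 `vms/list` call with a `vm_name==` filter and a response carrying `metadata.uuid` and `status.cluster_reference`. The helper names (`build_vm_lookup`, `locate_vm`) are made up for this post, and the field names should be checked against the API reference for your Prism Central version. Nothing is sent over the network here.

```python
import json


def build_vm_lookup(pc_host, vm_name):
    """Return (url, body) for a VM-by-name query against Prism Central.

    ASSUMPTION: v3 'vms/list' endpoint on port 9440 with a vm_name filter.
    """
    url = f"https://{pc_host}:9440/api/nutanix/v3/vms/list"
    body = {"kind": "vm", "filter": f"vm_name=={vm_name}", "length": 10}
    return url, json.dumps(body)


def locate_vm(list_response):
    """Pull the VM UUID and owning cluster out of a vms/list style response.

    ASSUMPTION: each entity exposes metadata.uuid and status.cluster_reference.
    """
    entities = list_response.get("entities", [])
    if len(entities) != 1:
        # Names are not guaranteed unique, so refuse to guess.
        raise LookupError(f"expected exactly one VM, found {len(entities)}")
    entity = entities[0]
    cluster = entity.get("status", {}).get("cluster_reference", {})
    return {
        "vm_uuid": entity["metadata"]["uuid"],
        "cluster_uuid": cluster.get("uuid"),
        "cluster_name": cluster.get("name"),
    }
```

The portal would then pass `vm_uuid` to whatever takes the snapshot, so the end user only ever supplies the name they already know.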

If we use the Prism Central v3 API to take a ‘snapshot’ of a virtual machine, what we actually get is a ‘recovery point’. A recovery point is a very good thing because it can be replicated to another cluster, or another country, for protection. It can also be used to clone a new virtual machine for testing, etc. All of this sounds great from a data protection or DR-esque perspective.
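
To make the ‘snapshot is really a recovery point’ idea concrete, here is a rough sketch of what such a request might look like. Treat everything in it as an assumption: the `vm_recovery_points` path, the `parent_vm_reference` field and the overall payload shape are my reading of the v3 API and may differ between Prism Central versions, and `build_recovery_point_request` is a hypothetical helper, not a Nutanix function. It only builds the request; it does not send it.

```python
import json


def build_recovery_point_request(pc_host, vm_uuid, name):
    """Return (url, body) for creating a recovery point of one VM.

    ASSUMPTION: endpoint path and payload fields are illustrative and
    must be verified against the v3 API reference before use.
    """
    url = f"https://{pc_host}:9440/api/nutanix/v3/vm_recovery_points"
    body = {
        "metadata": {"kind": "vm_recovery_point"},
        "spec": {
            "name": name,
            "resources": {
                # The recovery point is tied to the source VM by UUID,
                # not by its friendly name.
                "parent_vm_reference": {"kind": "vm", "uuid": vm_uuid},
            },
        },
    }
    return url, json.dumps(body)
```

Note that the request is made against Prism Central rather than an individual cluster, which is the whole point of managing this centrally.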

Unfortunately, this did not include the simple roll back we were used to in single-cluster management. In fact, if you cloned out the recovery point to a new virtual machine it would have a new UUID, which for all intents and purposes quite rightly means it is a new machine. Our configuration management database relies upon those UUIDs to track virtual machines because the ‘friendly’ name can sometimes change (e.g. if a machine moves from DC A to DC B or from Geo 1 to Geo 2), whereas the UUID is fixed.
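
A toy example shows why the UUID matters. The `Cmdb` class below is a stand-in I have invented for illustration (it is not the customer’s actual CMDB), and the UUIDs, names and owner are made up. A rename leaves the record intact because the key is the UUID, whereas a clone arrives with a UUID the inventory has never seen.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    uuid: str
    name: str
    owner: str


class Cmdb:
    """Toy inventory keyed by VM UUID rather than by friendly name."""

    def __init__(self):
        self._by_uuid = {}

    def register(self, uuid, name, owner):
        self._by_uuid[uuid] = VmRecord(uuid, name, owner)

    def rename(self, uuid, new_name):
        # The friendly name changes, the key does not.
        self._by_uuid[uuid].name = new_name

    def owner_of(self, uuid):
        record = self._by_uuid.get(uuid)
        return record.owner if record else None


cmdb = Cmdb()
cmdb.register("uuid-original", "dca-desktop-0042", "alice")

# Moving from DC A to DC B changes the name but keeps the UUID.
cmdb.rename("uuid-original", "dcb-desktop-0042")
still_tracked = cmdb.owner_of("uuid-original")

# A VM cloned from a recovery point has a new UUID, so it is unknown.
clone_owner = cmdb.owner_of("uuid-clone")
```

Here `still_tracked` is the original owner, while `clone_owner` is `None`: as far as the inventory is concerned, the clone is a brand new machine.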

Nutanix clearly understood, though, that whilst recovery points are great for disaster recovery and protection, they are not efficient for a simple snapshot and roll back.

A recent Prism Central release introduced ‘revert’, which allows us to use the proven power of a recovery point as a snapshot and, in effect, ‘have our cake and eat it’. This is currently available within the Prism Central UI. Whilst the API call exists, it is not functioning properly at the moment, but this is hoped to be fixed soon.

There will be a blog post working through this whole process end to end in the near future.

Of course, the obligatory mention: snapshots are not backups.