VMware vSAN data protection

VMware vSAN is a software-defined storage solution. It pools the local drives of a set of servers into clustered shared storage, which saves rack space and energy. vSAN offers data deduplication and compression, multisite deployments, performance policies, and other VMware-native tools, all manageable through the unified vSphere HTML5 user interface.

One of the hardest vSAN concepts to understand is how data protection works. It involves a mix of terms such as failures to tolerate, primary and secondary failures, and different RAID levels. To clarify these, let’s look at how data protection works in vSAN.

Hardware protection vs VM

Data protection in vSAN works a little differently than on a regular storage array.

A regular array protects its data by applying RAID levels to the whole array or to individual volumes. RAID ensures that the data remains available in the event of a disk failure. Combined with spare drives, this makes a fairly resilient solution, which makes sense because a single array usually serves many servers.

vSAN is different: it also uses RAID, but not for protecting data in the traditional sense. Traditional RAID is a poor fit here. With NVMe drives, for example, it would consume too many CPU resources. A regular server storage controller is also too slow for NVMe drives, so building a RAID array on it would defeat the purpose of having NVMe in the first place.

vSAN protects data not globally or device-wide, but at the virtual machine level. Each VM has a storage policy that determines its level of protection and ensures that the VM data is replicated to other servers in the vSAN cluster. In the event of a failure, the VM data is available on a different server (and, in conjunction with vSphere HA, the VM is restarted there).

This makes data protection more granular: there is no need to “think ahead” when creating a storage pool for different applications, and you protect only what is consumed.

RAID levels

As mentioned, RAID itself is not what protects the data. In the context of vSAN, RAID determines how VM data is distributed between the servers in the cluster.

With RAID1, each VM has a replicated copy (mirror) located elsewhere in the cluster. This keeps the data highly available should a server fail.

Alongside the VM replica, a small witness component is created and stored on a third server in the cluster to maintain quorum. This is why RAID1 requires at least 3 servers.
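The role of the witness can be illustrated with a small sketch. It assumes a simplified model in which every component (two replicas and one witness) carries one vote, and an object stays accessible only while a full replica survives and more than half of the votes are reachable. The host names and the `is_accessible` helper are hypothetical and not part of any VMware API.

```python
# Simplified model of a RAID1, FTT=1 object: two replicas plus a witness,
# each placed on a different host. Host names are made up for illustration.
PLACEMENT = {
    "esx-01": "replica",
    "esx-02": "replica",
    "esx-03": "witness",
}


def is_accessible(placement: dict[str, str], failed_hosts: set[str]) -> bool:
    """Return True if the object keeps quorum and at least one full replica."""
    surviving = [kind for host, kind in placement.items() if host not in failed_hosts]
    has_replica = "replica" in surviving
    has_quorum = len(surviving) > len(placement) / 2  # more than 50% of the votes
    return has_replica and has_quorum


if __name__ == "__main__":
    print(is_accessible(PLACEMENT, {"esx-01"}))            # one host down
    print(is_accessible(PLACEMENT, {"esx-01", "esx-03"}))  # replica and witness down
```

In this model, two replicas without a witness would lose quorum as soon as either host failed, because one surviving vote out of two is not a majority. The third component is what breaks the tie.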

Speed-wise, RAID1 is the fastest: reads and writes are processed against the primary VM object, and the replica is updated accordingly. That is why RAID1 is used for VMs that are sensitive to storage performance. It also consumes the most space: each VM consumes 200% of its size, or 400% in stretched cluster deployments.

RAID 5/6

When a RAID5/6 policy is used, the VM does not store full copies across the cluster. Instead, its data is “chopped” into 4 or 6 pieces (data plus parity, depending on the RAID level) and distributed across the cluster.

The main advantage of RAID5/6 over RAID1 is lower storage consumption: a VM requires about 133% of its size with RAID5 (3 data + 1 parity) and 150% with RAID6 (4 data + 2 parity), compared with 200% for RAID1 at FTT=1. The drawback is write amplification, because the parity block has to be updated whenever data is updated. This adds latency and disk wear, which is why RAID5/6 is only available on all-flash nodes. RAID5/6 is commonly used for VMs that are less sensitive to storage performance, or to save space. RAID5 requires at least 4 servers and RAID6 at least 6.
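The space figures can be reproduced with a short calculation. This is a rough sketch that assumes the 3 data + 1 parity layout for RAID5 and the 4 data + 2 parity layout for RAID6, and it ignores the small witness components. The `raw_capacity_gb` helper is a hypothetical name used only for illustration.

```python
from fractions import Fraction

# (data components, parity components) per stripe; RAID1 is handled separately.
ERASURE_LAYOUTS = {"RAID5": (3, 1), "RAID6": (4, 2)}


def raw_capacity_gb(vm_size_gb: float, raid: str, ftt: int = 1) -> float:
    """Approximate raw capacity a VM consumes, ignoring witness components."""
    if raid == "RAID1":
        multiplier = Fraction(ftt + 1)  # FTT + 1 full copies of the data
    else:
        # The layout fixes the protection level (RAID5: FTT=1, RAID6: FTT=2),
        # so the ftt argument is not used here.
        data, parity = ERASURE_LAYOUTS[raid]
        multiplier = Fraction(data + parity, data)
    return float(vm_size_gb * multiplier)


if __name__ == "__main__":
    for level in ("RAID1", "RAID5", "RAID6"):
        print(level, round(raw_capacity_gb(100, level), 1), "GB raw for a 100 GB VM")
```

Under these assumptions, a 100 GB VM consumes roughly 200 GB with RAID1, 133 GB with RAID5, and 150 GB with RAID6.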

Failures to tolerate

In vSAN, failures to tolerate (FTT) determines the number of failures that each virtual machine can tolerate and still remain available. Failures are counted per failure domain, which is usually a server that hosts VM data.

Most commonly used vSAN data protection policies:

FTT=1, RAID=1

If a server fails, the VM data is still available on a different server because RAID1 keeps a mirror. As mentioned, this level of protection requires 3 servers: 2 for data and 1 for the witness.

FTT=1, RAID=5

With RAID5, if one server fails (say, server 3), the VM data is still accessible because the missing piece is reconstructed from the surviving data and the parity (on server 4 in this example). This still falls under the FTT=1 policy.
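How parity makes a lost piece recoverable can be shown with plain XOR. This is a generic single-parity illustration, not vSAN's actual on-disk format: it assumes three equally sized data blocks and one fixed parity block, and `xor_blocks` is a hypothetical helper.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks on three hosts, parity on a fourth (3 + 1 layout).
data = [b"vSAN", b"RAID", b"five"]
parity = xor_blocks(data)

# The host holding data[2] fails: rebuild its block from the survivors.
rebuilt = xor_blocks([data[0], data[1], parity])
assert rebuilt == data[2]
```

Any single block, data or parity, can be rebuilt this way from the other three. Losing two blocks at once cannot be recovered with single parity, which is why this layout tolerates only one failure.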

From this, we might assume that RAID1 and RAID5/6 are what protects the data, since both tolerate the same number of failures. It is important not to mix up the two settings, however: it is possible to create an FTT=2, RAID1 deployment, where a VM has 2 additional copies (3 in total), or an FTT=2, RAID6 deployment, where a VM is divided into 4 data and 2 parity blocks.

If 2 or more servers fail, VM data will no longer be available unless an FTT=2 policy was applied before the failure.

Policies and rebuilds

vSAN policies are applied to the VMs running on the vSAN cluster. This is where the protection level (FTT=1/2, RAID1/5/6) is set for the VM. Other parameters, such as performance and data reduction, can also be set.

After a server failure, the virtual machine still holds the properties of its policy (in this case FTT=1, RAID=1). If the failure is not resolved and the server does not return healthy to the cluster within 1 hour (the default, which can be changed), the affected VM data is rebuilt elsewhere in the cluster to restore compliance with the policy.
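The rebuild decision can be pictured as a simple timer check. This sketch only models the rule described above; `REPAIR_DELAY` and `should_rebuild` are hypothetical names, and the 60-minute value is the default mentioned in the text.

```python
from datetime import datetime, timedelta

# Default delay taken from the text above (1 hour); configurable in practice.
REPAIR_DELAY = timedelta(minutes=60)


def should_rebuild(failed_at: datetime, now: datetime,
                   delay: timedelta = REPAIR_DELAY) -> bool:
    """Return True once the host has been absent for at least the repair delay."""
    return now - failed_at >= delay


if __name__ == "__main__":
    failed = datetime(2024, 1, 1, 12, 0)
    print(should_rebuild(failed, failed + timedelta(minutes=30)))  # still waiting
    print(should_rebuild(failed, failed + timedelta(minutes=75)))  # rebuild starts
```

A longer delay avoids unnecessary rebuild traffic during short maintenance windows, at the cost of running with reduced protection for longer.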

Because of this, it is important to plan storage consumption for failure and maintenance events. As a general rule, VMware recommends having at least 1 more server in the cluster than the RAID level minimally requires. That means 4 servers for RAID1, 5 for RAID5, and 7 for RAID6.
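The sizing rule is easy to capture in a lookup. This is a minimal sketch using the minimum host counts given in this article for FTT=1 RAID1, RAID5, and RAID6; `MIN_HOSTS` and `recommended_hosts` are hypothetical names.

```python
# Minimum hosts per RAID level, as stated in this article.
MIN_HOSTS = {"RAID1": 3, "RAID5": 4, "RAID6": 6}


def recommended_hosts(raid: str, spare: int = 1) -> int:
    """Minimum host count for a RAID level plus spare hosts for rebuilds."""
    return MIN_HOSTS[raid] + spare


if __name__ == "__main__":
    for level in MIN_HOSTS:
        print(f"{level}: minimum {MIN_HOSTS[level]}, recommended {recommended_hosts(level)}")
```

The spare host gives the cluster somewhere to rebuild components while a failed or maintained server is away, so the policy can be brought back into compliance.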

From this, we can conclude that in the context of vSAN, RAID only determines how data is distributed between servers; it is not what protects the data. All data is protected through the failures to tolerate setting. Data protection is done at the VM level and is defined by a storage policy.

To get more familiar with vSAN, you can sign up for the VMware vSAN Hands-on Labs, where you can manage and configure a vSAN cluster yourself. If you would like us to tell you more about vSAN, ping us through the About us section.