Squeezing three-AZ K8s into two AZs using vSphere

Even though Kubernetes is the “next cool thing”, I still see many customers sticking with Broadcom/VMware as the underlying layer. Having been a vSpecialist for over 10 years, I can relate: It is an amazing piece of middleware. Don’t have a third DC, but still want to install Kubernetes across two? With VMware vSphere, you can!

The Tertiary Site Paradox

Kubernetes usually does everything in threes: quorum rules, and the majority of nodes decides. So how do you apply this to a dual-datacenter approach? We see this question many times over, and people are getting creative (and sometimes a little bit too creative 😉 ). One option is to simply squeeze three AZs into two sites, but without proper tooling around that it can hurt (up to the point where it doesn’t make sense and other approaches are a better fit). You could also look at solutions like Portworx (disclaimer: #Iwork4Portworx). Portworx allows you to perform either asynchronous or even synchronous replication between two separate Kubernetes clusters, including (for all you VMware fans out there) a VMware Site Recovery Manager-like recovery!

Still, many, many customers will continue to trust their Active/Active Metro VMware setups to run VMs… and containers. If this is you, then this blog is for you!

The Active/Active VMware Metro layer

The VMware layer is usually already present, as customers have been running their VM workloads that way and are now looking into extending their capabilities into persistent Kubernetes. Some people will just “throw on” something like OpenShift into VMs on the metro cluster without considering the location of the control-plane nodes. The question is: is this smart, OK, or not-so-OK? The answer lies in latency, your trust in etcd, and a little bit of architecture.

So does it matter where your control plane instances are running? Following Kubernetes logic you should try to spread the control-plane nodes across as many AZs as you can get your hands on. If you only have two, you’d think that one should run in one AZ and the two remaining ones in the other AZ. But what happens if that second AZ is impacted? etcd makes no secret of this: lose two out of three nodes and you lose majority, and etcd will go down. And this is where VMware comes in, specifically HA (High Availability).
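To make the quorum arithmetic concrete, here is a minimal sketch (plain Python, no etcd involved) of the strict-majority rule etcd follows:

```python
def quorum(members: int) -> int:
    """etcd needs a strict majority of members to stay available."""
    return members // 2 + 1

def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members survive to keep the cluster up."""
    return healthy >= quorum(members)

# Three control-plane nodes: quorum is 2.
assert quorum(3) == 2
# Lose one of three: etcd keeps running.
assert has_quorum(3, 2)
# Lose two of three: majority gone, etcd goes down.
assert not has_quorum(3, 1)
```

Note the integer division: a five-member cluster needs three healthy members, so even numbers of members buy you nothing extra, which is why etcd clusters come in odd sizes.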

VMware High Availability

In this setup we are running a stretched VMware cluster across two AZs (usually two datacenters) with storage stretched underneath using some kind of active/active metro stretching:

Figure 1: VMware stretched cluster with an Active/Active Metro storage configuration. In this example a pair of Pure FlashArrays are used in an ActiveCluster configuration, but other arrays work as well (although your mileage may vary 😉 ). ESX gets its storage from a stretched LUN, feeding into the VMs that run in the cluster. In this example two “should run on” groups were created to make sure the VMs stay in the correct DC unless all ESX servers in that DC become unavailable. Only in that case will VMs restart in the “non-preferred DC”. So far so good!

Piling on Kubernetes

If we now pile on Kubernetes, we can actually specify which workers and control-plane nodes should run where by placing them into the correct “should run on” group:
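A quick way to reason about this pinning is to model it as a simple mapping. This is just a sketch for thinking it through; the group and VM names are made up and this is not a vSphere API call:

```python
# Hypothetical DRS "should run on" groups matching Figure 2:
# one control-plane node pinned to DC1, two pinned to DC2.
placement = {
    "should-run-dc1": ["control-plane-1", "worker-1", "worker-2"],
    "should-run-dc2": ["control-plane-2", "control-plane-3", "worker-3", "worker-4"],
}

def control_planes_in(group: str) -> int:
    """Count control-plane VMs pinned to a given DRS group."""
    return sum(1 for vm in placement[group] if vm.startswith("control-plane"))

assert control_planes_in("should-run-dc1") == 1
assert control_planes_in("should-run-dc2") == 2
```

Because these are “should” rules rather than “must” rules, DRS keeps the VMs in their home DC during normal operation, but HA is still free to restart them in the other DC when an entire site fails.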

Figure 2: Piling Kubernetes onto a stretched VMware Active/Active Metro cluster. Control-plane nodes have been “pinned” to their respective site (one in DC1, two in DC2) in an attempt to spread the control-plane nodes as much as possible between AZs.

As you can see, we are now attempting to spread the risk for failure, spreading the control-plane nodes as much as we can. But does this make sense in a VMware setup? Not what you’d expect…

Control Planes and two AZs with VMware: Not what you’d expect!

So throughout this blog I have been pushing to spread control-plane nodes across as many AZs as I can find; I’d even prefer a third (mini) site for control-plane nodes, witnesses and so on. But does this make as much sense in a two-AZ deployment with VMware Active/Active Metro? The short answer is: it doesn’t.

Don’t judge too soon: I would actually propose to move ALL control-plane nodes into a SINGLE AZ:

Figure 3: Two-AZ setup, VMware is in a stretched Active/Active Metro configuration. All control-plane nodes have now been pinned to a single AZ. You might not think it, but that actually makes sense!

Call me crazy, but here is my logic: compare Figures 2 and 3 and consider etcd’s behavior:

  • In figure 2, if we lose DC1, nothing happens (two control-plane nodes remain, which is a majority). If we lose DC2, we lose quorum and etcd goes down. We have to rely on VMware HA to restart control-plane nodes to regain majority and resume;
  • In figure 3, if we lose DC2, nothing happens (no control-plane nodes run there). If we lose DC1, we lose quorum and etcd goes down, as we just lost all three control-plane nodes. We have to rely on VMware HA to restart control-plane nodes to regain majority and resume.
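The bullet points above can be sketched as a tiny failure simulation (plain Python; the 1+2 and 3+0 layouts correspond to Figures 2 and 3):

```python
def survives(layout, failed_dc):
    """True if etcd keeps quorum when an entire DC goes dark."""
    total = sum(layout.values())
    remaining = sum(n for dc, n in layout.items() if dc != failed_dc)
    return remaining >= total // 2 + 1

figure2 = {"DC1": 1, "DC2": 2}  # spread: one node in DC1, two in DC2
figure3 = {"DC1": 3, "DC2": 0}  # pinned: all three nodes in DC1

# Figure 2: losing DC1 is fine, losing DC2 kills quorum.
assert survives(figure2, "DC1") and not survives(figure2, "DC2")
# Figure 3: losing DC2 is fine, losing DC1 kills quorum.
assert survives(figure3, "DC2") and not survives(figure3, "DC1")
```

Either way, exactly one of the two possible DC failures takes etcd down until VMware HA restarts the nodes, which is the whole point of the comparison.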

See how similar these two situations are? VMware HA is very good (and smart) at restarting workloads. The upshot of figure 3 is that all etcd nodes are in the same AZ, so latency is minimal and the chance of them being severed from each other by a network failure is much smaller. That is exactly my logic for placing all control-plane nodes in a single AZ, counter to what your instinct might say.

Portworx into the mix

No dual-cluster going on here, and no requirement for a separate DR setup. Would Portworx still make sense in this scenario? In my opinion it would! The Active/Active Metro setup described here will not protect you against logical failure (like someone blowing up the Kubernetes cluster’s etcd): with stretching, a logical failure hits both sites at the same time. Also, for stretching to work you generally need to stay within roughly 10ms RTT, so the chance of a single disaster striking both AZs is realistic. With Portworx DR you could add an additional THIRD AZ at a distance; you could asynchronously replicate (and DR failover if needed) to a third, “very remote” site. This site could be across the country, in another country or even in public cloud. On Mars, for all I care 😉 Now how cool is that?
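As a rule-of-thumb sketch (the 10ms figure is the commonly quoted metro guideline, not a hard specification, and the function name is mine), you can express the topology choice like this:

```python
def replication_mode(rtt_ms, metro_limit_ms=10.0):
    """Pick a replication style based on round-trip time between sites."""
    if rtt_ms <= metro_limit_ms:
        return "synchronous (Active/Active Metro candidate)"
    return "asynchronous (DR-style replication to a remote site)"

# Two datacenters across town: metro stretching is on the table.
assert replication_mode(2.0).startswith("synchronous")
# A cross-country (or cross-planet 😉 ) site is async-only.
assert replication_mode(80.0).startswith("asynchronous")
```

The point of the third site is precisely that it sits on the “asynchronous” side of this line: far enough away that no single disaster reaches it.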

Figure 4: As Active/Active Metro setups need to be “close” and generally offer less protection against logical failures, adding a “very remote” third site with async replication can make a lot of sense… Powered by Portworx!

Portworx builds a storage abstraction layer (whose look and feel very much parallels VMware’s VMFS). On top of this layer Portworx delivers all of its data services, which means the underlying infrastructure can be different, even the Kubernetes flavor can be different, and we can still replicate and DR between them! The storage abstraction layer also solves the scaling issue around the number of disk devices you can add to a VM: Portworx will happily carve out thousands of PVCs from just a few virtual disks attached to a worker VM!

The list of arguments in favor of Portworx goes on: smarter backups, encryption, RBAC, multi-tenancy, data management using Autopilot, and more. Maybe in a next post 🙂
