There are many ways to obtain resiliency in the world of cloud-native applications. Some prefer to handle everything in the app-layer (this is the true 12-factor approach), others prefer to leave things more to the underlying infrastructure. No matter the preference, I see lots of customers stretching k8s across physical locations, often without proper architectural consideration.
So how DO you architect a proper multi-site environment? That largely depends on the lens you’re looking through: developers tend to fix things in the layer they know (the app itself), while infrastructure-minded people tend to rely on infrastructure features. There is no good or bad here, just many different ways; some are better than others, and some are simply not even supported.
But first, let’s look at the difference between replication and metro:
The “olde” Dual-Site approach: Replication
From an infrastructure perspective, things have always been focused on this “dual DC approach”: you run in one DC, replicate to the other (either synchronously or asynchronously) and fail over if ever needed (disaster recovery).
This case is rather simple compared to the metro use case (see below), but it has its own implications. You can use completely separate systems in both DCs, coupled only through the storage replication layer. That at least keeps the architecture simple.
The devil is in the details: as you replicate storage from one site to the other, the destination site receives every change written at the source site. This also means that replicated volumes are read-only at the destination site.
Failing over a replicated application
In case of a failover, what will you be required to do? Well, first of all you either need to break (disaster) or reverse (controlled failover) the replication in order to promote the volumes to read/write. Next you need to somehow recover your workloads and start them. VMware even made a full-blown product for this: VMware Site Recovery Manager. This tool would automatically handle the storage layer, find the VMs on that storage and restart them.
The work needed on the storage layer tempted many people into the world of metro, as it seems so much less complex at first sight… But the truth is, metro is actually far more complex to architect, maintain and grow.
The “olde” Dual-Site approach: Active/Active Metro
When architecting for metro, you would first implement an active/active metro storage configuration, usually stretch your hypervisor across the two sites, and architect it out further, possibly even to the point of doing disaster avoidance. I have seen MANY customers chasing the perfect metro solution, and yes, it exists. It is not easy to obtain, though: metro seems SO simple, but in fact it has many, MANY considerations, such as site-awareness, site preference, multipathing (cross-connected paths between DCs or not), and uniform versus non-uniform access methods depending on the storage solution used. What about witnesses? You’d need a third (maybe tiny) site for that, plus the required separate networking between them. And speaking of networking: how will your apps reach the internet? Through one of the two sites? If that site fails, how will your network cope with the now severed sessions as you fail over the connectivity to the remaining site?
All of this makes metro a truly complex architecture, but a very resilient one if done right.
Does metro work for k8s?
The real question should be: can we project the metro we know and love 1-to-1 onto a k8s environment? Will a k8s-based workload benefit in the same way as a vSphere hypervisor based metro solution? The short answer is: it will not (unless you get into the interesting discussion of layering a hypervisor under your k8s clusters and using that to obtain true app mobility). K8s is a different world, with different abilities and limitations, and by far not as sophisticated (read: enterprise-ready) as the “old world” of active/active metro with a stretched vSphere on top.
Kubernetes: The power of three?
Now we make the switch to CNAs. In most cases (if not all by now), this means Kubernetes in some shape or form, running containers in pods. Can we project our dual-site approach more or less directly onto this different world? As with all IT-related things, the answer is “it depends”. You may already suspect there is an issue with the idea of “2”, as k8s tends to do everything in odd numbers (so “3” in most cases).
Let’s briefly look at some ways we could make things work; I will detail them in future posts.
K8s and replicated storage
With storage replication, the compute that sits on top is usually separate at each of the two sites. Architecturally this works beautifully for k8s: you can build two separate, local clusters of your favorite k8s platform and simply replicate the data between them.
There is a small caveat in this approach: k8s does NOT understand multi-site, and it does not understand storage replication. Many vendors solve this in different ways. As #Iwork4Dell: the Dell solution uses add-on modules called CSM (Container Storage Modules) on top of the k8s deployment to make k8s “storage replication aware”. The module to look at in this respect is the CSM Replication module, but there are many more worth checking out.
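To give a feel for the general pattern (a minimal sketch only, and explicitly NOT the actual CSM Replication syntax: the driver name and every parameter below are hypothetical placeholders), replication awareness in the CSI world typically surfaces as parameters on a StorageClass, so that any volume provisioned from it gets paired with a counterpart on the remote site:

```yaml
# Hypothetical sketch: parameter names are placeholders for illustration,
# NOT the real Dell CSM Replication parameters. The point is the pattern:
# a replication-aware module reads intent from the StorageClass and handles
# creating and pairing the remote volume for you.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-block
provisioner: csi.example.com            # placeholder CSI driver name
parameters:
  replication-enabled: "true"           # hypothetical: volumes from this class get replicated
  remote-cluster-id: "site-b"           # hypothetical: the k8s cluster on the secondary site
  remote-storageclass: "replicated-block"
  rpo: "0"                              # hypothetical: synchronous replication (RPO zero)
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

Your PVCs then simply reference such a class, and the replication plumbing stays out of the application’s way.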
The one issue to solve here is how to get the application definitions to the secondary site. If you deploy all k8s things using pipelines, you could deploy to both sites, scale the secondary copy down to zero replicas and keep it on standby. If you deploy manually, keeping both sites in sync becomes a much harder task.
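As a minimal sketch of that “scale the standby to zero” idea (the app name, image and PVC name are placeholders I made up for illustration), the exact same Deployment is applied to both clusters, with only the replica count differing per site:

```yaml
# Applied to the primary cluster with replicas: 3; the standby cluster gets the
# identical manifest with replicas: 0. Only the data (the PVC's backing volume)
# is kept in sync by the storage replication underneath.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                                   # standby site: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: myapp-data                 # claims the replicated volume
```

On failover you promote the replicated volumes on the secondary site and simply scale the standby copy up, for example with kubectl scale deployment/myapp --replicas=3 against the secondary cluster.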
K8s and active/active metro
This is where things get really interesting. Active/active metro spans two sites, where any given “metro’d” volume is accessible for both read and write on both sites at the same time. Normally you would not be writing to both sites at the same time (unless you are a VMware vSphere hypervisor using the VERY smart VMFS multi-writer filesystem), but a failover in the world of k8s can be as simple as killing the app on one site and restarting it on the other. As both sites are always R/W enabled, no actions on the storage layer are required to make this work. That seems VERY appealing to many customers.
“Recovering a k8s cluster may involve restoring etcd”
The headline above might sound pretty scary, but it is what a customer was told by a leading k8s platform vendor. This is what you might get into when you try to squeeze “three into two”. The customer in question had a two-site setup, an active/active metro storage solution, a VMware vSphere hypervisor layer and a k8s platform “thrown on top” of that. No pinning of control plane nodes, no pinning of worker nodes; “it just works”, they told me :O
I urged them to talk to their k8s platform vendor to make sure this was a supported solution (knowing that it wasn’t): if you spread three control plane nodes across two sites, the best you can do is to have one control plane node on one site and the remaining two on the other. etcd needs a majority (two out of three members) to keep quorum, so if the site carrying the two control plane nodes fails, the single surviving node cannot form quorum and k8s might simply never recover, requiring a restore of etcd to get it back up and running. Imagine you have 100+ clusters (which was the case) and 10 of those would not survive a site failure. How much time and manual effort would you need to restore operations? This is NOT why you’d build a metro solution!
So what WOULD be needed? The answer is: Three sites, no different from a decent metro storage solution that also needs a third site for its witness (yes – a decent active/active metro solution can function without a witness, but there would be severe uptime implications for certain failure scenarios – not to mention a more complex architecture to maintain).
Assuming you have access to a three-DC setup, AND you can obtain <=2ms RTT latency between them (this is to satisfy etcd’s chattiness), you could actually use two DCs for active/active metro, run a control plane node in each of them, and use the third site just for the witness and the third control plane node (or nodes, if you deploy multiple stretched k8s clusters).
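As a minimal sketch of what that placement could look like (zone and node names are placeholders; in practice the labels come from kubectl label or from your platform’s node pools rather than from manifests like these), each control plane node ends up in its own DC, with dc-c acting purely as the tie-breaker site:

```yaml
# One control plane node per DC: dc-a and dc-b carry the metro storage and the
# workloads, dc-c only hosts the third control plane node as the tie-breaker.
apiVersion: v1
kind: Node
metadata:
  name: cp-1
  labels:
    topology.kubernetes.io/zone: dc-a
    node-role.kubernetes.io/control-plane: ""
---
apiVersion: v1
kind: Node
metadata:
  name: cp-2
  labels:
    topology.kubernetes.io/zone: dc-b
    node-role.kubernetes.io/control-plane: ""
---
apiVersion: v1
kind: Node
metadata:
  name: cp-3
  labels:
    topology.kubernetes.io/zone: dc-c
    node-role.kubernetes.io/control-plane: ""
```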
What is really cool about stretching k8s clusters is that your app deploys to just one place: the stretched cluster. If a DC fails, the cluster loses half of its nodes (assuming you built it symmetrically) and k8s can fail the workloads over to the remaining nodes. (Hint: you may want to look at the Dell CSM Resiliency module, because worker nodes failing while persistent storage is attached does not give you “enterprise level” behavior out of the box; effectively, k8s will just sit there and watch your app being down, not taking, nor being able to take, the appropriate actions to recover.)
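A minimal sketch of building that symmetry into a workload, assuming the nodes carry the standard topology.kubernetes.io/zone label as in the sketch above (the app name and image are again placeholders):

```yaml
# Spread replicas evenly across the zones of the stretched cluster, so a DC
# failure only takes out roughly half of them; the survivors keep serving while
# k8s reschedules the missing pods onto the remaining site.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway   # still schedule onto one zone if the other site is gone
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.0   # placeholder image
```

For stateless pods this behaves reasonably well out of the box; for pods with persistent volumes attached, node failure handling is exactly the gap the hint above is about.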
Final thoughts on “Stretch or Sever”
First thing to note is that K8s IS different from vSphere. Different abilities, different requirements. This (should) result in a different architecture in order to make it work properly. No matter what you think or are comfortable with, you NEED to consider this as a different use case and architect accordingly.
Stretch (metro) or sever (replication)? Both have pros and cons, and there is no silver bullet. It would be a good start to understand that metro and replication SEEM alike, but in fact are VERY DIFFERENT architectures.
From my personal perspective, I would say that metro is far less “cool” in a k8s environment than in, for example, a VMware vSphere environment. We (Dell / EMC) have been thinking together with VMware for SO long about how metro should really work: what about site-awareness, what about losing storage paths, the discussion of APD (All Paths Down) versus PDL (Permanent Device Loss) and how to respond to each of those in different ways… So complex, but so cool once it finally all worked out. In the world of k8s we are many miles, I might even say lightyears, away from reaching such a north star. It feels to me as if we have jumped back some 10 years in history, and I see (way too many) wheels being reinvented by developers who don’t know infrastructure and its abilities.
Why is metro in k8s way less cool? Consider moving a workload in k8s today (and for the foreseeable future): you basically kill all of the containers and restart all of the containers elsewhere. There is no “magic” live move of workloads like VMware’s vMotion, which is capable of moving a VM truly without downtime. So why have metro for k8s to begin with, you might think?
I see more and more customers (and vendors, by the way) moving toward a simpler alternative to “true” metro that still checks almost all of the boxes for metro (at least in the world of k8s): two separate (non-stretched) k8s clusters with synchronous replication between them, plus some form of automated failover. This might actually become the sweet spot for CNA storage resiliency in the years to come; it requires only two sites, no stretching of k8s clusters, and no (manual) work on the storage layer when recovering from a disaster. It is a bit of a mix between “stretch” and “sever” and just might be the golden path forward.