In the previous “a brief history of metro” blog post (part 1) I discussed replication and metro and how they are “the same but different”. In this episode I will look at the two ways a storage solution can architect metro through uniform and non-uniform access methods.
(And if you are not a fan of reading – You can watch the video below or HERE)
Simply put, there are two ways of building an active/active metro platform: using either the uniform or the non-uniform access method. To be honest, the uniform access method is really more of an active/passive metro setup, as each volume is handled by one active controller while the other controller (in the other datacenter) is on standby. You may already see issues with the uniform method starting to show (and yes – #iwork4dell – I have always been a huge fan of the non-uniform access method).
The Uniform Access Method
The uniform access method goes something like this: I have an array with two controllers, but ultimately I want to stretch that array across two sites. If I took a saw to that array and sawed it in half, I could move one of the controllers to the other site. Any wires I cut when sawing I can restore across the WAN, and I have a metro setup.
Sounds great, but there are some serious drawbacks to this approach. Imagine you break up a dual-controller array and place one controller in each site. Now ANY failure of just a single controller would cause a recovery response. With that you have just introduced a SPoF (Single Point of Failure) into your architecture, not something you’d want when you are taking all of this trouble to build a metro setup.
In addition, this is more of an active/passive metro setup; in this architecture a volume is always handled by just ONE controller while the second controller is a standby. This can cause situations like the one depicted below:
In the image above you can see how two controllers have been stretched across sites to build a uniform, active/passive metro setup. There is a complex WAN structure required to glue the array’s controllers “back together”.
In the case above you can see a workload in the right datacenter while the active controller for that volume is in the left datacenter. As this metro architecture is really active/passive, the I/O will continue to flow to the same controller even if you move the workload to the other datacenter. In this case all reads go across the WAN, which induces unwanted latency, and a controller failover would have to be done to rectify the situation. For writes it is even worse: first the write from the VM traverses the WAN to reach the active controller, but then the write still needs to be mirrored. So the system traverses the WAN two more times (the mirrored write and its acknowledgement). Finally the system traverses the WAN one extra time to acknowledge the write as a whole back to the VM. That is an “impressive” FOUR times across the WAN for a SINGLE write! There must be a better way of building out metro… And of course there is: the architecture that uses the non-uniform access method.
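The four WAN crossings described above can be tallied in a small sketch. The step descriptions follow the text; the 5 ms one-way WAN latency is purely an assumed number for illustration, not a measured or recommended value.

```python
# Steps of a single write in the uniform (really active/passive) setup, for a
# VM that runs in the datacenter where the volume's controller is on standby.
# Each entry is (description, crosses_wan).
UNIFORM_REMOTE_WRITE = [
    ("VM sends the write to the active controller", True),
    ("active controller mirrors the write to the standby controller", True),
    ("standby controller acknowledges the mirrored write", True),
    ("active controller acknowledges the write to the VM", True),
]


def wan_crossings(steps):
    """Count how many steps of an I/O cross the WAN."""
    return sum(1 for _, crosses_wan in steps if crosses_wan)


def added_latency_ms(steps, one_way_ms):
    """WAN latency added to the I/O, given a one-way WAN latency."""
    return wan_crossings(steps) * one_way_ms


ONE_WAY_MS = 5.0  # assumption, for illustration only

print("WAN crossings per write:", wan_crossings(UNIFORM_REMOTE_WRITE))
print("added latency (ms):", added_latency_ms(UNIFORM_REMOTE_WRITE, ONE_WAY_MS))
```

With any one-way latency you plug in, the write pays it four times over, which is exactly the penalty the uniform method imposes on a workload that sits on the “wrong” side.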
The Non-Uniform Access Method
This second and more advanced version of metro can be a true active/active metro setup, without SPoFs and with minimal latency induced by the WAN. This architecture is slightly more expensive, as it uses two full-blown arrays instead of a single array sawed in half:
In the non-uniform access method there are two full-blown arrays, one on each site. On top of that sits some more “magic” to create the actual mirror. Back in the day that might have been an EMC VPLEX system, but nowadays we see this “metronode” functionality more and more being collapsed into the actual storage array itself (as is the case for Dell PowerMax and Dell PowerStore today).
This method is called the non-uniform access method because the I/O moves with the workload. As a result reads are ALWAYS local (no WAN latency) while writes (which need to be mirrored) will traverse the WAN just two times (the actual write and the acknowledgement back).
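To make the read and write paths concrete, here is a minimal sketch that models each I/O as a list of hops between sites and counts the hops that cross the WAN. The site names “A” and “B” and the function names are made up for this illustration; they are not taken from any Dell product.

```python
def count_wan_crossings(hops):
    """A hop crosses the WAN when its source and destination sites differ."""
    return sum(1 for src, dst in hops if src != dst)


def other_site(site):
    """Return the peer site in a two-site metro (hypothetical names A and B)."""
    return "B" if site == "A" else "A"


def non_uniform_read(workload_site):
    """Read served by the array in the same site as the workload."""
    local = workload_site
    return [
        (local, local),  # host asks its local array
        (local, local),  # local array returns the data
    ]


def non_uniform_write(workload_site):
    """Write handled by the local array and mirrored to the peer array."""
    local = workload_site
    peer = other_site(local)
    return [
        (local, local),  # host writes to its local array
        (local, peer),   # local array mirrors the write to the peer array
        (peer, local),   # peer array acknowledges the mirror
        (local, local),  # local array acknowledges the write to the host
    ]


for site in ("A", "B"):
    print(site,
          "read crossings:", count_wan_crossings(non_uniform_read(site)),
          "write crossings:", count_wan_crossings(non_uniform_write(site)))
```

Whichever site the workload runs in, the counts stay the same: the I/O path follows the workload, so moving a VM to the other datacenter does not change the number of WAN crossings.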
Furthermore, this architecture has no SPoF. As each array is a full-blown array with a minimum of two controllers, any single controller failure won’t lead to a failover / recovery event at the metro level; that problem will be handled locally by the array itself, and the metro configuration has no need to act upon single failures.
Finally, let’s talk WAN connectivity. In this architecture the WAN can be radically simpler compared to the uniform access method metro; here it can (and in most cases should) be enough for workloads to see only their local storage array and nothing more. The only thing crossing the WAN in that case would be the link that mirrors the data.
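The difference in failure handling can be expressed as a simple rule of thumb: the metro layer only has to act when a site has no working controller left. The sketch below is a simplified model of that reasoning, with hypothetical site names and controller counts; it is not how any particular array decides on a failover.

```python
def metro_must_act(controllers_per_site, failed_site, failed_count=1):
    """Return True when controller failures in one site leave that site
    without any working controller, so the metro layer has to step in."""
    surviving = controllers_per_site[failed_site] - failed_count
    return surviving < 1


# Uniform method: one dual-controller array sawed in half, one controller per site.
stretched_array = {"A": 1, "B": 1}

# Non-uniform method: a full array (at least two controllers) in each site.
two_full_arrays = {"A": 2, "B": 2}

print("stretched array, one controller lost:",
      metro_must_act(stretched_array, "A"))
print("two full arrays, one controller lost:",
      metro_must_act(two_full_arrays, "A"))
```

In this model a single controller failure always escalates to the metro layer for the stretched array, while the two-array setup absorbs it locally and only escalates when an entire array is lost.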
In all respects I have always been a much greater fan of the non-Uniform access method; it is how enterprise-level DCs are built in my humble opinion.
Needless to say, all metro solutions Dell offers today use the non-uniform access method architecture. We are looking at other things like active/active/active metro using multiple nodes per site, but these things are right now “under consideration” 😉
You can watch the video for this episode below, or skip to part 3 directly.