Continuing the brief history of metro from part 2 of the series with some things that really help in building out a proper active/active metro environment: “sidedness” and “witness”. Both help in recovering from certain failure scenarios while making sure there is no possibility of split brain.
(and if you are not a fan of reading, you can watch the video below or HERE).
When looking at even the simplest failure scenario in a metro environment, the question is always: "how will the system respond?" In a replication scenario the failover is usually triggered by a human decision (it should be a management decision, and for now I'll just assume managers are human too 😉 ).
A metro solution is different; any failure that needs an action performed will in fact trigger that action. There WILL be a response. The real question is: how smart will that response be? And that is where "sidedness" and "witness" come in.
“sidedness” – How to recover when “the other side went dark”
Imagine a metro system without any preference for datacenter A or B; in this case any failure that separates the two arrays would cause data unavailability. Consider this setup:
Here we lost the interlink, so connectivity between the two datacenters is broken. Both arrays now have a question to answer: what happened to the other array? Without more knowledge, and if there is no "side preference", the only safe thing the arrays can do is stop access to ALL metro volumes. As this is not really an acceptable solution, the concept of "sidedness" or "preference" is added:
In the example above you can see that each volume on a metro-enabled cluster gets a “preferred” side. Here we have two metro volumes, and one volume is “preferred-left” while the other is “preferred-right”.
Failures with “preferred” metro volumes
When we use "preferred" metro volumes and consider the same failure as before (losing the interlink), the response is now different. Both arrays see the "other" array go dark, but they have no clue whether the other array is down, or still operating but cut off because of the failing interlink.
In either case they can continue to serve I/O on their preferred volumes. As each volume can only be preferred on one of the two sides, I/O continues on the preferred sides without any risk of split-brain.
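To make this a bit more concrete, here is a minimal Python sketch of the idea. It is purely illustrative and not PowerStore code; the names MetroVolume, Side and on_interlink_loss are made up. Each volume carries a preferred side, and when the interlink drops, an array keeps serving only the volumes that prefer its side:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical model, not actual PowerStore code: each metro volume
# carries a "preferred" side that decides who keeps serving I/O
# when the two arrays can no longer see each other.
class Side(Enum):
    LEFT = "left"
    RIGHT = "right"

@dataclass
class MetroVolume:
    name: str
    preferred_side: Side
    serving_io: bool = True

def on_interlink_loss(local_side: Side, volumes: list[MetroVolume]) -> None:
    """Reaction of ONE array when the other side 'goes dark' and there is
    no witness: keep serving only the volumes preferred on this side."""
    for vol in volumes:
        if vol.preferred_side != local_side:
            # Suspend I/O on non-preferred volumes; the other array may
            # still be alive, so serving them here could cause split-brain.
            vol.serving_io = False

# Example: two metro volumes, one preferred on each side.
volumes_on_left_array = [
    MetroVolume("vol-a", Side.LEFT),
    MetroVolume("vol-b", Side.RIGHT),
]
on_interlink_loss(Side.LEFT, volumes_on_left_array)
for vol in volumes_on_left_array:
    print(vol.name, "serving I/O:", vol.serving_io)
```

Running this on the "left" array leaves vol-a online and suspends vol-b, which is exactly the behavior described above: no coordination needed, yet no split-brain possible.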
This is cool, but far from perfect. Imagine the next failure scenario: Failing an entire array (or even an entire datacenter):
With preference alone, the surviving array cannot tell whether the remote array is really down or just unreachable, so the volumes that were preferred on the failed side would still become unavailable even though a healthy copy remains. In this scenario, a witness may help.
Adding a witness service
A witness runs on a third site. It is a lightweight piece of software, which in the case of the Dell PowerStore is just a service you can run on a Linux machine. It talks to the storage arrays over DIFFERENT PHYSICAL LINKS (not the mirror link!) and acts as a referee to determine what actions to take.
With a witness, if an entire array fails in a metro setup, the remaining array is allowed to continue serving I/O on its preferred volumes but ALSO on its non-preferred volumes. The remaining system uses the witness to determine whether the remote array is still serving I/O (but just disconnected) or is down altogether. In our case, where we failed an entire array, the following would happen:
As you can see, the remaining array sees the remote array go dark. But is this because of an interlink failure, or an array failure? The witness can help here: in this particular case the witness has also lost connectivity with the remote array, so it tells the remaining array that the remote one is down and that it may continue to serve ALL I/O, including the non-preferred volumes.
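As a thought experiment, the earlier sketch can be extended with the witness as a tie-breaker. Again a hypothetical Python sketch (decide_io_policy is an invented name, this is not how PowerStore actually implements it): the surviving array asks the witness whether it can still reach the remote array, and only takes over the non-preferred volumes if the witness has lost it too:

```python
# Hypothetical sketch of witness arbitration, not the actual
# PowerStore implementation.
def decide_io_policy(witness_sees_remote: bool) -> dict:
    """What the surviving array does with its metro volumes after losing
    sight of its partner, given the witness's view of the remote array."""
    if witness_sees_remote:
        # The witness still reaches the remote array: this was "only" an
        # interlink failure. The partner is alive and serving its own
        # preferred volumes, so we must not touch the non-preferred ones.
        return {"preferred": "serve I/O", "non_preferred": "suspend I/O"}
    # The witness has lost the remote array as well: treat it as down and
    # take over ALL metro volumes, including the non-preferred ones.
    return {"preferred": "serve I/O", "non_preferred": "serve I/O"}

print(decide_io_policy(witness_sees_remote=True))   # interlink failure only
print(decide_io_policy(witness_sees_remote=False))  # full array/site failure
```

The key point is that the witness is only consulted as a referee; the data itself never flows through it, which is why it can stay such a lightweight service.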
This helps a GREAT deal in making the system respond in an optimal way. Dell recently added witness support for the PowerStore series of storage arrays in version 3.6 of the PowerStore software.
APD versus PDL
Still, the complexity continues. In several failure scenarios the metro cluster forcefully disables access to one or more volumes (because they are non-preferred). But if the cluster running the workloads on top does not know about this, it won't be able to recover effectively. This is where APD and PDL come in.
Let’s first look at the abbreviations: APD stands for “All Paths Down”, and PDL stands for “Permanent Device Loss”.
APD is simply what happens when a workload has one or more storage paths to a volume, but all of those paths are reported to be "down". Now what? The workload will basically just sit there, because the paths may return at any time. This is unwanted behavior in some metro failure scenarios: as the storage platform sometimes proactively stops serving I/O on certain volumes (to avoid split brain), it would really help if it could tell the workload that a volume is down and won't be coming back any time soon.

That is exactly what PDL is: a SCSI signal to the workload saying that a volume is down and will not return any time soon. The workload (or rather the cluster running the workload) can then take immediate action to resolve the loss of the volume. An even smarter way of recovering from failures!
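To show the difference from the workload's point of view, here is a rough Python sketch. It is purely illustrative: handle_storage_event and the timeout value are made up, and real hypervisors and cluster stacks base this on SCSI sense codes and their own (configurable) timers. The idea is simply that APD means "wait, it might come back", while PDL means "act now":

```python
# Illustrative host-side logic only; real cluster software derives PDL
# from SCSI sense codes and uses its own timers for the APD case.
APD_TIMEOUT_SECONDS = 140  # assumed example value, not a vendor default

def handle_storage_event(event: str, seconds_all_paths_down: float = 0) -> str:
    if event == "PDL":
        # The array explicitly told us the volume is gone and will not
        # return: fail fast and restart the workload where it is visible.
        return "stop the workload and restart it on a host that still sees the volume"
    if event == "APD":
        # No verdict from the array: the paths might come back, so keep
        # retrying, but only up to a timeout before starting recovery.
        if seconds_all_paths_down < APD_TIMEOUT_SECONDS:
            return "keep retrying I/O"
        return "declare the device lost and start recovery"
    return "no action"

print(handle_storage_event("APD", seconds_all_paths_down=30))  # still waiting
print(handle_storage_event("PDL"))                             # act immediately
```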
That is all for this episode. If you are not so much into reading but would rather watch this on video, check it out below!