Blog Series: Metro – Breaking stuff part 1 – Interlink failure

Now that we are done describing the lab in the last episode, we are ready to go out and break stuff! Today we will look at the non-eventful failures first, and then at the first failure we will actually test in the lab: What if the interlink fails?!?

As always, if you’re not the reading type you can watch the video below 🙂

Non-eventful failures

Let’s first quickly look at the less interesting failures. As we have built our metro solution in the non-uniform way, we have redundant controllers on each site. Failing a controller will therefore NOT trigger any metro response, so I am not going to bother testing it 😉

Another uneventful failure is losing (connectivity to) the witness VM. The witness is just there to cope with certain failure scenarios. Losing the witness won’t have any impact until a subsequent failure occurs. And even if that happens, Dell’s metro solutions are built in such a way that even then no split-brain scenario can occur. Therefore that is another scenario I will not be testing. Failing the metro witness is just boring 😉. On to the more interesting stuff!

Failing the metro interlink

This is where things become interesting, with the first of the more eventful failure scenarios: failing the interlink that connects the mirroring arrays!

When the link between the twin datacenters fails, neither array knows what happened: Each array will see its twin as down, but is it REALLY down? Or is it just a broken link? Without a witness, there is no way of telling. So by default each array responds by shutting down access to all of its non-preferred volumes. It’s the only way to be sure no split-brain can occur.

In a scenario where we DO have a witness, an interlink failure will cause both arrays to reach out to the witness. The witness will reply that both arrays are still online, and therefore the end result is exactly the same: Each array shuts down access to all of its non-preferred volumes.
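To make that decision a bit more tangible, here is a minimal Python sketch of the behavior described above. It is purely illustrative and not how the arrays actually implement it: the `Volume` class, the site labels "A" and "B" and the function name are all made up for this example. It only covers the interlink failure case, where the outcome is the same with or without a witness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    name: str
    preferred_site: str  # hypothetical label for the site that "wins" this volume


def respond_to_interlink_failure(site, volumes, witness_reachable):
    """Sketch of what the array at `site` does once the interlink is gone."""
    if witness_reachable:
        # The witness answers that both arrays are still online.
        reason = "witness reports both arrays online"
    else:
        # No witness: the twin might be down, or only the link might be.
        reason = "twin state unknown"
    # Either way, only the preferred side of each volume keeps serving I/O.
    online = [v.name for v in volumes if v.preferred_site == site]
    offline = [v.name for v in volumes if v.preferred_site != site]
    return reason, online, offline


volumes = [Volume("vol-a", "A"), Volume("vol-b", "B")]
for site in ("A", "B"):
    print(site, respond_to_interlink_failure(site, volumes, witness_reachable=True))
```

Note that no volume ends up online at both sites at the same time, which is exactly the point: no split-brain.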

Now imagine you have workloads “just running anywhere”: If your gardener accidentally cuts the interlink, most likely half of your workloads would stop functioning and never recover on their own (you’d need to intervene manually). So two awesome rules of thumb:

  1. ALWAYS make sure your workloads run at the proper (preferred) site;
  2. Never trust your gardener 😛
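Rule 1 is easy to check if you have an inventory of where things run. The sketch below is a hypothetical helper, not a real tool or API: the workload names, volume names and site labels are invented, and it assumes a non-uniform setup where a workload only reaches the array at its own site. It simply lists the workloads that would lose their storage if the interlink were cut right now.

```python
def workloads_at_risk(workloads, preferred_site):
    """Return workloads running at a site that is not preferred for their volume.

    workloads: dict of workload name -> (site it runs at, volume it uses)
    preferred_site: dict of volume name -> preferred site
    """
    return sorted(
        name
        for name, (running_site, volume) in workloads.items()
        if preferred_site[volume] != running_site
    )


# Made-up inventory for illustration only.
workloads = {
    "vm-01": ("A", "vol-a"),
    "vm-02": ("A", "vol-b"),  # runs at A, but vol-b prefers site B
    "vm-03": ("B", "vol-b"),
}
preferred_site = {"vol-a": "A", "vol-b": "B"}

print(workloads_at_risk(workloads, preferred_site))  # ['vm-02']
```

Anything this returns is a candidate to move to its preferred site before the gardener shows up.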

You can watch the video I created “breaking stuff – Interlink failure” below.

