Kubernetes owes much of its reliability to etcd—the distributed key-value store that backs the entire control plane. When etcd is healthy, Kubernetes is healthy. But when etcd members suffer abrupt, simultaneous power-downs, things can go from troubling to catastrophic very quickly.
This post walks through the Good, Bad, and Ugly realities of etcd behavior after disruptive outages, especially in on-prem environments, edge deployments, and clusters without enterprise-level storage platforms.

The Good 😇
etcd uses the RAFT consensus algorithm to keep consistency across odd-numbered clusters (usually 3 or 5 nodes). So etcd + RAFT = eventual recovery… that is, if the on-disk state survives the crash that caused the outage!
RAFT provides a strong safety guarantee: If a majority of members fail and later return with intact, matching data, the cluster will elect a leader and recover automatically once that majority is online. This is what you can observe if you kill a majority of control-plane nodes on a Kubernetes cluster, and then return them to service. Kubernetes recovers because etcd recovers… Usually.
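To make the majority arithmetic concrete, here is a small shell sketch (the helper names `quorum` and `tolerance` are mine, not etcd's) computing how many members must be healthy for a given cluster size:

```shell
#!/bin/sh
# quorum: the minimum number of members RAFT needs to elect a leader.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerance: how many members a cluster of that size can lose.
tolerance() { echo $(( $1 - ( $1 / 2 + 1 ) )); }

quorum 3      # → 2 (a 3-node cluster needs 2 healthy members)
tolerance 3   # → 1
quorum 5      # → 3
tolerance 5   # → 2
```

This is also why four nodes are no better than three: quorum for 4 is 3, so both sizes tolerate only a single failure.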
For example:
A 3-node etcd cluster loses all 3 nodes during a facility-wide power outage. When power returns, all nodes reboot with uncorrupted write-ahead logs (WAL files) and consistent snapshots. In this scenario: RAFT re-establishes quorum, a leader is elected, etcd resumes normal operations, and Kubernetes springs back to life.
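After such a reboot you can verify the recovery from any control-plane node. A sketch using `etcdctl`; the endpoint URLs and certificate paths are placeholders for a typical kubeadm-style layout, adjust them for your environment:

```shell
#!/bin/sh
# Placeholder endpoints and certificate paths -- adjust for your cluster.
export ETCDCTL_API=3
ENDPOINTS="https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379"
CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key"

# Each healthy member should answer.
etcdctl --endpoints="$ENDPOINTS" $CERTS endpoint health

# Exactly one member should report IS LEADER = true.
etcdctl --endpoints="$ENDPOINTS" $CERTS endpoint status -w table
```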
This is the ideal world – the expected consistency model working exactly as designed. Unfortunately, real-world outages rarely behave ideally. Those who have been in IT long enough will know: If things can go sideways, at some point, THEY WILL. Which leads me into “the Bad” part.
The Bad 😈
Real outages often corrupt state and break RAFT’s guarantees. While simultaneous failure sounds harmless in theory, real systems in real scenarios see things like:
- unclean I/O shutdowns;
- cloud instances without write-through guarantees dropping in-flight writes;
- virtualized environments with thin-provisioned storage that fails mid-commit;
- shared storage without a protected write cache losing writes.
These conditions can lead to partial WAL or database corruption, because writes may still have been in flight: there is no fsync() completion when power is harshly removed! As a result, WAL segments may contain torn writes or incomplete commit entries. When the nodes come back, their logs may no longer match, even though they belonged to the same cluster moments earlier. RAFT requires consistent logs among a majority. If even a single member returns with:
- a divergent commit index
- an incomplete WAL entry
- a corrupted snapshot file
…the cluster may be unable to elect a leader, effectively keeping the control plane from coming back online. Symptoms you might encounter include:
- endless leader election loops
- log mismatch errors
- members refusing to join the cluster
- stalled Kubernetes control plane components (api-server timeouts, scheduler deadlock, etc.)
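These symptoms usually show up in the etcd member logs and in `etcdctl` status output. A diagnostic sketch, assuming a systemd-managed etcd and placeholder endpoints (on static-pod installs, inspect the etcd container logs with `crictl logs` instead):

```shell
#!/bin/sh
# Look for repeated elections and log-mismatch complaints in the member logs.
journalctl -u etcd --since "1 hour ago" | \
  grep -Ei 'election|lost leader|mismatch|corrupt|snapshot'

# Compare state across members: unreachable endpoints, or RAFT terms and
# indexes that diverge between members, point at nodes that cannot rejoin.
export ETCDCTL_API=3
etcdctl \
  --endpoints="https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379" \
  endpoint status -w table
```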
In other words: etcd is designed to tolerate node failures, but not corrupted majority state, and this is where things get ugly.
The Ugly 🫣
Manual Disaster Recovery Is Often the Only Recovery Path
When RAFT cannot form quorum due to WAL divergence or corruption, etcd will not fix itself; it was never designed to self-heal beyond what RAFT consensus allows. In these cases, the officially documented recovery path is one of the following:
- Restore a consistent snapshot: select the most recent healthy backup, rebuild etcd as a new cluster and finally reintroduce Kubernetes API servers;
- Manually re-form the cluster by replacing unhealthy nodes. If only one node is healthy: use that node as the new seed, remove and re-add the others as fresh members.
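A restore from snapshot roughly follows these steps. The sketch below rebuilds a single-member seed cluster; the member name, data dir, peer URL, and snapshot path are placeholders for your environment (newer etcd releases prefer `etcdutl snapshot restore` over the deprecated `etcdctl` form):

```shell
#!/bin/sh
# Placeholder names and paths -- adjust for your cluster.
export ETCDCTL_API=3
SNAPSHOT=/var/backups/etcd/snapshot.db

# 1. Sanity-check the backup before trusting it.
etcdctl snapshot status "$SNAPSHOT" -w table

# 2. Rebuild a brand-new single-member cluster from the snapshot.
etcdctl snapshot restore "$SNAPSHOT" \
  --name etcd-1 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster etcd-1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380

# 3. Point etcd at the restored data dir and start it. Once the seed is
#    healthy, bring the kube-apiservers back and grow the cluster again
#    with `etcdctl member add`, one member at a time.
```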
If that isn’t scary enough, consider that these steps are operationally painful:
- Any mistake can lead to permanent cluster data loss;
- kube-apiserver cannot run until etcd is healthy;
- automation rarely handles these scenarios;
- backup freshness becomes critical; restoring an older snapshot may roll back cluster resources or CRD states.
This is why disruptive power events are one of the most dangerous scenarios for Kubernetes control planes—especially without hardened storage or automated etcd backup/restore pipelines.
Why Should You Care?
Why this matters for Kubernetes administrators and platform teams is simple: a Kubernetes cluster will often survive losing nodes, zones, or even a control-plane component, but it WILL NOT SURVIVE losing etcd integrity.
… And it cascades from there. A control-plane outage (no API server access) leads to the inability to schedule workloads or to update or delete pods; cluster autoscaling will fail, and network plugins might fail.
The recommended best practice is to use reliable storage for etcd. Prefer storage with a protected (battery-backed or write-through) write cache, or even better: replicated and synchronously written. Use three or five control-plane nodes, not one, two, or four. Also take regular etcd snapshots and document (and actually TEST) restores and member recreation.
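Taking those snapshots can be as simple as a periodic job on a control-plane node. A sketch with placeholder paths, certificate locations, and a retention count of 7 chosen for illustration:

```shell
#!/bin/sh
# Periodic etcd backup -- placeholder paths, run from a control-plane node.
export ETCDCTL_API=3
BACKUP_DIR=/var/backups/etcd
mkdir -p "$BACKUP_DIR"

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "$BACKUP_DIR/snapshot-$(date +%Y%m%d-%H%M%S).db"

# Keep only the last 7 snapshots so the backup disk does not fill up.
ls -1t "$BACKUP_DIR"/snapshot-*.db | tail -n +8 | xargs -r rm --
```

Ship the snapshots off the node (the failure mode discussed here takes out all local disks at once), and rehearse the restore path from these exact files.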
A recovery that never fails: Configure DR?

If you want an easy recovery from a logical failure, DON’T restore or rebuild etcd… RECOVER! By configuring Disaster Recovery between logically separate Kubernetes clusters, you can always fail over to another (unaffected) Kubernetes cluster. If you choose your solution right, like Portworx DR, you can recover with a single command and fix the broken cluster later, without your boss breathing down your neck. In this case you migrate workloads to a completely different (separate) cluster, with its own etcd that is unrelated to the failing cluster and thus still intact.

Conclusion
Kubernetes is resilient—but only as resilient as its etcd layer. Disruptive power-down events can push etcd from Good to Bad to Ugly faster than most operators expect!
The key message: If etcd loses data integrity, Kubernetes cannot recover automatically. You must be ready to perform manual disaster recovery.
- Harden your storage;
- Take snapshots;
- Test restores;
- Never, ever assume etcd will save you from a chaotic power-loss scenario.
References
etcd disaster recovery documentation: https://etcd.io/docs/v3.5/op-guide/recovery/
etcd FAQ: https://etcd.io/docs/v3.6/faq/
etcd data corruption documentation: https://etcd.io/docs/v3.7/op-guide/data_corruption
Gardener etcd recovery from quorum loss: https://gardener.cloud/docs/other-components/etcd-druid/recovering-etcd-clusters/
Kubernetes docs on “Operating etcd clusters for Kubernetes”: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/