Surviving Cloud Outages: What You Need to Know
Cloud outages are occurring more often than ever before. Addressing them requires technical understanding and design patterns baked into your organization’s enterprise architecture.
The unthinkable has become common: cloud outages. A cloud outage means that essential services such as storage, compute capacity, or middleware are unavailable in one cloud data center or an entire region. With backend and business applications now running in the cloud, such outages affect more than customer-facing web pages and online shops. All applications become unavailable, bringing the entire enterprise to a halt. Employees cannot work, be it in finance, HR, sales, production, customer service, or logistics.
Moreover, an outage is not necessarily temporary. If a data center burns down, its servers and all of their data are gone forever. Thus, CIOs, CISOs, and boards of directors must decide which risks to accept and which to address in their disaster recovery and business continuity planning.
Traditional business continuity management (BCM) covers challenges in the physical realm: the absence of a majority of the workforce (due to a pandemic, for example) or floods and fires destroying office buildings and data centers. Cloud outages are a new scenario – and addressing them requires in-depth technical understanding and design patterns baked into the organization’s enterprise architecture.
Outages in the Public Cloud
A BCM design for cloud outages must ensure the availability of:
Resources such as storage and compute, and
Data and applications.
Besides SaaS applications, what sets clouds apart is that business logic and data live in two kinds of workload:
Infrastructure-as-a-Service (IaaS) workloads running on VMs, and
Platform-as-a-Service (PaaS) workloads, which include (but are not limited to) object storage and database-as-a-service offerings.
Public clouds provide features for surviving more extensive outages. Still, IT departments and risk managers must understand the mechanisms and prepare for potential public cloud outages. In the following, we’ll look at selected Microsoft Azure concepts, though other clouds, such as GCP or AWS, are very similar.
Azure differentiates between zonal, regional, and global, always-on services (Figure 1). A zonal service runs in one data center, and the most iconic zonal services are Azure VMs. If you order such a VM, Azure starts exactly one instance and assigns it to you. Therefore, if this data center is down, your VM is down with it. There is no redundancy.
Figure 1: Cloud service classes - a resilience perspective
Regional services are the first class of services with some redundancy. Many Azure storage services and database-as-a-service offerings belong to this class. They run redundantly in multiple zones within one region. The Azure Zurich region, for example, consists of three zones near the city of Zurich. Local events like a plane crashing into one data center do not bring down regional services. So, regional services can provide limited resilience. However, they do not offer protection against large-scale incidents, such as multiple-day power blackouts all over Switzerland.
Global, always-on services such as DNS or the Azure Active Directory play in a different league. Azure guarantees their availability, no matter what happens in the world. Customers do not have to plan and prepare for disaster recovery in the event of a global cloud outage. There is just one catch: very few services belong to this class. Most are only regional or even only zonal, especially services related to running application code and handling data.
Preparing for cloud outages means balancing costs for increased resilience with the probability and impact of an outage. Do you need a regional or global service – or is a zonal service (plus some backup somewhere else) sufficient (Figure 2)? And do all required and used services meet these needs? A one-VM application is an easy-to-spot issue if the application should continue to run even if there is a regional disaster. It is a scenario in which the service SLA and the service description do not meet the business need.
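The question of whether all used services meet the resilience need can be checked mechanically once each service is tagged with its class. The following is a minimal sketch in Python, not an Azure API: the `Resilience` enum, the `find_gaps` helper, and the service names in the inventory are all hypothetical and serve only to illustrate the comparison.

```python
from enum import IntEnum


class Resilience(IntEnum):
    """Service classes from the text, ordered from weakest to strongest."""
    ZONAL = 1     # runs in one data center
    REGIONAL = 2  # redundant across the zones of one region
    GLOBAL = 3    # always-on, independent of any single region


def find_gaps(required, services):
    """Return the names of services whose class is below the required one."""
    return [name for name, cls in services.items() if cls < required]


# Hypothetical application inventory; names and classes are illustrative.
services = {
    "web-vm": Resilience.ZONAL,
    "orders-db": Resilience.REGIONAL,
    "dns": Resilience.GLOBAL,
}

# Business need: the application must survive the loss of one data center.
print(find_gaps(Resilience.REGIONAL, services))  # ['web-vm']
```

In this made-up inventory, the single VM is the only component below the required class, which matches the one-VM example above.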
Figure 2: Ensuring resilience
Enhancing Resiliency
Resiliency is critical when faced with cloud outages. When a service does not meet the resilience needs, the pattern is simple: duplicate the service in a different zone or region. If an application running on a single VM should keep running when its data center burns down, you need a VM in a second data center some distance away. If a regional power outage around Zurich should not impact your application, add a VM in Geneva, Finland, or Australia. Just consider two details: costs and rerouting.
Having extra VMs only for emergencies in a different Zurich data center does not help without the ability to reroute incoming requests. The solution: load balancers, which are regional services that run even if one data center around Zurich crashes completely. A load balancing service reroutes traffic to VMs in the surviving data center if the one with the initial VM is down. If you must redirect traffic when an entire region fails, always-on services such as DNS or Azure Front Door are the right choice.
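The rerouting decision itself is simple and can be sketched in a few lines of Python. This is not how Azure Load Balancer, DNS, or Azure Front Door are configured or implemented. The `Endpoint` type, the `route` function, and the VM and region names are hypothetical and show only the decision logic: prefer a healthy endpoint in the home region, and fall back to another region only when the whole home region is down.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    region: str
    zone: str
    healthy: bool


def route(endpoints: List[Endpoint], home_region: str) -> Optional[Endpoint]:
    """Pick a target for an incoming request.

    Step 1 mimics a regional load balancer: any healthy endpoint in the
    home region wins. Step 2 mimics a global service redirecting traffic
    to another region. Returns None if nothing is healthy.
    """
    in_region = [e for e in endpoints if e.healthy and e.region == home_region]
    if in_region:
        return in_region[0]
    elsewhere = [e for e in endpoints if e.healthy]
    return elsewhere[0] if elsewhere else None


# Hypothetical deployment: two VMs around Zurich, one in Geneva.
endpoints = [
    Endpoint("vm-a", "zurich", "zone-1", healthy=False),  # data center down
    Endpoint("vm-b", "zurich", "zone-2", healthy=True),
    Endpoint("vm-c", "geneva", "zone-1", healthy=True),
]

target = route(endpoints, home_region="zurich")
print(target.name)  # vm-b
```

If `vm-b` also fails, the same call returns `vm-c` in Geneva, which corresponds to the region-level redirect described above.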
To ease or complicate matters, selected regional services have geo-redundancy features. For example, the Azure Cosmos DB service can back up to and restore from backups in different regions (Figure 3, A), and even allows engineers to configure the service to be geo-redundant (B).
Figure 3: Azure Cosmos DB - making a regional cloud service geo-redundant
It is beneficial to look at the two dimensions of designing failover systems: data and resource capacity (Figure 4). The most expensive design builds up the full capacity for running daily operations in two data centers. One half of your VMs (and other capacities) runs the day-to-day operations, and the other half is idle, hopefully forever. The idle capacity is just there for a failover case. Due to costs, companies implement such architectures only for the most mission-critical systems.
Figure 4: Designing for failover in a cloud outage – the replication and backup dimension and the redundancy dimension
The other extreme is reserving no failover capacity. In other words, the IT department bets that the cloud provider has enough capacity for disaster situations, thereby accepting the risk that no capacity might be available, or only in far-away data centers with high latency. When going for this approach, you should never forget one fact: a typical cloud outage impacts thousands of customers. It is unlikely that cloud providers keep that many idle VMs or service capacities in other data centers of the same region that nobody needs (and pays for) on a day-to-day basis. There are many options between these two extremes of duplicating resources for everything and reserving no capacity for anything.
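The trade-off between the two extremes can be made explicit with a small calculation. The Python sketch below is illustrative only. The `failover_plan` function and all numbers are hypothetical, and it assumes that an idle reserved VM costs as much as a running one, which may not hold for a given provider or pricing model.

```python
def failover_plan(daily_vms, reserved_vms, vm_cost_per_month):
    """Compare reserved failover capacity with what daily operations need.

    daily_vms:         VMs needed to run day-to-day operations
    reserved_vms:      idle VMs kept in a second data center or region
    vm_cost_per_month: assumed price per VM (same for idle and active)
    """
    return {
        # What the company pays every month for capacity it hopes never to use.
        "extra_cost_per_month": reserved_vms * vm_cost_per_month,
        # What must still be obtained from the provider during the outage,
        # in competition with every other affected customer.
        "vms_to_acquire_during_outage": max(daily_vms - reserved_vms, 0),
    }


# Hypothetical numbers: 10 VMs in daily use, 100 currency units per VM.
print(failover_plan(10, 10, 100))  # full duplication
print(failover_plan(10, 0, 100))   # no reservation at all
print(failover_plan(10, 4, 100))   # one of many options in between
```

Full duplication leaves nothing to acquire during the outage at the highest monthly cost; reserving nothing costs nothing upfront and leaves all ten VMs to be found when many other customers are looking as well.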
The second dimension when designing failover systems is the availability and timeliness of the data backups and replicas. The two main replication patterns are “asynchronous” and “synchronous.” The latter updates all data copies in the various data centers before confirming the success of a write operation. The benefit of synchronous replication is that all data copies are always up to date, so applications in the second data center can either wait on standby or actively serve requests alongside the first (active/active).
A data center crash then causes neither data loss nor service disruption. The disadvantage: potential throughput loss due to latency. The slowest data center determines the throughput. Thus, synchronous replication is typical within a region, e.g., across the various data centers around Zurich, but not from Zurich to East US in Virginia. For the latter, asynchronous replication is a typical choice.
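Why the slowest data center determines the throughput can be shown with a back-of-the-envelope model. The Python sketch below assumes that replicas are updated in parallel and that a client issues one write at a time. The function names and all latency figures are hypothetical and not measured values.

```python
def sync_write_latency_ms(local_commit_ms, replica_rtts_ms):
    """Latency of one synchronous write.

    The write is confirmed only after every replica has acknowledged it,
    so the slowest replica round trip is added to the local commit time.
    """
    return local_commit_ms + max(replica_rtts_ms, default=0.0)


def max_serial_writes_per_second(latency_ms):
    """Upper bound for a client that waits for each write to be confirmed."""
    return 1000.0 / latency_ms


# Hypothetical round-trip times in milliseconds.
within_region = sync_write_latency_ms(1.0, [1.0, 2.0])       # nearby zones
across_continents = sync_write_latency_ms(1.0, [2.0, 90.0])  # one far replica

print(within_region)      # 3.0
print(across_continents)  # 91.0
print(max_serial_writes_per_second(within_region) >
      max_serial_writes_per_second(across_continents))  # True
```

Under these assumed numbers, a single distant replica increases the latency of every write roughly thirtyfold, which is why synchronous replication is usually kept within one region.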
Asynchronous replication means a cloud service (or your application) commits updates in the local zone; the application continues while the service pushes updates to a different region on the same or another continent. The latency of the update propagation does not harm your application’s performance. However, changes that have not yet been forwarded are lost in case of an outage. The third, most traditional, and cheapest option is configuring periodic backups instead of setting up replication. In this last scenario, the cloud performs a backup, e.g., every 4 hours or once a day.
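The data-loss behavior of asynchronous replication can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions (a single primary, a single remote replica, in-memory lists instead of real storage), and the `AsyncReplicatedStore` class is hypothetical, not the mechanism of any particular cloud service.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy model: writes commit locally and are forwarded later."""

    def __init__(self):
        self.primary = []        # writes committed in the local zone
        self.replica = []        # writes that reached the remote region
        self._pending = deque()  # committed locally, not yet forwarded

    def write(self, value):
        """Commit locally and return at once; the caller never waits."""
        self.primary.append(value)
        self._pending.append(value)

    def replicate(self, max_items):
        """Forward up to max_items pending writes to the remote replica."""
        for _ in range(min(max_items, len(self._pending))):
            self.replica.append(self._pending.popleft())

    def lose_primary(self):
        """Simulate an outage of the primary; return the writes that are lost."""
        lost = list(self._pending)
        self._pending.clear()
        self.primary = []
        return lost


store = AsyncReplicatedStore()
for i in range(1, 6):
    store.write(f"w{i}")
store.replicate(max_items=3)  # only three writes made it out in time

print(store.lose_primary())  # ['w4', 'w5']
print(store.replica)         # ['w1', 'w2', 'w3']
```

With periodic backups, the same reasoning applies at a coarser grain: everything written since the last completed backup is at risk, so a 4-hour schedule implies a loss window of up to roughly 4 hours.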
So, the replication and redundancy choices impact how well a company survives a cloud data center outage or even a regional cloud outage. Indeed, companies must clarify which of these risks they accept and for which to prepare. The challenge is to design a consistent disaster recovery architecture in the latter case. Cloud services come with different redundancy options. Cloud architects must dig into the details for applications building on five or ten cloud services, even if all are from the same cloud provider, to prepare for cloud outages. In the end, the board of directors expects IT to deliver one of the main cloud marketing promises: no service interruption in the cloud thanks to the fabulous cloud features for redundancy, replication, and backup.