IBM Cloud Outage Causes Disruptions: Learning from the Failure

IBM Cloud suffered a multi-zone outage impacting its services. Here are steps and strategies organizations should take to limit cloud outage risk.

Sean Michael Kerner, Contributor

June 11, 2020


IBM Cloud suffered a major outage on June 9, disrupting much of the company's cloud services for hours and impacting an unknown number of customers.

IBM's Cloud status page provides limited insight into the IBM Cloud outage that spanned multiple hours across North America, Europe and Asia Pacific regions.

"IBM is focused on external network provider issues as the cause of the disruption of IBM Cloud services [June 9]," an IBM spokesperson told ITPro Today. "All services have been restored, and a detailed review is underway to determine the specific cause and address issues that led to the outage."

Network monitoring firm ThousandEyes also tracked the IBM Cloud outage, reporting that issues began to appear at approximately 5:50 p.m. EDT, with a network-wide outage that impacted the reachability of services hosted by the cloud provider. ThousandEyes' analysis found that the outage lasted around two hours, during which time massive packet loss was observed across infrastructure operated by SoftLayer Technologies, which IBM acquired in 2013.

"The massive, global nature of this network outage suggests a routing or other control-plane issue, as opposed to a router failure or fiber cut, which would have contained the outage to a smaller blast radius," Angelique Medina, director of product marketing at ThousandEyes, told ITPro Today.

Cloud Outages Are Infrequent, but They Do Occur

The IBM Cloud outage was not the first time a cloud provider has had a disruption of service. Google Cloud Platform suffered a large outage in June 2019, though Medina noted that Google's outage was somewhat different from IBM's.

"Google’s massive network outage in June of last year underscored the importance of redundancy, as some regions and availability zones [AZs] were impacted, while others were not," she said. "In the case of IBM Cloud’s outage [June 9], the network outage was global, so implementing region and AZ redundancy would not have been enough."

That said, Medina suggests that a hybrid or multicloud architecture can mitigate some of the risks from a global, cloud-wide outage, which, though very uncommon, can occur.

Cooper Lutz, cloud architect at AHEAD, noted that outages aren't exclusive to the public cloud, as anyone who has hosted their own technology can attest. Lutz suggests that by focusing on how to plan for, respond to and solve for outages, organizations can mitigate outage risk and reduce downtime.

For Sarah Terry, director of product management at LogicMonitor, the IBM Cloud outage is the latest in a long line of highly publicized IT outages that impact enterprises' abilities to maintain service levels and employee productivity.

"Cloud technologies provide a wide variety of benefits and flexibility to businesses, but also increase IT infrastructure complexity and risk for many enterprise organizations," Terry told ITPro Today. "Quite simply, more moving parts mean an increased surface area for performance or availability issues."

In September 2019, LogicMonitor conducted research on the impact of IT outages worldwide, discovering that 96% of global IT decision-makers had experienced an outage in the last three years, with 51% of those outages deemed avoidable. The most common causes of downtime include network failure, spikes and surges in usage, software malfunction and failed infrastructure, Terry said.

Best Practices for Mitigating Cloud Outage Risks

There are a number of best practices that organizations can put into place to minimize the impact of a public cloud outage. For LogicMonitor's Terry, monitoring is a good place to start. Cloud users can gain visibility into outages, and so be able to react to them, by ensuring that their monitoring doesn't rely solely on the cloud provider itself, she said.

"All cloud monitoring relies on cloud APIs to some extent, but platforms that use an additional outside-in approach to check availability of cloud resources enable teams to identify issues even when the cloud provider itself is unable to report them," Terry said.
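As a rough illustration of the outside-in approach Terry describes, the Python sketch below compares what a provider's status reports with what independent probes see from outside the cloud. It is a minimal sketch, not any vendor's implementation: the function names, the vantage-point labels and the result categories are all hypothetical, and a real monitoring platform would do far more.

```python
import urllib.error
import urllib.request
from typing import Dict


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Outside-in check: True if the URL answers successfully within the timeout.

    Intended to run from a machine OUTSIDE the cloud being monitored.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, DNS failures, refused connections, timeouts
        # and malformed URLs.
        return False


def classify_outage(provider_reports_healthy: bool,
                    outside_results: Dict[str, bool]) -> str:
    """Combine the provider's own status with outside probe results.

    outside_results maps a (hypothetical) vantage-point name to whether
    its probe succeeded.
    """
    if not outside_results:
        raise ValueError("need at least one outside probe result")
    reachable = sum(outside_results.values())
    if reachable == len(outside_results):
        return "healthy"
    if reachable > 0:
        return "partial_outage"
    # Nothing is reachable from outside. If the provider still reports
    # healthy, the outage is one the provider has not reported.
    return "unreported_outage" if provider_reports_healthy else "reported_outage"
```

In practice the dictionary would be filled by calling something like `http_probe` from several locations outside the provider's network. The "unreported_outage" case is the one Terry highlights: the service is unreachable even though the provider has not yet said so.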

For Matt Wallace, CTO of Faction, a multicloud data services company, the key to minimizing risk is spreading workloads across multiple cloud providers.

"Many customers look at multicloud as a defense against cloud service provider failures," Wallace told ITPro Today. "While we believe that access to innovation across clouds is the most compelling driver for multicloud adoptions, having the ability to practically move a workload from one cloud to another is an incentive that many enterprises will be mindful of immediately after an outage."

Resilience in Cloud Deployment

No single strategy is sufficient to mitigate the risk of an outage. Forrester analyst Dave Bartoletti noted that companies need to use several strategies at once to limit the impact of cloud outages:

  • Where possible, duplicate data across three regions for higher availability.

  • Distribute public network-facing apps across at least two different geographic regions. Implement load balancing at all levels of app and database design.

  • Regularly test failure scenarios.
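The recommendations above can be sketched in a few lines of Python: an app keeps an ordered list of endpoints in different regions, picks the first healthy one, and a small simulation exercises the failure scenarios. This is only an illustration under stated assumptions. The endpoint URLs are made up, the third endpoint on a second provider reflects Medina's multicloud point rather than Bartoletti's list, and real deployments would usually rely on DNS or load balancer failover rather than application code.

```python
from typing import Callable, Optional, Sequence, Set

# Hypothetical endpoints, in priority order: two regions on one provider,
# plus one on a different provider for the case of a provider-wide outage.
ENDPOINTS = [
    "https://us-east.app.example.com",
    "https://eu-west.app.example.com",
    "https://other-cloud.app.example.com",
]


def first_healthy(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


def simulate_failure(down: Set[str]) -> Optional[str]:
    """Failure-scenario test: pretend the endpoints in `down` are unreachable."""
    return first_healthy(ENDPOINTS, lambda endpoint: endpoint not in down)


# Scenario checks that could run regularly, for example in a test suite.
assert simulate_failure(set()) == ENDPOINTS[0]                   # normal operation
assert simulate_failure({ENDPOINTS[0]}) == ENDPOINTS[1]          # one region lost
assert simulate_failure(set(ENDPOINTS[:2])) == ENDPOINTS[2]      # whole provider lost
assert simulate_failure(set(ENDPOINTS)) is None                  # nothing left to fail over to
```

The last scenario is worth testing explicitly, because it shows what the application does when every option is exhausted.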

"The public cloud platforms have tremendous overall reliability and availability, but outages definitely happen — the key is to design apps and infrastructure for maximum resiliency," Bartoletti said.

About the Author

Sean Michael Kerner

Contributor

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He consults to industry and media organizations on technology issues.

https://www.linkedin.com/in/seanmkerner/
