What's Causing Cloud Outages? A Network Managers' Guide

From fat-finger errors to fishing boats, here are the leading reasons cloud outages at AWS, Microsoft, and others are a growing network resilience challenge.

Salvatore Salamone, Managing editor

August 7, 2023

As enterprises rely more and more on cloud services to meet their network infrastructure, compute, data storage, and security needs, cloud computing outages have a significant impact on operations.

Many believe (or hope) that moving services to the cloud will eliminate some of these issues. After all, you would assume cloud providers use the latest technologies, employ staff with expertise in those technologies, and build in plenty of redundancy.

Unfortunately, cloud outages have a lot in common with their data center counterparts. Many occur due to human error, power outages, malicious acts, Mother Nature, or plain bad luck.

What's Causing Cloud Outages?

Several common culprits cause cloud outages. Over the last few years, we have seen examples of each, and all have had a significant impact on the enterprises using the services. Here are some of the top problems that keep recurring.

Configuration mistakes

We're in the age of graphical user interfaces (GUIs) and automation. Yet many critical IT chores, such as deploying a new server, provisioning storage for an application, or setting up new routing tables, are still done manually via command line interfaces (CLIs). As one would expect, that can lead to configuration mistakes.

That is often the case with cloud outages. In October 2021, one such mistake, a routing protocol configuration issue, caused a six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and Oculus VR. As we wrote at that time: "The outage was the result of a misconfiguration of Facebook's server computers, preventing external computers and mobile devices from connecting to the Domain Name System (DNS) and finding Facebook, Instagram, and Whatsapp."

Essentially, the Border Gateway Protocol (BGP) routes that advertise Facebook's networks were withdrawn, so traffic destined for those networks could no longer be routed properly. Resolving the problem was more challenging than normal because not only was communication between routers interrupted, but so, too, were DNS traffic and all applications.
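The effect of a missing route can be illustrated with a toy routing table. This is a minimal sketch, not a BGP implementation: the prefixes come from the reserved documentation address ranges, and the peer names and the `lookup` and `withdraw` helpers are hypothetical names chosen for illustration.

```python
import ipaddress

# Toy routing table mapping prefix -> next hop.
# Prefixes are documentation ranges; all labels are illustrative assumptions.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "peer-a",     # hypothetical DNS servers
    ipaddress.ip_network("198.51.100.0/24"): "peer-a",  # hypothetical app servers
    ipaddress.ip_network("203.0.113.0/24"): "peer-b",   # an unrelated network
}


def lookup(routes, address):
    """Return the next hop for an address (longest-prefix match), or None."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None  # no route: the destination is unreachable
    return routes[max(matches, key=lambda net: net.prefixlen)]


def withdraw(routes, prefix):
    """Return a copy of the table with one prefix removed."""
    gone = ipaddress.ip_network(prefix)
    return {net: hop for net, hop in routes.items() if net != gone}


if __name__ == "__main__":
    dns_server = "192.0.2.53"
    print("Before withdrawal:", lookup(ROUTES, dns_server))
    after = withdraw(ROUTES, "192.0.2.0/24")
    print("After withdrawal:", lookup(after, dns_server))
```

Once the prefix covering the DNS servers is gone, the lookup returns no next hop at all, even though the servers themselves may be running normally. That is why name resolution, and everything that depends on it, fails along with the routes.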

The problem here was that everything ran over the same network. As a result, IT staff could not correct the problem remotely because they could not reach the impacted systems. Making matters worse, staff were locked out of facilities because the physical access control system also ran over that network.
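This kind of shared-fate dependency can be checked for ahead of time. The following is a minimal sketch, assuming a hypothetical inventory that maps each system to the networks it needs; the system and network names are invented for illustration and do not describe Facebook's actual topology.

```python
# Hypothetical inventory: each system mapped to the networks it needs.
# All names are illustrative assumptions, not a real deployment.
DEPENDENCIES = {
    "public_dns": {"backbone"},
    "web_apps": {"backbone"},
    "remote_admin_tools": {"backbone"},
    "door_badge_system": {"backbone"},
    "oob_console_server": {"oob_mgmt"},
}

# Systems that staff rely on to recover from an outage.
RECOVERY_SYSTEMS = {"remote_admin_tools", "door_badge_system", "oob_console_server"}


def systems_down(failed_network, dependencies):
    """Return the systems that stop working when one network fails."""
    return {name for name, nets in dependencies.items() if failed_network in nets}


def surviving_recovery_paths(failed_network, dependencies, recovery_systems):
    """Return the recovery systems still usable after a network failure."""
    return recovery_systems - systems_down(failed_network, dependencies)


if __name__ == "__main__":
    down = systems_down("backbone", DEPENDENCIES)
    usable = surviving_recovery_paths("backbone", DEPENDENCIES, RECOVERY_SYSTEMS)
    print("Down:", sorted(down))
    print("Recovery paths still usable:", sorted(usable))
```

In this sketch, only the out-of-band console server survives a backbone failure. Remove that entry and the set of usable recovery paths is empty, which mirrors the lockout described above.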

Read the rest of this article on Network Computing.

About the Author

Salvatore Salamone

Managing editor, Network Computing

Salvatore Salamone is the managing editor of Network Computing. He has worked as a writer and editor covering business, technology and science; written three business technology books; and served as an editor at IT industry publications including Network World, Byte, Bio-IT World, Data Communications, LAN Times and InternetWeek.
