Lessons From This Year's Data Center Outages: Focus On the Fundamentals
Most downtime incidents in the past year resulted from known causes and were preventable via robust design and processes.
November 27, 2018
Even as new threats to uptime continued to emerge in 2018, known causes led to most data center outages we saw this year.
According to survey results released by the Uptime Institute this summer, nearly one-third of all data centers had an outage in the past year, up from 25 percent the year before. But the increase wasn't due to some deadly new strain of malware.
Instead, the top three causes of downtime were power outages (at 33 percent), network failures (at 30 percent), and IT or software errors (at 28 percent).
Most importantly, 80 percent of data center managers said their most recent outage was preventable.
You can't prevent lightning strikes (such as the one that took down a Microsoft Azure data center in San Antonio in September) or zero-day malware attacks. But with proper planning and data center design, the effects of outages that result from unexpected weather events, attacks, routine human error, or unpatched systems can be minimized.
Just as important is getting a data center up and running quickly after an outage occurs. According to this year’s report by Information Technology Intelligence Consulting, an hour of downtime on average costs data center operators $260,000, while a five-minute outage costs just $2,600.
Infrastructure Redundancy Still Works
At the most basic level, data center systems need backup. Backup for power and cooling systems, backup for data, even backup for the entire data center.
And backups work. According to Uptime, of data centers that had 2N architectures for cooling and power – in other words, a fully redundant, mirrored system – 22 percent experienced an outage last year. That's one-third fewer outages than experienced by those who opted for the cheaper, not-fully-redundant N+1 approach, 33 percent of whom reported outages.
A backup of the full data center makes for even greater reliability. According to Uptime, 40 percent of data center managers said they replicate workloads and data across two or more sites.
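The math behind that kind of redundancy is simple, even if real-world failures are rarely independent; shared dependencies such as the same utility feed, the same software bug, or the same operational mistake are what keep the survey gap between N+1 and 2N smaller than naive arithmetic would suggest. As a rough, hypothetical sketch (the failure probabilities below are invented for illustration, not taken from the Uptime data):

```python
# Rough sketch of why mirrored (2N) systems and second sites reduce outages,
# assuming fully independent failures. The probabilities are invented for
# illustration; they are not drawn from the Uptime Institute survey.

def chance_all_fail(p_single_fails: float, copies: int) -> float:
    """Probability that every one of `copies` independent systems fails at once."""
    return p_single_fails ** copies

p = 0.02  # hypothetical chance a single power or cooling path fails in a given year

print(f"Single path (N):        {chance_all_fail(p, 1):.4%}")  # 2.0000%
print(f"Mirrored on site (2N):  {chance_all_fail(p, 2):.4%}")  # 0.0400%
print(f"Mirrored across sites:  {chance_all_fail(p, 4):.4%}")  # ~0.0000%
```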
"If you have a single data center and there's a lightning strike, you go down," Markku Rossi, CTO at SSH Communications Security, said. "You should have a secondary data center, with physical isolation between them, so they don't rely on the same energy source."
No data center is immune to this problem, he added, pointing to September's outage at Microsoft's South Central US data center.
"Have a second setup and have failover happen instantly," he said.
Whatever backup system is in place, planning and testing are key, Rossi added, and that planning needs to account for the complex nature of today's data centers, where one problem can trigger others.
He used a recent GitHub outage that happened during physical maintenance as an example. "They fixed the physical problem in a few minutes, but it took 24 hours to get the data correctly synchronized," he said.
Data center managers need to pinpoint potential problem areas, then have tools and processes ready to go if something happens.
"Focus on building the processes, building the mindset that you need to prepare for failures," Rossi said.
Harden the Center, Not Just the Perimeter
One of the biggest lessons data center managers should have learned from the most recent malware-related outages is that it's no longer enough to have a hardened perimeter. Attackers will get through.
Healthcare companies, government agencies, educational institutions, and major manufacturers were all hit this year, even though everyone should have already been on high alert after last year's record-setting breaches.
Obviously, keeping defenses up to date to prevent the malware getting in in the first place is crucial. But data center managers have to be prepared to see their perimeter defenses fail and have secondary levels of protection in place.
Those include malicious-traffic detection mechanisms, network defenses such as segmentation, and a least-privilege approach to access and communications.
These could help keep malware from spreading once it gets into a network, or at least slow it down enough to give security teams a chance to respond, said Igor Livshitz, director of product management at GuardiCore, an Israel-based cybersecurity firm.
WannaCry in particular took advantage of an exploit in the Server Message Block (SMB) file-sharing protocol. Data centers should do more to reduce lateral communications, he said.
"In many cases of the WannaCy ransomware in the past year, the main driver for the attack's wide effect was the fact that once these worms got a foothold within the data center, they could easily proliferate," Livshitz said. "In fact, the SMB traffic between the servers wasn't necessary at all. If it would have been blocked, the spread of the attack and damage to the data center could have been massively reduced and the attack detected in an earlier stage before it inflicted such substantial damage."
The big lesson from this year's breaches isn't that there's a big new threat out there that data center managers have to defend against. It's that they need to go back to the basics.
Nearly all data center downtime is due to bad planning and investment decisions, coupled with poor processes or failure to follow processes, Andy Lawrence, executive director of research at Uptime Institute, wrote in his June report. "Almost all failures reported or researched by Uptime Institute have happened before and are often well documented."
Lightning strikes and new types of malware might grab all the headlines, but it's the basics that still matter most when it comes to resilience.