
UK Emergency Planners' Data Center Risk Info Gathering Is Slow Going

The government's disaster mitigation team has found it difficult to get data center operators' pandemic-related outage risk assessments.

UK government officials charged with assisting data center operators running essential digital services during the COVID-19 pandemic haven't been able to get all the data they need as quickly as expected. While data center operators have provided much of the data being sought, they haven't been responsive to requests for their assessments of the fault risks arising from the lockdowns.

Already operating with cut staff, postponed maintenance, and stalled supply chains, operators have braced themselves for intense competition over scarce resources, as the country steps carefully out of lockdown, according to a brief published this week by techUK, the IT trade association. "Heavy competition" would become a "significant challenge," it said.

Money in some cases has started looking tight as well, with customers asking to delay payments for computing services, and with large companies expected to be doing less overall, as the country dips into recession.

The government has asked for data center industry risk assessments repeatedly since March, so it could marshal resources to make sure the national data infrastructure keeps running when the country needs it most. This was part of a government-wide effort to make sure critical national industries kept operating.

Operators have complied with other requests for information made by the Department for Digital, Culture, Media and Sport's dedicated COVID-19 disaster mitigation team -- notably, what support the data center sector needs from the government, which other sectors are vital to its operation, and testimony on the part it plays in delivering the nation's internet services.

Meanwhile, DCMS has been gathering some of the missing information in private conversations it has been holding with each operator. That's how it will have to be done, since the critical information is otherwise not forthcoming, said Emma Fryer, techUK associate director. Government and industry won't know the industry's risk of failure until those one-to-one meetings are concluded.

"While most of the work items gathering information for DCMS were complete or near complete, the data on faults and risks was coming in very slowly," according to techUK minutes of a 1 May teleconference with DCMS. "Fault and risk data to be revisited after conclusion of one-to-one meetings."

Data center operators lobbied the government in March to treat them as part of the critical national infrastructure, so they could keep running without adhering to the most stringent "lockdown" controls the government imposed on the country to stop the coronavirus spreading, techUK reported last month. But they asked not to be formally designated as critical infrastructure, so they could avoid the onerous regulations that come with the designation.

The entire economy ultimately depended on their operations, and it would suffer the ripple effect of their failure if the worst happened, the industry told the government on 13 March.

The DCMS disaster planning team wants the information for that government-wide plan. It wants to identify problems it might solve, such as getting customs officers to help manage constrained supplies of vital data center equipment.

The team wants to understand what effect the coronavirus is having on the industry's ability to withstand faults and disasters. It formally asked for the information again on 9 April, and once more on 17 April.

techUK called in help from other industry bodies when operators were silent. Ed Ansett, executive committee member at one of them, the Data Center Incident Reporting Network, told DCMS in a techUK teleconference on 24 April that the risk of catastrophe would increase because routine maintenance had stopped and stocks of components were running short.

Staff shortages imposed by the lockdown meanwhile raise the risk of problems running out of control, especially as summer approaches and rising temperatures make equipment failures more likely.

DCIRN was formed in 2017 to gather information about risks and faults from the secretive industry, after the May 2017 British Airways data center outage that grounded 672 flights, stranded 75,000 passengers, and disrupted the UK's two major airports on a busy holiday weekend.

DCIRN said at the time that the data center industry ought to follow the example of the airline and nuclear industries in mitigating against disasters by ensuring complete transparency when mistakes happen, instead of its "common practice to cover up disasters in a misguided attempt to protect reputations."

DCMS and techUK members first agreed to "explore risks to functionality and identify where government intervention might be necessary" in a teleconference on 3 April, according to minutes of the meeting techUK published. IBM had drafted a forecast of ways the COVID-19 lockdown would stress the industry. techUK said operators would submit their own risk assessments.

Operators considered the risks with DCMS in another teleconference on 9 April, working through a list of questions DCMS had submitted. techUK said questions on faults and the supply chain had been prioritised. It published the formal list and planned a dossier of answers that would be updated weekly.

techUK published a draft fault risk report on 20 April, but with just one entry, which it had entered itself as a "starter for ten." The report was otherwise empty, awaiting submissions.

"We did not get many responses to our call for typical faults," techUK said on 27 April. "So we called on DCIRN and also have been speaking to Uptime [Institute]: both organisations have useful stats." Uptime assesses data centers' resilience to faults as part of an accreditation regime intended to inform customers. It also regularly surveys its network of members about their downtime incidents.

Fryer's 27 April report included a diagram of data center faults that Uptime had gathered six years earlier. The diagram came from a report for Uptime that said "lack of transparency can be seen as a root cause of outages and incidents" in the data center industry. "Placing blame takes a back seat to identifying root causes and preventing recurrence," said its author, Jason Weckworth, then senior VP and chief operating officer at RagingWire Data Centers.
