Facebook Turned Off Entire Data Center to Test Resiliency

Entire facility, tens of megawatts, shut down for a day

Yevgeniy Sverdlik, Former Editor-in-Chief

September 15, 2014

Jay Parikh, global head of engineering at Facebook, speaking at the @Scale conference in San Francisco.

A few months ago, Facebook added a whole new dimension to the idea of an infrastructure stress test. The company shut down one of its data centers in its entirety to see how the safeguards it had put in place for such incidents performed in action.

Jay Parikh, global head of engineering at Facebook, talked about the exercise in his keynote presentation at the company’s @Scale conference in San Francisco Monday.

“This is not a small thing,” he said. “This is tens of megawatts of power that basically we turned off for an entire day to test how our systems were going to actually respond.”

He didn’t specify which of Facebook’s data centers was shut down. It has its own facilities in Oregon, Iowa, North Carolina and Sweden, and leases wholesale data center space in California and Virginia.

The company ran some “fire drills” beforehand to prepare. Some doubted that the team would actually pull the plug, but it was important that the shutdown really happen. “We turned the entire region off,” Parikh said.

And the prep work paid off. “It was actually pretty boring for us,” he said.

Not everything worked perfectly, and the team put some improvements on the roadmap. But the overall system held up, the applications stayed up, and Parikh’s team plans to continue such stress tests.

An exercise like this falls under one of the key tenets of engineering at Facebook, which is embracing failure, Parikh said. Facebook encourages its engineers to take big risks, without being reckless, and doesn’t punish those who take them and fail.

“We don’t squash those,” Parikh said. The company takes precautions to minimize the consequences of failure, and the team puts a lot of effort into analyzing the causes of failures and recovering from them quickly.
