Using Metrics to Vanquish the Fail Whale

Measuring and analyzing performance data has been the primary weapon in Twitter's ongoing effort to vanquish the "Fail Whale" - the downtime mascot that appears whenever the service is unavailable, according to Twitter's John Adams.

Rich Miller

June 23, 2009


John Adams of the Twitter ops team discusses the use of metrics to improve web site performance at Velocity 2009 (Photo by James Duncan Davidson via Flickr)

Few prominent web sites have failed more often and under closer scrutiny than Twitter. But over the past year the microblogging service has rehabilitated its reputation, improving its uptime even as its traffic has grown phenomenally.

That torrid growth continues, despite reports to the contrary based on ComScore data, according to John Adams of Twitter's operations team, who spoke this morning at the O'Reilly Velocity Conference in San Jose. "There are a lot of reports that our growth is slowing down," said Adams. "I can't say what the real numbers are. But it's just not slowing down at all. All that traffic has led to an insane amount of pain."

Measuring and analyzing performance data has been the primary weapon in Twitter's ongoing effort to vanquish the "Fail Whale" - the downtime mascot that appears whenever Twitter is unavailable.

"You really want to instrument everything you have," Adams told an audience of 700 operations professionals. "The best thing you can do is have more information about your system. We've built a process around using these metrics to make decisions. We use science. The way we find the weakest point in our infrastructure is by collecting metrics and making graphs out of them."

Those metrics are aggregated in a "Lord of the Rings" dashboard ("One dashboard to rule them all") that brings together more than 1,200 data points for staff to track and analyze. That includes data from Twitter's in-house monitoring, as well as from its data center and network services provider, NTT America, and from Google Analytics. Interestingly, one of the most useful data points from Google Analytics is the "Fail Whale" page, which includes analytics code to track error data.

The appearance of the Fail Whale indicates a server error known as a 503, which triggers a "Whale Watcher" script that prompts a review of the last 100,000 lines of server logs to sort out what has happened. Whenever possible, Twitter tries to adapt by slowing site performance rather than serving a 503, according to Adams, who uses "whale" as a verb. "Our general fail mode has been to delay rather than whale," he said. "We hate whale."
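Twitter's actual Whale Watcher script was not shown in the talk, but the idea of reviewing the tail of a log for 503s can be sketched roughly as follows. This is a minimal illustration only: the three-field log format, the function names and the sample data are all assumptions made for the example, not details from Adams' presentation.

```python
from collections import Counter, deque
from typing import Iterable, List


def tail_lines(lines: Iterable[str], limit: int = 100_000) -> List[str]:
    """Keep only the last `limit` lines; the deque discards older ones as it fills."""
    return list(deque(lines, maxlen=limit))


def summarize_503s(lines: Iterable[str], limit: int = 100_000) -> Counter:
    """Count 503 responses per request path within the last `limit` lines.

    Assumes a hypothetical space-separated format: "<timestamp> <path> <status>".
    Lines that do not match that shape are skipped.
    """
    counts: Counter = Counter()
    for line in tail_lines(lines, limit):
        parts = line.split()
        if len(parts) == 3 and parts[2] == "503":
            counts[parts[1]] += 1
    return counts


# Made-up sample data, purely for illustration.
sample = [
    "2009-06-23T10:00:00 /home 200",
    "2009-06-23T10:00:01 /timeline 503",
    "2009-06-23T10:00:02 /timeline 503",
    "2009-06-23T10:00:03 /search 503",
]

# List the paths that produced the most 503s, worst first.
print(summarize_503s(sample).most_common())
```

In practice the `lines` argument could be an open file handle, so the log is streamed rather than loaded whole; only the last 100,000 lines are ever held in memory.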

Adams said the focus on metrics is one of the ways Twitter has matured. "In the beginning we had a lot of cowboy stuff, a lot of changes going on without control," he said. "We've got a handle on that."

In offering advice to other site operators, Adams cited the importance of an off-site status page to keep users informed about problems. Twitter has a status blog on Tumblr. Adams says keeping users informed can reduce "armchair engineering."

"And we've definitely been a victim of that," he said.
