Google: Big Data Shouldn’t Be an Infrastructure Project

Brings service that replaced MapReduce into beta, enhances BigQuery performance, availability, security

Yevgeniy Sverdlik, Former Editor-in-Chief

April 16, 2015

Urs Hölzle, Senior Vice President for Technical Infrastructure at Google, speaks during the Google I/O 2014 conference in San Francisco. (Photo: Stephen Lam/Getty Images)

Google has unleashed a slew of improvements to its cloud big data offerings, including the beta launch of Dataflow, a service based on the technology that replaced MapReduce in the company's own analytics engine.

Google and its competitors are offering users sophisticated analytics without requiring them to build and manage the underlying infrastructure. Both Amazon Web Services and Microsoft Azure have cloud big data offerings, as do IBM and HP, among others.

All of these companies “wrote the book” on building and managing data center infrastructure at massive scale, and Google’s message is, “Don’t try to do it yourself. Let us handle it.”

“For example, you might be collecting a deluge of information and then correlating, enriching and attempting to extract real-time insights,” William Vambenepe, a Google product manager, wrote in a blog post. “Should you expect such feats, by their very nature, to involve a large amount of resource management and system administration? You shouldn’t. Not in the cloud.”

Urs Hölzle, senior vice president of technical infrastructure at Google, made a big splash last June when he said the company had stopped using MapReduce and replaced it with Dataflow, which can do both batch and stream processing for big data applications. MapReduce has for years been the go-to framework for batch processing and underpins Hadoop, the most popular framework for storing data on server clusters and running parallel analytics jobs on that data.

MapReduce has not been an inseparable part of Hadoop since the release of Hadoop 2, however: its YARN resource manager decoupled job scheduling from MapReduce, so the framework can now run a variety of processing models.

In the beta version of Google Dataflow, the user can choose between batch and stream processing modes. The service spins up the resources needed to run the user's program, scales them automatically up to a user-defined ceiling, and tears them down when the work is done.
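As a rough illustration, here is what a minimal pipeline looked like in the Dataflow Java SDK of that period (a hedged sketch, not Google's own example: the project ID, Cloud Storage paths, and worker cap are placeholders, and the class names changed in later Apache Beam releases):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class LineCountSketch {
  public static void main(String[] args) {
    // Options for the managed service; project and bucket are placeholders.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
    options.setProject("my-project-id");
    options.setStagingLocation("gs://my-bucket/staging");
    options.setRunner(DataflowPipelineRunner.class);
    // The service autoscales workers, but never past this user-defined cap.
    options.setMaxNumWorkers(5);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt")) // batch source
     .apply(Count.<String>perElement()) // count occurrences of each line
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         // Format each (line, count) pair as a line of output text.
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output/counts"));
    // run() hands the job to the service, which provisions the VMs,
    // executes the pipeline, and shuts the workers down afterward.
    p.run();
  }
}
```

Swapping the bounded TextIO source for an unbounded one such as Cloud Pub/Sub, and enabling the streaming option, is what moves the same pipeline from batch to stream processing mode.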

The company also added new features to BigQuery, its API-driven SQL analytics service, and extended the service's availability to European data centers. The new features include better security and performance.
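For flavor, a synchronous query through that API might look like this with the Java API client of the era (again a hedged sketch: the project ID and application name are placeholders, while the Shakespeare table is one of BigQuery's public sample datasets):

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;
import com.google.api.services.bigquery.model.TableRow;

public class BigQuerySketch {
  public static void main(String[] args) throws Exception {
    // Application-default credentials; no cluster to stand up or manage.
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(BigqueryScopes.all());
    Bigquery bigquery = new Bigquery.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("bigquery-sketch") // placeholder
        .build();

    // A synchronous SQL query; "my-project-id" is a placeholder.
    QueryRequest request = new QueryRequest().setQuery(
        "SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 5");
    QueryResponse response =
        bigquery.jobs().query("my-project-id", request).execute();

    // Print each returned row's cells, tab-separated.
    for (TableRow row : response.getRows()) {
      System.out.println(row.getF().get(0).getV() + "\t"
          + row.getF().get(1).getV());
    }
  }
}
```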
