Take Control of Hadoop with a Data-Centric Approach to Security
Maintaining a data-centric security strategy as you plan and implement big data projects or Hadoop deployments can neutralize the effects of damaging data breaches and help ensure attackers glean nothing from attempts to breach Hadoop in the enterprise.
August 6, 2015
Reiner Kappenberger is Global Product Manager for HP Security Voltage.
I’ve often said that Hadoop is the biggest cybercrime bait ever created. Why? Well, in the past, attackers had to gain intricate knowledge of a network and go through a lot of work and expense to find the data they wanted to retrieve. In a Hadoop environment, an organization consolidates all of its information into a single destination, making it very easy for criminals to find all the information they want - and more.
It isn’t just the size of the bait that makes Hadoop breaches so treacherous. Hadoop environments are inexpensive to replicate and require no prior knowledge of the data schema used. In just a few days, terabytes of data can be siphoned off and replicated elsewhere.
Hadoop is ground zero for the battle between the business and security. The business needs the scalable, low-cost Hadoop infrastructure so it can take analytics to the next level—a prospect with myriad efficiency and revenue implications. Yet Hadoop includes few safeguards, leaving it to enterprises to add a security layer.
Security cannot afford to lose this fight: Implementing Hadoop without robust security in place takes risk to a whole new level. But armed with good information and a few best practices, security leaders can put an end to the standoff.
With a data-centric security strategy as you plan and implement big data projects or Hadoop deployments, you can neutralize the effects of damaging data breaches and help ensure attackers will glean nothing from attempts to breach Hadoop in the enterprise.
What do I mean by data-centric? Data exists in three basic states: at rest, in use, and in motion. The data-centric approach stands in contrast to traditional network-based security, which has not addressed the need to neutralize the effects of a breach by protecting sensitive data at the field level.
With data-centric security, sensitive field-level data elements are replaced with usable, but de-identified, equivalents that retain their format, behavior and meaning. This means you modify only the sensitive data elements so they are no longer real values, and thus are no longer sensitive, but they still look like legitimate data.
The format-preserving approach can be used with both structured and semi-structured data. This is also called “end-to-end data protection” and provides enterprise-wide data protection that extends into Hadoop and beyond. This protected form of the data can then be used in subsequent applications, analytic engines, data transfers and data stores.
A major benefit is that a majority of analytics can be performed on de-identified data protected with data-centric techniques: data scientists do not need access to live payment card data, protected health information (PHI) or personally identifiable information (PII) in order to achieve the needed business insights.
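To make the idea concrete, here is a minimal Python sketch of format-preserving protection over digit strings, built from a small Feistel-style construction keyed with HMAC. It is purely illustrative and describes no vendor's product; a real deployment would use a vetted, standardized algorithm such as NIST FF1 behind managed keys, but the effect is the same: the protected value has the same length and character set as the original and can be reversed only by someone holding the key.

```python
import hmac
import hashlib

KEY = b"demo-key-change-me"  # illustrative only; real systems use managed keys


def _prf(value: int, round_no: int, width: int) -> int:
    """Keyed pseudo-random function for one round, reduced to `width` decimal digits."""
    digest = hmac.new(KEY, f"{round_no}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** width)


def protect_digits(digits: str, rounds: int = 10) -> str:
    """De-identify a digit string into another digit string of the same length."""
    lw = len(digits) // 2
    rw = len(digits) - lw
    left, right = int(digits[:lw]), int(digits[lw:])
    for r in range(rounds):
        if r % 2 == 0:
            right = (right + _prf(left, r, rw)) % (10 ** rw)
        else:
            left = (left + _prf(right, r, lw)) % (10 ** lw)
    return f"{left:0{lw}d}{right:0{rw}d}"


def reidentify_digits(protected: str, rounds: int = 10) -> str:
    """Reverse protect_digits for authorized re-identification."""
    lw = len(protected) // 2
    rw = len(protected) - lw
    left, right = int(protected[:lw]), int(protected[lw:])
    for r in reversed(range(rounds)):
        if r % 2 == 0:
            right = (right - _prf(left, r, rw)) % (10 ** rw)
        else:
            left = (left - _prf(right, r, lw)) % (10 ** lw)
    return f"{left:0{lw}d}{right:0{rw}d}"


# A national ID or account number keeps its length and digit-only format,
# so downstream systems and analytics continue to work on the protected value.
original = "372546198112"
token = protect_digits(original)
assert len(token) == len(original) and token.isdigit()
assert reidentify_digits(token) == original
```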
Whether you take advantage of commercially available security solutions, or develop your own proprietary approach, the following five steps will help you to identify what needs protecting so you can apply the right techniques to protect it—before you put Hadoop into production.
Audit and Understand Your Hadoop Data
To get started, take inventory of all the data you intend to store in your Hadoop environment. You’ll need to know what’s going in so you can identify and rank the sensitivity of that data. It may seem like a daunting task, but attackers can take your data quickly and sort it at their leisure. If they are willing to put in the time to find what you have, you should be too.
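Even a rough first pass helps. The sketch below is a hypothetical Python example that scans sample records for a few patterns that commonly indicate sensitive data; the pattern names and sample values are illustrative, and a real inventory would combine scans like this with data dictionaries and conversations with the data owners.

```python
import re

# Rough patterns for a first-pass inventory; tune these for your own data sources.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "us_ssn":      re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def scan_sample(lines):
    """Count how often each sensitive-looking pattern appears in a sample of records."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(line))
    return counts


sample = [
    "alice,4111 1111 1111 1111,alice@example.com",
    "bob,123-45-6789,bob@example.com",
]
print(scan_sample(sample))  # e.g. {'credit_card': 1, 'us_ssn': 1, 'email': 2}
```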
Perform Threat Modeling on Sensitive Data
The goal of threat modeling is to identify the potential vulnerabilities of at-risk data and to know how the data could be used against you if stolen. This step can be simple. For example, we know that personally identifiable information always has a high black market value. But assessing data vulnerability isn’t always so straightforward. A date of birth may not seem like a sensitive value on its own, but combined with a ZIP code it gives criminals a lot more to go on. Be aware of how various data can be combined for corrupt purposes.
Identify the Business-Critical Values Within Sensitive Data
It’s no good to make the data secure if the security tactic also removes its business value. You’ll need to know if data has a characteristic that is critical for downstream business processes. For example, certain digits in a credit card number (the first six, which form the bank identification number) identify the issuing bank, while other digits have no value beyond the transaction. By identifying the digits you need to retain, you can be sure to use data masking and encryption techniques that preserve them and still allow re-identification where it is authorized.
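A policy that keeps the issuing-bank prefix and the last four digits while protecting everything in between might look like the sketch below. The protect_digits helper is the hypothetical format-preserving routine shown earlier, standing in for whatever masking or encryption mechanism you actually deploy.

```python
def protect_card_number(pan: str) -> str:
    """Protect a card number while keeping the parts the business still needs.

    Keeps the first six digits (the issuer's BIN) and the last four digits,
    and de-identifies only the middle portion, so routing logic and
    customer-service lookups keep working on the protected value.
    """
    digits = "".join(ch for ch in pan if ch.isdigit())
    bin_part, middle, last4 = digits[:6], digits[6:-4], digits[-4:]
    # protect_digits is the format-preserving sketch from the earlier section.
    return bin_part + protect_digits(middle) + last4


print(protect_card_number("4111 1111 1111 1111"))
# -> a 16-digit value that still starts with 411111 and ends with 1111
```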
Apply Tokenization and Format-Preserving Encryption
You’ll need to use one of these techniques to protect any data that requires re-identification. While there are other techniques for obscuring data, these are particularly suited for Hadoop because they do not result in collisions that prevent you from analyzing data. Each technique has different use cases; expect to use both, depending on the characteristics of the data being de-identified. Format-preserving technologies enable the majority of your analytics to be performed directly on the de-identified data, securing data-in-motion and data-in-use.
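Tokenization can be as simple in concept as the vault sketch below: each distinct input value gets exactly one random token of the same length and character set, and the vault guarantees that two different inputs never share a token, so joins and group-bys over the tokenized column stay accurate. This is a toy, in-memory illustration only; real token vaults are hardened, persistent and tightly access-controlled.

```python
import secrets


class TokenVault:
    """Toy token vault: maps each sensitive value to a unique, same-length token."""

    def __init__(self):
        self._token_for_value = {}
        self._issued_tokens = set()

    def tokenize(self, value: str) -> str:
        # Return the existing token so repeated values tokenize consistently,
        # which keeps joins and aggregations valid on de-identified data.
        if value in self._token_for_value:
            return self._token_for_value[value]
        while True:
            candidate = "".join(secrets.choice("0123456789") for _ in value)
            if candidate not in self._issued_tokens:  # no collisions, ever
                break
        self._issued_tokens.add(candidate)
        self._token_for_value[value] = candidate
        return candidate

    def detokenize(self, token: str) -> str:
        """Authorized re-identification: look the original value back up."""
        for value, issued in self._token_for_value.items():
            if issued == token:
                return value
        raise KeyError("unknown token")


vault = TokenVault()
t1 = vault.tokenize("4111111111111111")
t2 = vault.tokenize("4111111111111111")
assert t1 == t2                                   # deterministic, so analytics still work
assert vault.detokenize(t1) == "4111111111111111"
```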
With Hadoop, you must protect sensitive data before it is ingested. Once data enters Hadoop it is immediately replicated inside your cluster, making it impossible to protect after the fact. By applying your tokenization and format-preserving encryption on data during the ingestion process, you’ll ensure no traces of vulnerable data are floating around your environment.
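In practice, that means the de-identification step sits in the ingestion path itself. The sketch below shows one way that could look with PySpark, applying the hypothetical protect_card_number routine from the previous step as a user-defined function before anything is written into the cluster; the paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("protect-on-ingest").getOrCreate()

# protect_card_number is the hypothetical routine sketched in the previous step.
# Wrapping it as a UDF lets it run on every record during the ingestion job.
protect_udf = udf(protect_card_number, StringType())

raw = spark.read.option("header", True).csv("/landing/payments/")  # illustrative path

protected = (
    raw
    .withColumn("card_number", protect_udf("card_number"))  # de-identify before it lands
    # drop or protect any other sensitive columns here as well
)

# Only the protected form is ever written into the Hadoop cluster, so
# replication spreads de-identified values rather than live card data.
protected.write.mode("overwrite").parquet("/data/payments_protected/")
```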
Provide Data-At-Rest Encryption Throughout the Hadoop Cluster
As just mentioned, Hadoop data is immediately replicated on entering the environment, which means you’ll be unable to trace where it’s gone. When hard drives age out of the system and need replacing, encryption of data-at-rest means you won’t have to worry about what could be found on a discarded drive once it has left your control. This step is often overlooked because it’s not a standard feature offered by Hadoop vendors.
The perfect time to undertake this process is after you’ve done a pilot and before you’ve put anything into production. If you’ve done the pre-work, you’ll understand your queries, and adding the format-preserving encryption and tokenization to the relevant fields can be done very easily, taking just a few days to create a proof of concept.