Masked vs. unmasked data: what should be used in lower environments?
Data security is critical to business survival, and data masking is a key component. Some enterprises protect data by simply redacting sensitive values; others employ fake data or write homegrown tools or scripts to mask certain environments.
Masked vs. Unmasked Data: Why Mask It?
We need to protect data in a way that keeps sensitive information safe and preserves our capacity to test effectively, maintain rapid feature delivery, and draw business insights from data.
That's why masked data is important.
Steering Clear of Data Mismatch
In our app-driven world, there are always concerns over test coverage, testing velocity, and tester productivity. When you test against datasets that originate from different points in time or you run a second test without rolling data back, you create the conditions for data mismatch.
Suppose you have dataset A from January 1 and dataset B from February 1. Even if both datasets A and B have “good” test data, they can yield bad test results because dataset B has changed. The same can happen if you run a test again against a “dirty” dataset because it costs way too much to reset the dataset or to re-mask it. Test failures like these are notoriously difficult to reproduce and correct because the state of the data is fluid or its characteristics no longer match the original dataset.
Data’s Slippery Slope
Well-masked production data preserves existing implicit data relationships across the enterprise. For example (just as with your real data), a newborn probably shouldn’t have an AARP enrollment date. In addition, making sure key data elements mask the same way in disparate systems is crucial for testing. If you mask system A and convert Jason to Bob, but in system B you convert Jason to Chris, matching up test results becomes difficult and labor-intensive.
We could say this is not a problem for unstructured data, but the reality is that unstructured data almost always requires validation or is exchanged with applications that do have structured data. In most cases, production data has already solved these implicit-relationship problems at scale and has been validated so that implausible or impossible values don’t exist.
When you mask data consistently, you can preserve implicit data relationships and beneficial data characteristics without complications. Other approaches, however, have to intuit or declare these data rules for every system that’s added. It can become very expensive to declare all the possible rules needed for the data to reflect the characteristics it should, and that expense often rises in proportion to the number of systems tested together.
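To make the idea of consistent masking concrete, here is a minimal sketch in Python (not Delphix’s implementation): a shared secret key makes the mapping deterministic, so “Jason” masks to the same fictitious name whether it appears in system A or system B. The key value, the name pool, and the `mask_name` helper are all hypothetical.

```python
import hmac
import hashlib

# Hypothetical pool of realistic replacement first names.
FAKE_FIRST_NAMES = ["Bob", "Chris", "Dana", "Elena", "Farid", "Grace", "Hiro", "Ines"]

# A shared secret key: every system that masks data must use the same key
# so the same cleartext value always maps to the same masked value.
MASKING_KEY = b"rotate-me-and-store-me-in-a-vault"

def mask_name(cleartext: str) -> str:
    """Deterministically replace a first name with a fictitious one.

    The keyed hash makes the mapping irreversible (you cannot recover
    "Jason" from the output) yet repeatable (the same input yields the
    same output in every system that shares the key).
    """
    digest = hmac.new(MASKING_KEY, cleartext.lower().encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(FAKE_FIRST_NAMES)
    return FAKE_FIRST_NAMES[index]

# "Jason" masks to the same fictitious name in system A and in system B,
# so joins and cross-system test comparisons still line up.
print(mask_name("Jason"), mask_name("Jason"), mask_name("Maria"))
```

Real masking products layer format preservation, collision handling, and centralized key and lookup management on top of this basic idea, but the core property is the same: one input, one consistent output, everywhere.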
Uncovering the Business Value of Masked Data over Unmasked Data
Data masking, also referred to as de-identification or data obfuscation, is a method of protecting sensitive data by replacing the original value with a fictitious but realistic equivalent that is valuable to testers, developers, and data scientists.
But what makes masked data both protected and useful? While there are plenty of masking technologies in the market, here are the 3 must-have characteristics of a solution that will preserve the data’s business value and ultimately enable quicker testing and better insights for the enterprise.
- Referentially consistent – It maintains referential integrity at scale (within and across applications) in a manner consistent with the unmasked data, so that you find real errors instead of having to build new data that has integrity, or chasing ghost errors caused by mismatched data.
- Representative – It shares many characteristics with the original cleartext, including producing similar test results and data patterns that can yield insight.
- Realistic – The values in masked fields are fictitious but plausible, meaning that the values reflect real life scenarios and the relationships are consistent, but the referent (e.g., the customer whose name you retrieved) doesn’t exist. (So realistic, in fact, that a data thief wouldn’t even know that the data is masked just by looking.)
On a more general note, masking is typically:
- Irreversible – The original protected data is not recoverable from the masked data.
- Repeatable – It can be done again (and on command) as data and metadata evolve.
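As an illustration of “realistic” (and of the irreversible and repeatable properties just listed), here is a minimal, hypothetical sketch that replaces a 16-digit card number with a fictitious one that still passes a Luhn checksum, so downstream validation accepts it, yet the original number cannot be recovered and the same input always produces the same output. The key, the prefix, and the helper names are assumptions for illustration, not any vendor’s algorithm.

```python
import hmac
import hashlib

MASKING_KEY = b"shared-secret-for-all-environments"  # hypothetical key

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit so the masked number validates like a real one."""
    digits = [int(d) for d in partial]
    # Double every second digit from the right (the check digit will sit at the end).
    for i in range(len(digits) - 1, -1, -2):
        doubled = digits[i] * 2
        digits[i] = doubled - 9 if doubled > 9 else doubled
    return str((10 - sum(digits) % 10) % 10)

def mask_card_number(cleartext: str) -> str:
    """Replace a 16-digit card number with a fictitious but Luhn-valid one."""
    digest = hmac.new(MASKING_KEY, cleartext.encode(), hashlib.sha256).hexdigest()
    # A fixed, non-issuable prefix (hypothetical choice), middle digits derived
    # from the keyed hash, then a valid Luhn check digit appended.
    body = "999999" + "".join(str(int(c, 16) % 10) for c in digest[:9])
    return body + luhn_check_digit(body)

masked = mask_card_number("4111111111111111")
print(masked, len(masked))  # 16 digits, Luhn-valid, repeatable, not reversible
```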
Masking Insights: Revealed and Analyzed by the Delphix Experts
How are you protecting sensitive data in non-production environments? In our recent State of Data Compliance and Security Report, 66% of respondents cited the use of static data masking. Discover other masking insights, including how to use masking for data compliance without making trade-offs for quality or speed!
Watch the on-demand webinar to find out!
Making Data Your Innovation Superpower
When the cycle to get freshly masked data takes too long, you slow down development by forcing teams to work with stale data, impacting productivity and producing ghost errors. Even if you can get masked data fast, when masked data is not referentially consistent or synchronous, ghost errors still exist. Similarly, insight on masked data can suffer if the process doesn’t maintain distributed referential consistency. If you’re forced to redact it with dummy values, then it may be perfectly “valid” but worthless for insight.
In short, masked data with business value yields better testing and better insights on the data. With the right solution, you can bring enormous business value by accelerating velocity (in both the feature delivery pipeline and the masked data delivery pipeline), delivering a lower cost of change and error, and making business insights readily available, all while implementing the data protection you need.
Six Key Masking Capabilities that Drive Faster Feature Pipelines
There are six key capabilities a data masking solution must have:
- Rapidly reproduces synchronous, high-fidelity copies of multiple datasets in an on-demand library. Elusive time and referential integrity errors just dissolve when you can provision heterogeneous datasets (masked or unmasked) from the same point in time in just a few minutes.
- Marries virtual data with masked data. That means you don’t have to go through 20 steps and wait 3 days to get your masked data; masked data is always available and ready to deploy at a moment’s notice.
- Maintains automated synchronicity with masking. This makes it possible for disparate, geographically distinct and even air-gapped datasets to all be masked the same way. That means you can be referentially consistent within and across systems.
- Built for scale. Got 12 systems and many datasets totaling 200 TB? No problem: you can virtualize and mask the whole dataset collection and, once it's ingested, deliver the entire collection in minutes.
- Uses a policy-based masking approach. A policy-based obfuscation technique uses the domain of the data and metadata itself to decide how to mask (see the sketch after this list). Combine this with masking in memory, and suddenly, instead of fixing 40+ endpoints, there’s just one. Change management is radically simpler. Consequently, launching masking on a DevOps data platform like Delphix typically takes 80% less time than with traditional solutions. More importantly, it takes 99% less time the second time you mask.
- Makes it dead simple to know and recover the state of one or more large datasets. Typically, recovering from an error takes 10 minutes or less. Ask yourself: how long would it take you to recover an environment with a mix of Oracle, SQL Server, and non-traditional data sources if someone accidentally ran a test? My bet is that the answer is days or weeks, if it’s possible at all.
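To illustrate what “policy-based” means in practice, here is a minimal, hypothetical sketch: masking rules are keyed by data domain (name, email, SSN, and so on) rather than by individual database or file, so a rule change in the central policy applies to every endpoint that consumes it. The domain names, catalog structure, and helper functions are assumptions for illustration, not Delphix’s API.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical pool of replacement names for this sketch.
FAKE_NAMES = ["Bob", "Chris", "Dana", "Elena"]

def pseudonymize(value: str) -> str:
    """Pick a deterministic, fictitious name for a cleartext value."""
    h = int(hashlib.sha256(value.lower().encode()).hexdigest(), 16)
    return FAKE_NAMES[h % len(FAKE_NAMES)]

def fake_digits(value: str, n: int) -> str:
    """Derive n deterministic digits from a value (fictitious, not reversible)."""
    h = hashlib.sha256(value.encode()).hexdigest()
    return "".join(str(int(c, 16) % 10) for c in h[:n])

def mask_ssn(value: str) -> str:
    """Produce a fictitious SSN; 9xx area numbers are never issued as real SSNs."""
    d = fake_digits(value, 6)
    return f"900-{d[:2]}-{d[2:]}"

# Central policy: one masking rule per data *domain*, not per endpoint.
# Changing a rule here changes it for every dataset the policy is applied to.
MASKING_POLICY: Dict[str, Callable[[str], str]] = {
    "FIRST_NAME": pseudonymize,
    "EMAIL": lambda v: pseudonymize(v).lower() + "@example.com",
    "SSN": mask_ssn,
}

# Hypothetical metadata catalog mapping columns to data domains, normally
# produced by automated sensitive-data discovery rather than by hand.
COLUMN_DOMAINS = {
    ("crm", "customers", "first_name"): "FIRST_NAME",
    ("billing", "accounts", "contact_email"): "EMAIL",
    ("hr", "employees", "ssn"): "SSN",
}

def mask_value(system: str, table: str, column: str, value: str) -> str:
    """Look up the column's domain and apply the central policy's rule."""
    domain = COLUMN_DOMAINS.get((system, table, column))
    rule = MASKING_POLICY.get(domain)
    return rule(value) if rule else value  # non-sensitive columns pass through

print(mask_value("hr", "employees", "ssn", "123-45-6789"))
print(mask_value("crm", "customers", "first_name", "Jason"))
```

Because the rules live in one central policy keyed by domain, a change is made once rather than once per endpoint, which is the change-management simplification described above.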
How Masked Data Improves DevOps
What’s the value of these capabilities to your software feature factory? Here’s how well-masked data impacts four key DevOps metrics:
Deployment Frequency
This isn’t how fast we deploy data. It’s the change in how often we can deploy/release code because we have better, more secure data faster. Deployment frequency is a function of stability and deploy time. With the ability to rapidly reproduce high-fidelity copies and with fresh versions of consistent masked data always at the ready, you create an island of data stability. Similarly, a significant portion of deploy time is taken up by test time, which in turn relies in large measure on the time it takes to get the right masked data.
Lead Time for Changes
Lead time is a function of delivery pipeline length, complexity, and volume. From a data perspective, this shows up as data scalability (can I get 3 masked 5TB test datasets?), agility (how fast can I move 3 datasets from box A to box B?), and data transformation challenges (can I automatically and quickly mask the data as change occurs?). Having collections of consistent, masked heterogeneous datasets ready at a moment’s notice creates enormous velocity. Changing a masking rule in one central location instead of at each possible end-point makes change management much faster. Having both of these superpowers creates opportunities to remove steps from the pipeline, reduce unnecessary controls on data, and standardize how environments are built. The result: reduced variability and greater velocity.
Time to Restore
Size, the need for fresh data, and the number of datasets to recover typically make this task bigger and more difficult as it scales up. But the right data management capabilities change that. First, data recovery is not a function of dataset size, with typical times being under 10 minutes for one or more datasets even as the datasets scale. Second, the recency of masked data has nothing to do with the speed of deployment. That is, the most recent masked data is always ready to deploy in that same 10 minutes, and a new set of masked data is repeatedly and frequently available.
Change Failure Rate
On average, defects occur at a rate of 1 to 25 per 1,000 lines of code, and data problems account for about 15% of those defects. It’s hard to keep big datasets, or collections of big datasets, consistent and up to date, so there is often a tradeoff between testing with the best data and finishing on time. But rapid delivery of data at scale, with rapid reset, can make that tradeoff evaporate. With the right masking solution, everyone (yes, everyone) can test with the right data in the same timeframe as their build, getting the best data without sacrificing speed. For example, Fannie Mae reduced their data-related defect rate from an estimated 15-20% to less than 5%.
Masked Data with Delphix
Delphix delivers data masking capabilities that enable businesses to mitigate risk and eliminate barriers to fast innovation. Delphix automatically discovers sensitive data values including names, email addresses, and payment information. Then, it transforms sensitive values into realistic, yet fictitious ones — while retaining referential integrity.
Related blog >> What Is Delphix?
Comply with Privacy Laws and Protect Against Breach
With Delphix, teams centrally define masking policies and deploy them across the enterprise for compliance with key privacy regulations such as GDPR, CCPA, HIPAA, and PCI DSS. And because masking transforms sensitive information, Delphix neutralizes risk of breach in non-production environments that contain vast amounts of data that must be protected from cyberthreats. A recent IDC study found that 77.2% more data and data environments were masked and protected by using Delphix.
Integrate Data Masking and Data Delivery
The Delphix DevOps Data Platform combines data masking with virtualization to deliver compliant data to downstream environments for development, testing, analytics, and AI. Masked, virtual data copies function like physical copies, but they take up a fraction of the storage space and can be automatically delivered in just minutes.
Get Started with Data Masking
Try Delphix data masking and see how Delphix enables fast, automated compliance. Request a no-pressure compliance demo today. You’ll find out why industry leaders choose Delphix to mitigate data risks and accelerate innovation.
This blog was originally published in two parts on September 15, 2019 and December 2, 2019. These parts have since been consolidated into one blog for comprehensiveness.