Accelerating unstructured data compliance with a new approach: sampling

WW Program Director, Offering Management, Information Governance and eDiscovery, IBM

We’re in an age of data protection regulations, with a renewed focus on compliance and privacy. Businesses need to assess the data sources they have across the enterprise to find out whether sensitive information is present in areas where it is not allowed. If they don’t, regulators could find unprotected sensitive data and fine or sanction them, or a data breach could expose that data to the world.

The problem, however, is how to review billions of documents, petabytes of data, all in a reasonable period of time and for a reasonable amount of money. The cost of a data breach involving more than 50 million personal records stood at an estimated $388 million in 2019.

Today’s approaches typically involve indexing every document and searching for personal or sensitive data with pattern-matching or machine learning algorithms. While this approach can be accurate, it can also be expensive and time-consuming. Projects often take months or even years before all data is assessed, which discourages some businesses from even starting.

It is time to approach the issue differently. Instead of indexing every file, it is much more efficient to use statistical sampling to look at a random assortment of files. The initial goal of sampling is to assess where the areas of highest risk are within your enterprise-wide body of data. To do this, you can use a form of random sampling. That is, you assume that compliance issues may exist somewhere and set out to determine where they are and how extensive they are. Once you’ve identified where issues exist, you’ll want to build a more comprehensive index, looking at every file in each area identified as a hot spot and remediating the issues.
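
To make the first pass concrete, here is a minimal Python sketch of ranking areas by sampled risk. It is an illustration under stated assumptions, not a description of how any particular product works: the function name `estimate_risk_by_area`, the `areas` mapping, and the `is_sensitive` detector are all hypothetical, and documents are represented as plain strings rather than real files.

```python
import random
from typing import Callable, Dict, List, Tuple


def estimate_risk_by_area(
    areas: Dict[str, List[str]],
    is_sensitive: Callable[[str], bool],
    sample_size: int = 200,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Rank areas by the share of randomly sampled documents that are flagged.

    `areas` maps a hypothetical area name (a share, a bucket, a mailbox)
    to its documents. `is_sensitive` stands in for whatever detector you
    use, such as pattern matching or a trained classifier.
    """
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    results = []
    for area, documents in areas.items():
        if not documents:
            results.append((area, 0.0))
            continue
        # Sample without replacement; never ask for more than exists.
        k = min(sample_size, len(documents))
        sample = rng.sample(documents, k)
        hits = sum(1 for doc in sample if is_sensitive(doc))
        results.append((area, hits / k))
    # Highest estimated hit rate first: these are the candidate hot spots.
    return sorted(results, key=lambda item: item[1], reverse=True)
```

The areas at the top of the returned list are the ones to index in full. The estimated rates are only as good as the detector and the sample size, so treat them as a way to prioritize rather than as a measurement.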

Finally, once you have a hypothesis that a particular area is “clean,” you’ll want to do another sampling pass. This time you can use negative sampling to confirm, with a given confidence level, that the vast majority of issues have been remediated.
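
One way to read this confirmation pass is as a zero-findings sample: if you draw n random files and none are flagged, you can bound the remaining violation rate at a chosen confidence level. The Python sketch below shows the arithmetic under that interpretation. The function names are hypothetical, and the calculation assumes the area is large relative to the sample and that the detector does not miss violations in the files it inspects.

```python
import math


def clean_sample_size(max_violation_rate: float, confidence: float) -> int:
    """Files to sample so that zero hits supports the 'clean' hypothesis.

    If the true violation rate were at least `max_violation_rate`, the
    chance of seeing no hits in n independent draws is at most
    (1 - max_violation_rate) ** n. We pick the smallest n that pushes
    this chance down to 1 - confidence or below.
    """
    if not 0 < max_violation_rate < 1:
        raise ValueError("max_violation_rate must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    n = math.log(1 - confidence) / math.log(1 - max_violation_rate)
    return math.ceil(n)


def rate_bound_after_clean_sample(sample_size: int, confidence: float) -> float:
    """Upper bound on the violation rate after `sample_size` files, zero hits."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return 1 - (1 - confidence) ** (1 / sample_size)


# Example: to be 95% confident that fewer than 1% of files still violate
# policy, sample this many files and find no violations among them.
print(clean_sample_size(0.01, 0.95))  # 299
```

Note that the required sample size depends on the tolerated rate and the confidence level, not on how many files the area holds. If the pass does turn up a violation, the hypothesis is rejected and the area goes back for remediation.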

By taking this approach, you can not only prioritize your clean-up efforts, but also find many areas that you can be fairly confident don’t have policy violations, eliminating the need for comprehensive indexing there. I believe this approach is a game-changer and will lead to an incredible reduction in the amount of time it takes to assess your unstructured data environment for policy compliance.

IBM Watson Knowledge Catalog InstaScan is a new product that combines statistical sampling with unstructured data management for cloud data sources. It can help you reduce time-to-value and accelerate your journey towards regulatory compliance readiness.

To learn more about how you can accelerate unstructured data compliance at your organization, join our webinar on September 24.