3 steps to effective data classification for business-ready data

Offering Manager, IBM

Global data privacy compliance regulations like the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA) and Brazil’s LGPD have created scrutiny around personal, customer and employee data. This data is growing at a rapid pace, and so are the mandates requiring protection of sensitive and personal data.

As I mentioned in my previous blog post, data is fueling digital transformation. With the ever-increasing growth in data, the opportunity to drive digital transformation is slipping out of reach for those who aren’t able to properly manage their data. Thus, companies need trusted, business-ready data at the speed and scale of the market to help achieve business objectives. To help achieve this business-ready data and prepare for these data privacy requirements, businesses need visibility into the data they’re collecting and storing, in order to determine what’s important and what isn’t.

Data classification is an essential part of a successful data privacy and management strategy. It can help you identify the business value of your data and separate the valuable information from the rest. Implementing an adaptable classification policy as part of your data privacy strategy can help create an organized system that simplifies the process of identifying sensitive data and personal data.

Every business has different data classification needs, and the strategy must be tailored accordingly. The following 3-step action plan can be used to create the foundation of an effective data classification project.

1. Defining the project: Categorizing the type of data

It is important to establish the boundaries of a project at the outset to keep data classification efforts under control. At this stage, consider how granular the classification levels need to be. You should also note anything that’s out of scope, and make sure the scope is evaluated and adjusted regularly.

There are a few key questions you should ask as you define the scope for your data classification project:

  • What data needs to be classified?
  • Who are the contributors?
  • What are the data types?
  • Where does sensitive data live?
  • What are some examples of classification levels such as tags?
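One way to keep the answers to these questions in a single place is to record them as a small data structure. The following is a minimal sketch in Python, not a StoredIQ feature: the class name `ClassificationScope`, its fields, and the sample values are all hypothetical and chosen only to mirror the questions above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationScope:
    """Hypothetical record of the scoping answers for one project."""
    data_sources: List[str]         # what data needs to be classified
    contributors: List[str]         # who reviews and tags documents
    data_types: List[str]           # which data types are in scope
    sensitive_locations: List[str]  # where sensitive data is expected to live
    tags: List[str]                 # classification levels, expressed as tags
    out_of_scope: List[str] = field(default_factory=list)

    def unanswered(self) -> List[str]:
        """Return the scoping questions that still have no answer."""
        required = ["data_sources", "contributors", "data_types",
                    "sensitive_locations", "tags"]
        return [name for name in required if not getattr(self, name)]


# Example values are invented for illustration only.
scope = ClassificationScope(
    data_sources=["hr-share"],
    contributors=[],
    data_types=["pdf", "docx"],
    sensitive_locations=["hr-share/payroll"],
    tags=["public", "internal", "confidential"],
    out_of_scope=["marketing-archive"],
)
print(scope.unanswered())
```

A check like `unanswered()` can be run whenever the scope is reviewed, so that gaps are visible before classification work starts.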

It is equally important to communicate the scope of the project to the entire team to achieve the data classification goal. Showing project goals on a dashboard can help in achieving successful classification. IBM StoredIQ provides a modern and rich dashboard to publish project goals and monitor the progress of the project as well as the contribution of each team member.

2. Finding the right training set: Identify and tag data

Data is an essential resource for any machine learning project. The lack of quality labeled data is also one of the largest challenges facing data science teams. Using automation to help operationalize data can help enterprise companies recover billions of hours of worker productivity across all industries.

Once the scope of data is defined, the next task is to identify all the data that requires classification. One technique is to filter the data set according to the data policies defined in the project. The next step is to preview the resulting set to confirm that the documents are relevant to the defined project. StoredIQ has a modern user interface that provides a preview of each document with highlighting capabilities. Each document can be tagged to create a knowledge base for the data science team, thereby increasing worker productivity.
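The filter-then-tag idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how StoredIQ implements it: the `Document` class, the `POLICIES` table, and the two regular expressions are hypothetical stand-ins for whatever data policies your project defines.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Pattern, Set


@dataclass
class Document:
    path: str
    text: str
    tags: Set[str] = field(default_factory=set)


# Hypothetical policies: tag name -> pattern that signals the tag applies.
POLICIES: Dict[str, Pattern[str]] = {
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def filter_and_tag(documents: List[Document],
                   policies: Dict[str, Pattern[str]] = POLICIES) -> List[Document]:
    """Keep documents that match at least one policy and tag each match."""
    selected = []
    for doc in documents:
        matched = {tag for tag, pattern in policies.items()
                   if pattern.search(doc.text)}
        if matched:
            doc.tags |= matched
            selected.append(doc)
    return selected


docs = [
    Document("a.txt", "Contact jane@example.com for details"),
    Document("b.txt", "Quarterly roadmap, no personal data"),
    Document("c.txt", "SSN 123-45-6789 on file for bob@example.org"),
]
for doc in filter_and_tag(docs):
    print(doc.path, sorted(doc.tags))
```

The documents that survive the filter are the ones a reviewer would then preview and confirm, so pattern matches here are candidates for tagging rather than final labels.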

3. Building the model: Improving efficiency

Once you collect sample documents, you can create an information set (infoset) based on the tags defined in the project scope. This model infoset can then be used for active learning. Active learning is semi-supervised machine learning in which an algorithm interactively queries the user to provide input to the machine learning process. Humans can learn from just a single example and still distinguish new objects with very high precision, while a machine typically needs many labeled examples. Active learning narrows that gap: the machine recommends the most valuable documents for review, which greatly speeds up model improvement.
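To make "recommend the most valuable documents" concrete, here is a minimal sketch of one common selection strategy, uncertainty sampling, where the documents whose scores sit closest to 0.5 are sent to a reviewer first. It is an assumption for illustration only; the source does not say which strategy Cognitive Data Assessment uses. The function names, the `model_score` and `ask_reviewer` callbacks, and the sample scores are hypothetical, and retraining the model between rounds is left out because it depends on the model.

```python
from typing import Callable, Dict, List


def select_for_review(scores: Dict[str, float], batch_size: int = 2) -> List[str]:
    """Return the ids of the documents the model is least sure about.

    `scores` maps a document id to the model's probability that the
    document belongs to the class; 0.5 means maximum uncertainty.
    """
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]


def active_learning_round(model_score: Callable[[str], float],
                          unlabeled: Dict[str, str],
                          ask_reviewer: Callable[[str], str],
                          labeled: Dict[str, str],
                          batch_size: int = 2) -> Dict[str, str]:
    """Score the unlabeled pool, query a reviewer for the most uncertain
    documents, and move those documents into the labeled set."""
    scores = {doc_id: model_score(text) for doc_id, text in unlabeled.items()}
    for doc_id in select_for_review(scores, batch_size):
        labeled[doc_id] = ask_reviewer(doc_id)
        del unlabeled[doc_id]
    return labeled


# Invented scores: "b" and "d" are the closest to 0.5.
example_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.40}
print(select_for_review(example_scores))
```

In a full loop, the model would be retrained on `labeled` after each round and the remaining pool scored again, so that each review session targets the documents the current model finds hardest.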

IBM StoredIQ has a capability called Cognitive Data Assessment that uses active learning for data classification, fusing the training and scoring of data into one loop:

IBM StoredIQ Cognitive Data Assessment uses Active learning for data classification.

You can monitor and maintain data classification policies dynamically, as they can greatly assist in data governance and in preparing for data privacy regulations, as well as in protecting important business documents. Businesses need to establish a process for reviewing and updating policies that involves data scientists and other users, to encourage adoption and ensure that the approach continues to meet the changing needs of the business.

Learn more about how you can use IBM StoredIQ Cognitive Data Assessment for your data classification projects.

Notice: Clients are responsible for ensuring their own compliance with various laws and regulations, including the European Union General Data Protection Regulation and the California Consumer Privacy Act (CCPA). Clients are solely responsible for obtaining advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulations that may affect the clients’ business and any actions the clients may need to take to comply with such laws and regulations. The products, services, and other capabilities described herein are not suitable for all client situations and may have restricted availability. IBM does not provide legal, accounting or auditing advice or represent or warrant that its services or products will ensure that clients are in compliance with any law or regulation.