A Framework that Focuses on the Data in Big Data Governance

Big data types, information governance disciplines, industries, and functions

Founder and Managing Partner, Information Asset, LLC

Big data governance is part of a broader information governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives of multiple functions. However, big data governance is meaningless without an understanding of the underlying data types.

Figure 1. A three-dimensional framework for big data governance

  This article provides a framework for big data governance. As shown in Figure 1, the framework consists of three dimensions:

  • Big data types. Big data can be classified into five types: web and social media, machine-to-machine (M2M), big transaction data, biometrics, and human-generated.
  • Information governance disciplines. The traditional disciplines of information governance (organization, metadata, privacy, data quality, business process integration, master data integration, and information lifecycle management) also apply to big data. For example, sensor data needs to be integrated into a preventive maintenance process. However, if sensors from different machines generate inconsistent event codes, it will be difficult to streamline the maintenance process.
  • Industries and functions. Big data analytics are driven by use cases that are specific to a given industry or function such as marketing, customer service, information security, or information technology.

As mentioned above, big data falls into five categories:

  1. Web and social media data includes clickstream and interaction data from social media such as Facebook, Twitter, LinkedIn, and blogs.
  2. Machine-to-machine data includes readings from sensors, meters, and other devices as part of the so-called “Internet of things.”
  3. Big transaction data includes healthcare claims, telecommunications call detail records (CDRs), and utility billing records that are increasingly available in semi-structured and unstructured formats.
  4. Biometric data includes fingerprints, genetics, handwriting, retinal scans, and similar types of data.
  5. Human-generated data includes vast quantities of unstructured and semi-structured data such as call center agents’ notes, voice recordings, email, paper documents, surveys, and electronic medical records.

A big data framework looks different depending on industry and function.

Healthcare providers

Solution: Patient monitoring
Big data type: M2M data
Disciplines: Data quality, information lifecycle management, privacy

A hospital leveraged streaming analytics technologies to monitor the health of newborn babies in the neonatal intensive care unit. Using these technologies, the hospital was able to predict the onset of disease a full 24 hours before any symptoms appeared. These technologies depended on large volumes of time series data, but this data was sometimes missing when a patient moved, which caused the lead to disengage and stop providing readings. In these situations, the streaming platform applied linear and polynomial regressions to historical readings to fill in the gaps in the time series data. The hospital also tagged all time series data that had been modified by software algorithms, because it felt that in case of a lawsuit or medical inquiry it would have to produce both the original and the modified readings. In addition, the hospital established policies around safeguarding protected health information.

Solution: Predictive modeling based on electronic medical records
Big data type: Human-generated data
Discipline: Data quality

The analytics department at a hospital built a predictive model based on 150 variables and 20,000 patient encounters to determine the likelihood that a patient would be readmitted within 30 days of treatment for congestive heart failure. In one example of the predictive model’s effectiveness, the analytics team identified the patient’s smoking status as a critical variable. At first, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, the analytics team increased the population rate for smoking status to 85 percent of the encounters by applying content analytics to electronic medical records containing doctors’ notes, discharge summaries, and patient physicals. In this way, the team improved the quality of sparsely populated structured data by drawing on unstructured data sources.
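The gap-filling and tagging practice from the patient monitoring example can be illustrated with a small Python sketch. It is not the hospital's implementation: the `Reading` structure, the `imputed` flag, and the five-reading window are assumptions made for illustration, and only a straight-line least-squares fit is shown (the article also mentions polynomial regressions). The point is that the original series is left untouched and every software-generated value is tagged, so both versions can be produced later.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reading:
    t: float                  # time of the reading (hypothetical unit: seconds)
    value: Optional[float]    # None when the lead was disengaged
    imputed: bool = False     # True when software filled in the value


def fit_line(ts: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = a + b*t; returns (a, b)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:
        # A single observed point: fall back to a flat line through it.
        return mean_y, 0.0
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / var_t
    return mean_y - b * mean_t, b


def fill_gaps(readings: List[Reading], window: int = 5) -> List[Reading]:
    """Return a new series; the input list is not modified.

    Each gap is estimated from up to `window` of the most recent
    observed (not imputed) readings and tagged with imputed=True.
    """
    filled: List[Reading] = []
    history: List[Tuple[float, float]] = []  # observed (t, value) pairs only
    for r in readings:
        if r.value is not None:
            history.append((r.t, r.value))
            filled.append(Reading(r.t, r.value, False))
        elif history:
            recent = history[-window:]
            a, b = fit_line([t for t, _ in recent], [v for _, v in recent])
            filled.append(Reading(r.t, a + b * r.t, True))
        else:
            # No history yet: leave the gap rather than invent a value.
            filled.append(Reading(r.t, None, False))
    return filled


original = [Reading(0, 120.0), Reading(1, 122.0), Reading(2, 124.0),
            Reading(3, None), Reading(4, None), Reading(5, 130.0)]
modified = fill_gaps(original)
for r in modified:
    print(r.t, r.value, "imputed" if r.imputed else "observed")
```

Keeping `original` and `modified` as separate series, rather than overwriting gaps in place, is what makes it possible to hand over both sets of readings if they are ever requested.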

Health plans

Solution: Claims analytics
Big data type: Big transaction data
Discipline: Data quality

A large health plan processes over 500 million claims per year, with each claims record consisting of 600 to 1,000 attributes. The plan uses predictive analytics to determine whether certain proactive measures are required for a small subset of members. However, the business intelligence team found that physicians were using inconsistent procedure codes to submit claims, which limited the effectiveness of the predictive analytics. The business intelligence team also analyzed the text within claims documents. For example, the team used terms such as “chronic congestion” and “blood-sugar monitoring” to determine that those members might be candidates for disease management programs for asthma and diabetes, respectively.


Utilities

Solution: Smart meters
Big data type: M2M data
Disciplines: Privacy, information lifecycle management

Several utilities are rolling out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate copious amounts of interval data that need to be governed appropriately. Utilities must safeguard the privacy of this interval data because it can potentially reveal a subscriber’s household activities as well as when a homeowner might be away. In addition, utilities need to establish policies for the archival and deletion of interval data to reduce storage costs.


Retail

Solution: Facebook loyalty app
Big data type: Web and social media
Disciplines: Privacy, master data integration, organization

A retailer’s marketing department might want to use master data on customers, products, employees, and store locations to enrich its Facebook app. The success of the Facebook app depends on a strong foundation of master data management (MDM) and policies around social media governance. For example, the retailer needs to adhere to the Facebook Platform Policies by not using data on a customer’s friends outside of the context of the app. In addition, marketing and social media stewards need to agree on a consistent set of identifiers to link a customer’s Facebook profile with his or her MDM record. Finally, the retailer needs to establish a robust product hierarchy to enable product comparisons. For instance, the retailer would need to know that a customer who purchased a “Whirlpool GX5FHDXVY” already has a product in the “refrigerator” hierarchy.

Solution: Personalized messaging based on facial recognition and social media
Big data type: Web and social media, biometrics
Disciplines: Privacy, business process integration

A March 2012 report from the U.S. Federal Trade Commission details how retailers could potentially use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on their buying behavior and location. While this information could have a tremendous impact on retailers’ loyalty programs, it would also have serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.


Telecommunications

Solution: Customer churn analytics
Big data type: Web and social media, big transaction data
Disciplines: Privacy, master data integration

Telecommunications operators build detailed customer churn models that include social media and big transaction data such as CDRs. However, the overall value of the churn models also depends on the quality of traditional attributes of customer master data such as date of birth, gender, location, and income.

A large operator wanted to implement a predictive analytics strategy around churn management. Analyzing subscribers’ calling patterns has proven to be an effective way to predict churn, and the operator decided to outsource its churn analytics to an overseas vendor. Because the CDRs had to be shipped to the vendor each day, there was significant concern over safeguarding the privacy of customer data. After the appropriate deliberation, the operator decided to mask sensitive data such as subscriber name, because the calling and receiving telephone numbers were the primary fields of value for churn analytics.
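A masking step of the kind the operator chose might look like the following Python sketch. The article does not say how the masking was done; the field names, the record layout, and the use of a keyed hash (HMAC-SHA-256, truncated) are all assumptions made for illustration. The sketch masks the subscriber name and passes the calling and receiving numbers through unchanged, as described above.

```python
import hashlib
import hmac

# Hypothetical field names; a real CDR layout will differ.
SENSITIVE_FIELDS = {"subscriber_name"}


def mask_value(value: str, key: bytes) -> str:
    """Replace a value with a keyed, non-reversible token.

    The same input and key always yield the same token, so the vendor
    can still group records without seeing the underlying value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_cdr(record: dict, key: bytes) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        field: mask_value(str(value), key) if field in SENSITIVE_FIELDS else value
        for field, value in record.items()
    }


secret_key = b"kept-by-the-operator-only"  # illustrative placeholder
cdr = {
    "subscriber_name": "Jane Example",
    "calling_number": "5550100",
    "receiving_number": "5550199",
    "duration_seconds": 312,
}
print(mask_cdr(cdr, secret_key))
```

Which fields belong in `SENSITIVE_FIELDS` is a governance decision rather than a coding one; the code only enforces whatever list the data stewards agree on.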


Insurance

Solution: Claims investigation, underwriting
Big data type: Web and social media
Disciplines: Privacy, business process integration

Many insurance carriers now use social media to investigate claims. However, most regulators still do not permit insurers to use social media to set policy rates during the underwriting process. For example, if a life insurer sees that an applicant’s Facebook profile indicates that she is a student pilot, the insurer cannot use that knowledge to raise her premiums on the grounds that she might be considered a high risk.

Solution: Vehicle telematics
Big data type: M2M data
Discipline: Information lifecycle management

An insurer instituted a pilot program that offered lower rates to policyholders in exchange for the ability to put on-board sensors on motor vehicles. These sensors gathered telematics data to monitor the driving behavior of policyholders. Overwhelmed with a large amount of data, the insurer had to establish a policy regarding the retention period for telematics data.


Banking

Solution: Risk management
Big data type: Web and social media (web content)
Discipline: Master data integration

Risk management departments need to update their customer hierarchies, all of which depend on the most current financial information. For example, when Tata Motors acquired Jaguar, the risk management department had to update the risk hierarchy for Tata Motors to also include any exposure to Jaguar. In another example, a bank developed an economic hierarchy to aggregate its overall exposure to a car manufacturer, its tier 1 and tier 2 suppliers, and the employees of the manufacturer and its suppliers. The risk management department could update its economic hierarchy manually in the event of consolidation between suppliers, or it could use big data technologies to comb through unstructured financial information such as U.S. Securities and Exchange Commission 10-K and 10-Q filings and dynamically update company ownership structures within its MDM hierarchies.

Solution: Credit, collections
Big data type: Web and social media
Discipline: Privacy

Banks must follow regulations such as the United States Fair Credit Reporting Act when using social media for credit decisions. In addition, collections departments must adhere to regulations such as the United States Fair Debt Collection Practices Act, which are designed to prevent collectors from harassing debtors or infringing upon their privacy, including within social media.


Transportation

Solution: Preventive maintenance
Big data type: M2M data
Disciplines: Data quality, information lifecycle management, business process integration, master data integration, metadata

Sensors on a modern train record more than 1,000 different types of mechanical and electrical events. These include operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency is out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” The preventive maintenance team uses predictive models to identify failure events that are highly correlated with preceding events. Consider an example where failure event 1245 is preceded by warning event 2389 in 90 percent of cases. In this example, the operations team must issue a work order for preventive maintenance whenever warning event 2389 is logged into the system.

If the railroad has trains in its fleet from different manufacturers, sensors on different trains might generate different numerical codes for the same event. Similarly, if a particular part failed on one train, the operations department might want to inspect similar parts on other trains, which would be difficult if the same part has different names across trains. Retention of sensor data that is driven by safety regulations is another consideration.
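The two ideas in this example, normalizing manufacturer-specific codes and triggering a work order on a correlated warning, can be sketched in a few lines of Python. Everything beyond event codes 1245 and 2389 is hypothetical: the manufacturer names, the local codes, the three-event lookback, and the structure of a work order are assumptions, and a real predictive model would be considerably more involved.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical metadata: (manufacturer, local code) -> canonical event code.
CODE_MAP: Dict[Tuple[str, int], int] = {
    ("maker_a", 2389): 2389,
    ("maker_b", 71): 2389,    # same warning, different local code
    ("maker_a", 1245): 1245,
    ("maker_b", 9): 1245,     # same failure, different local code
}

# Canonical warning code -> failure code it tends to precede.
WATCHLIST: Dict[int, int] = {2389: 1245}


def normalize(manufacturer: str, local_code: int) -> Optional[int]:
    """Translate a manufacturer-specific code; None if it is unmapped."""
    return CODE_MAP.get((manufacturer, local_code))


def precedence_rate(events: List[int], warning: int, failure: int,
                    lookback: int = 3) -> float:
    """Share of `failure` events that have `warning` among the
    `lookback` events immediately before them (time-ordered input)."""
    total = hits = 0
    for i, code in enumerate(events):
        if code == failure:
            total += 1
            if warning in events[max(0, i - lookback):i]:
                hits += 1
    return hits / total if total else 0.0


def work_orders(raw_events: List[Tuple[str, str, int]]) -> List[dict]:
    """Emit one work order per watched warning, whatever the manufacturer."""
    orders = []
    for train_id, manufacturer, local_code in raw_events:
        code = normalize(manufacturer, local_code)
        if code in WATCHLIST:
            orders.append({"train": train_id, "warning": code,
                           "expected_failure": WATCHLIST[code]})
    return orders


feed = [("T1", "maker_a", 2389), ("T2", "maker_b", 71), ("T2", "maker_b", 5)]
for order in work_orders(feed):
    print(order)
```

Without the shared `CODE_MAP`, the warning logged as 71 on the second manufacturer's train would never reach the watchlist, which is the governance problem the example describes.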

Customer service

Solution: Call monitoring
Big data type: Human-generated
Discipline: Privacy

Customer service departments analyze voice recordings to improve operational efficiency and to support agent training. Before using this data, customer service departments should mask the portions of the voice recordings that contain sensitive information such as social security number, account number, name, and address.

Information technology

Solution: Log analytics
Big data type: M2M data
Discipline: Metadata

IT departments are turning to big data to analyze application logs for slivers of insight that can improve system performance. Because application vendors’ log files are in different formats, they need to be standardized before IT departments can use them.
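As a rough illustration of what standardizing log formats involves, the Python sketch below parses two invented vendor formats into one common record. Both formats, the level abbreviations, and the output field names are hypothetical and are not taken from any real product.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical vendor A line: "2012-03-01 12:00:05 ERROR Disk full"
VENDOR_A = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

# Hypothetical vendor B line: "03/01/2012 12:00:05|W|Slow query"
VENDOR_B_LEVELS = {"E": "ERROR", "W": "WARN", "I": "INFO"}


def parse_vendor_a(line: str) -> Optional[dict]:
    match = VENDOR_A.match(line)
    if not match:
        return None
    stamp, level, message = match.groups()
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {"timestamp": parsed.isoformat(), "level": level.upper(),
            "message": message}


def parse_vendor_b(line: str) -> Optional[dict]:
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None
    stamp, level, message = parts
    try:
        parsed = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"timestamp": parsed.isoformat(),
            "level": VENDOR_B_LEVELS.get(level, "UNKNOWN"),
            "message": message}


def standardize(line: str) -> Optional[dict]:
    """Try each known format; return a common record or None."""
    for parser in (parse_vendor_a, parse_vendor_b):
        record = parser(line.strip())
        if record is not None:
            return record
    return None


for raw in ["2012-03-01 12:00:05 ERROR Disk full",
            "03/01/2012 12:00:05|W|Slow query"]:
    print(standardize(raw))
```

Lines that match no known format come back as `None` rather than being guessed at, so they can be routed to whoever maintains the log metadata.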


Marketing

Solution: Sentiment analysis
Big data type: Web and social media
Disciplines: Master data integration, data quality, privacy

Marketing departments use Twitter feeds to conduct sentiment analysis that helps an organization determine what users are saying about the company and its products or services. For example, the analytics team needs to determine whether references to “@Acme” and “Acme” refer to “Acme Corporation.” Integrating sentiment analysis with a customer’s profile can also be challenging because, in addition to privacy issues, the Twitter handle reveals the user name in only 50 to 60 percent of cases. Marketing might also need to answer the following question: “Do we really believe that Twitter sentiment analysis is representative if users are younger and more affluent than our usual customers?”

Information security

Solution: Network analytics
Big data type: M2M data
Discipline: Metadata

Security Information and Event Management (SIEM) tools aggregate log data from systems, applications, network elements, and security devices across the enterprise. It is highly likely that the log files from two network elements will refer to the same event using different codes. Security professionals need to normalize these event codes before using SIEM analytics.

Organizations will be successful in governing their big data if they adopt a framework that covers the appropriate types of big data, the information governance disciplines, and the specific use cases for their industry and function.
