Understanding data ownership in the data lake
There is much talk about data as a new natural resource. Organizations and citizens across the globe produce data that is authored in many systems and consumed by various organizations and users in different formats. This raises the following questions: Who owns this data? Why is ownership important? And how is the quality of data linked to the assignment of accountability?
Data ownership and social data
Depending on the organizational point of view, different ownership rules may apply to data in different situations. Data ownership becomes tricky when we use external sources, especially social data, which is an essential element as we build our cases for front office digitization, customer sensitive analysis and so on. We deal with a tremendous amount of social data that describes the interactions of people. We move this data around and change it for our analysis, and in the end we face a difficult question: who owns this social data? Is it authentic data that truly originated from a person and has some valid purpose? Or has it been modified until it became fake or misleading, so that using and analyzing it can lead us to unreliable and wrong decisions?
Let’s look at the well-known Hulu litigation case, ongoing since 2011, in which the company had to defend itself against claims regarding disclosure of user information to third parties, such as Facebook and comScore. The claim was that ‘the Like button was configured so that it transmitted the titles of the videos users watched to Facebook’s servers regardless of whether the user clicked the Like button indicating that the user liked the clip.’ In today’s world, many companies use social media for advertising. This case is a good example of the risks companies may face if they don’t understand the information they disclose to third parties. They should explore methods for obtaining consent to avoid those risks and the associated huge penalties.
New technology, connected ecosystems and continuously learning AI systems are dealing with more and more data, and they can suggest decisions that potentially influence our future. The sources of information and their accuracy are key when we collect it, which brings us back to the question of how ownership of the information is handled. How is data collected and why? Who owns the data, particularly personal data? What are the sources and where did they come from? What is the value of the insight given, how is it used and who owns it?
Organizations need leaders to drive data responsibly
The role of the Chief Data Officer (CDO) has become more and more important in this respect. A CDO defines the logical business object models and the governance rules around them, and is ultimately responsible for the quality of the data and its ownership. The CDO is a business executive appointed by the board to create a step-change in the way the organization manages and uses data, and to take over ownership of the information governance program from the CIO. This is a major shift: the key buyer for information governance technology is no longer the CIO, an IT capability buyer, but the CDO, a business solution buyer. As a result, these tools put greater focus on targeted action and automation from the business perspective, but not at the cost of innovation. The CDO collaborates with stakeholders to get their feedback on the impact of the governance procedures, policies and standards, as well as on issues and exemptions from the rules where they are not appropriate.
And this is where the data lake comes into play to enable information ownership capabilities. The data lake reference architecture includes the notion of data-centric security. This uses the business objects and classifications from the data lake’s catalog to map between access rights expressed in terms of the subject areas supported by the data lake, and the data stored in the data lake’s repositories. Thus, the data owner or CDO will define the logical business object models and the governance rules around them, and be ultimately responsible for the quality of the data.
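To make the mapping concrete, here is a minimal sketch of how a catalog could resolve access rights granted per subject area into the physical data sets they cover. It is not taken from any particular product: the CatalogEntry type, the repository and data set names, the users and the accessible_datasets function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record linking a stored data set to a subject area."""
    repository: str
    dataset: str
    subject_area: str


# Hypothetical catalog contents; every name here is illustrative.
CATALOG = [
    CatalogEntry("raw_zone", "crm_contacts", "customer"),
    CatalogEntry("warehouse", "customer_profile", "customer"),
    CatalogEntry("raw_zone", "ledger_entries", "finance"),
]

# Access rights are expressed per subject area, not per physical data set.
ACCESS_RIGHTS = {
    "alice": {"customer"},
    "bob": {"customer", "finance"},
}


def accessible_datasets(user, catalog=CATALOG, rights=ACCESS_RIGHTS):
    """Resolve a user's subject-area rights to the (repository, dataset) pairs they cover."""
    allowed = rights.get(user, set())
    return sorted(
        (entry.repository, entry.dataset)
        for entry in catalog
        if entry.subject_area in allowed
    )
```

Because rights are attached to subject areas, adding a new copy of customer data to another repository only requires a new catalog entry. No per-user permission has to change.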
Data lake needs data owners
The data landscape needs to be divided into subject areas and a business leader needs to be appointed as the owner for each subject area. The business leader of a subject area needs to work with the subject matter experts to define the glossary of core business objects, attributes and relationships within the subject area, the details of the valid values for the attributes and any validation rules. These definitions are stored as metadata. Classification schemes are then created to characterize both data and the assets related to its use and management as follows:
- The business classification schemes define the confidentiality, integrity and criticality of the data.
- The subject area definitions and classification schemes together create a glossary of terms that are used to express the governance rules.
- The governance rules describe how the policies will be implemented. Policies and associated rules can be defined to support internal corporate policy, legislation and regulations that apply in a specific country or industry, or information management best practices.
The data lake stores data from many different processes. This data is distributed and duplicated among the data lake repositories. When data from a system is copied into the data lake as raw data, the owner of the source system owns that data and is responsible for its quality and management. The subject area owner is responsible for approving access to data about their subject area. When an individual needs access to data from a subject area, they (or their manager) initiate a workflow to request access. The workflow notifies the appropriate subject area owner about the request, along with details of the individual, the level of access they require and a justification for the access request.
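The subject area definitions described above can be pictured as a small piece of metadata. The sketch below is one possible shape, assuming a glossary of attributes that each carry a confidentiality classification and an optional set of valid values. The Attribute and SubjectArea classes, the owner name and the example attributes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Attribute:
    """Hypothetical glossary entry for one attribute of a business object."""
    name: str
    confidentiality: str                      # business classification, e.g. "internal"
    valid_values: Optional[frozenset] = None  # None means any value is accepted


@dataclass
class SubjectArea:
    """Hypothetical subject area with a named business owner and a glossary."""
    name: str
    owner: str
    attributes: dict = field(default_factory=dict)

    def add(self, attribute):
        self.attributes[attribute.name] = attribute

    def validate(self, record):
        """Return a list of problems found when checking a record against the glossary."""
        problems = []
        for key, value in record.items():
            attribute = self.attributes.get(key)
            if attribute is None:
                problems.append(f"{key}: not in glossary")
            elif attribute.valid_values is not None and value not in attribute.valid_values:
                problems.append(f"{key}: invalid value {value!r}")
        return problems


# Illustrative subject area; the owner and attributes are made up.
customer = SubjectArea(name="customer", owner="head_of_sales")
customer.add(Attribute("status", "internal", frozenset({"active", "closed"})))
customer.add(Attribute("email", "confidential"))
```

A governance rule can then be written against the glossary terms, for example "attributes classified as confidential must be masked", rather than against individual tables or files.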
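The access request workflow can be sketched in a few lines. This is an illustration under stated assumptions, not a description of a specific workflow engine: the owner registry, the AccessRequest fields and the route_request and decide functions are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical registry of subject area owners.
SUBJECT_AREA_OWNERS = {
    "customer": "head_of_sales",
    "finance": "cfo_office",
}


@dataclass
class AccessRequest:
    """The details the workflow passes on to the subject area owner."""
    requester: str
    subject_area: str
    access_level: str
    justification: str
    status: str = "pending"


def route_request(request, owners=SUBJECT_AREA_OWNERS):
    """Find the owner of the requested subject area and build the notification text."""
    owner = owners.get(request.subject_area)
    if owner is None:
        raise KeyError(f"no owner registered for subject area {request.subject_area!r}")
    message = (
        f"{request.requester} requests {request.access_level} access to "
        f"{request.subject_area}: {request.justification}"
    )
    return owner, message


def decide(request, approver, approved, owners=SUBJECT_AREA_OWNERS):
    """Record a decision; only the owner of the subject area may approve or reject."""
    if owners.get(request.subject_area) != approver:
        raise PermissionError(f"{approver} does not own {request.subject_area}")
    request.status = "approved" if approved else "rejected"
    return request
```

Routing on the subject area rather than on the repository keeps the approval with the business owner, even when the same data is duplicated across several repositories.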
With data comes great responsibility
The data lake provides data curation for security and data protection. Each information collection has an owner, an individual responsible for its appropriate management and governance. If a system and its related business processes create and manage the data, then it is the system owner’s responsibility to supply that data appropriately. Provide your own data, not another system’s data: simply because a system has collected or has access to a set of data does not make it the appropriate provider.