Charting the Data Lake: How models can support the data scientist

Senior Technical Staff Member, IBM Analytics

Perhaps the single most significant change to the analytics landscape in recent years has been the emergence of the data scientist. This role is continuing to evolve, with many organizations still establishing how best to incorporate this relatively new discipline into their ongoing operations. The attitude towards such newly formed data scientist teams will depend a lot on the inherent culture of the organization and on the skill set and ability of the personnel in these teams. Factors influencing the role of data scientist teams in an organization include:

  • Levels of control needed/perceived to be needed – for some organizations it is acceptable to give such teams the freedom and space to carry out their analytical activities with little initial oversight. In other organizations there is a need to enforce significant controls from the beginning.
  • Level of executive support – like any other new function, the degree of C-level support for the data scientist team is critical to shaping the initial evolution.
  • IT and/or business led – whether the data science function will predominantly be led and influenced by IT or by the business.
  • Attitude to governance – whether there is a strong need and support for an effective governance process in the organization.
  • What the organization is trying to achieve – what the overall strategic goals are, and how the data science function is seen as contributing to them.

Typical challenges with on-boarding and managing data scientist teams

When trying to initiate and deploy a data scientist team, a number of common challenges can often occur:

  • Actively leveraging the content from data scientists – how to implement the findings of the data scientist in an efficient and business-meaningful way. Also determining how to ensure that the analysis and discovery undertaken by data scientists is actually of relevance to the business.
  • How to get new data scientists to understand the business – it is ideal for the data scientist to have some sort of business understanding. However, in many cases it is not possible (or economical) for organizations to hire data scientists with the deep business knowledge to accompany their statistical and analytical skills.
  • Commonality and reuse of data scientist work – as the data scientist function grows there is the inevitable need to ensure that a standard or commonly agreed set of tools, structures and methods evolves to ensure consistency of analysis and avoid unnecessary duplication.
  • Less data prep and more data analysis – a common refrain from many organizations is that their data scientists spend far too much time finding, extracting, reformatting and cleaning the necessary data to carry out their analysis. 

IBM and other vendors have been heavily investing in various data scientist infrastructures and products (for example the IBM Watson Data Platform) to provide the necessary support fabric to enable the sharing of analytical insights both between data scientists and between data scientists and other users of the data lake. In addition, there are opportunities for organizations to use artifacts such as the IBM Industry Models to assist in addressing some of the challenges mentioned above.

So what about the role of Industry Models?

As described in previous editions of this series of blogs, assets such as the IBM Industry Models can support the creation and maintenance of the data lake in a number of different ways. Two specific uses of such models in assisting with aspects related to the data scientist are:

  • Extending the common business vocabulary for use by the data scientists
  • Providing predefined common data structures

A single overall business vocabulary

From a models perspective, perhaps the single biggest change has been the increased interest in business glossaries and taxonomies to provide the business vocabulary that can act as the single reference point across the data lake. This pattern also applies when it comes to the role of data scientists.

As the diagram above shows, the common vocabulary can also be extended to accommodate descriptions of and mappings to various data scientist-related artifacts, including data scientist sandboxes. The criteria determining whether data scientist artifacts are included in or described by the business vocabulary may differ across organizations. Such criteria include:

  • Whether the sandboxes or other artifacts being created by the data scientist are of a temporary or more permanent nature. Clearly if a data sandbox is only going to be used by a single data scientist for a couple of weeks and then deleted, there is little or no value in including a reference to it in the business vocabulary.
  • Whether there is a single data scientist or a broader team. Where there are a number of different data scientists, there is potential benefit in including references to certain artifacts in the business vocabulary to enable a more consistent approach and to reduce repeated work across data scientists and data scientist teams.
  • Where staff in the data scientist role are less experienced, the existence of such data science artifacts in the business vocabulary may help increase the skill set of these staff.

Another key consideration is the role of the business vocabulary in guiding and informing the data scientist about which artifacts exist across the data lake. Such a vocabulary can be particularly critical where the data lake is extensive, helping data scientists identify more efficiently the components they need to carry out a particular piece of analysis.
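To make this concrete, here is a minimal sketch of a business vocabulary that maps glossary terms to the data lake artifacts they describe, including longer-lived data scientist sandboxes. All of the term names, definitions and artifact identifiers below are illustrative assumptions, not actual IBM Industry Models content.

```python
# A minimal business vocabulary sketch: each glossary term carries a
# definition and a list of data lake artifacts it describes or maps to.
# Names and definitions here are hypothetical examples.

glossary = {
    "Customer": {
        "definition": "A party that purchases or may purchase products or services.",
        "artifacts": ["warehouse.dim_customer", "hub.customer", "sandbox.churn_study"],
    },
    "Campaign": {
        "definition": "A coordinated set of marketing activities.",
        "artifacts": ["warehouse.fact_campaign", "hub.campaign"],
    },
}

def artifacts_for(term: str) -> list:
    """Return the data lake artifacts described by a vocabulary term."""
    entry = glossary.get(term)
    return entry["artifacts"] if entry else []
```

A data scientist looking for customer data could then query the vocabulary first, rather than trawling the lake, which is precisely the navigation role described above.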

Data scientist hubs

A data scientist sandbox is typically an ad-hoc, temporary structure created by data scientists as part of their experimentation and discovery activities, and is usually oriented towards the specific business question being addressed at the time. Therefore, such structures typically would not benefit from being generated from a set of cross-enterprise predefined logical data models. The level of flexibility needed by data scientists means they are more likely to simply create such structures on the fly.

However, a number of organizations have identified the potential of a set of data scientist “hubs” which cluster together all of the data elements needed by data scientists for a particular concept or area – for example a hub of all of the customer data, or all of the encounter data, or all of the campaign data. These hubs would still share many of the characteristics of the data scientist sandboxes: typically deployed on HDFS or other non-relational DBMSs, and typically very flat structures with large numbers of repeating groups of data.
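The “flat structure with repeating groups” layout mentioned above can be sketched as follows: nested source data is denormalized into a single wide row, with the repeating child records spread across numbered columns. The record shape and field names are assumptions for illustration only.

```python
# Sketch of flattening a nested customer record into the wide,
# repeating-group layout typical of a data scientist hub.
# Field names ("accounts", "balance", etc.) are hypothetical.

def flatten_customer(record: dict, max_accounts: int = 3) -> dict:
    """Flatten nested account data into numbered repeating-group columns."""
    flat = {"customer_id": record["customer_id"], "name": record["name"]}
    accounts = record.get("accounts", [])
    for i in range(max_accounts):
        acct = accounts[i] if i < len(accounts) else {}
        flat[f"account_{i + 1}_id"] = acct.get("id")
        flat[f"account_{i + 1}_balance"] = acct.get("balance")
    return flat
```

The resulting rows trade storage efficiency for analytical convenience: every attribute a data scientist is likely to need sits in one denormalized row, with no joins required.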

In terms of the possible benefits of such an intermediary layer of data scientist hubs, consider the two diagrams below.

In the above diagram the three different sandboxes need to be separately populated with data from the potentially many different sources.

For each sandbox, the data scientist team needs to build out the full set of data extraction and data preparation logic needed to ingest the relevant data into the sandbox. Depending on the complexity, variety and location of the different data sources, this ingest can consume considerable resources. Indeed, in many cases the resources spent on the identification, extraction and preparation of data can far exceed the resources actually spent on the analysis and insight creation – figures of 70-80% of time being spent on overall data ingest activities have been mentioned.

There is also the potential for duplication of such data ingest work where different data scientists or data scientist teams are extracting data from the same sources for slightly different purposes.
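One way to avoid that duplication is to register each source's extraction and preparation logic once, so that every sandbox reuses the same curated pipeline rather than re-implementing it. The sketch below assumes a hypothetical CRM source and cleaning rules; none of these names come from a real product.

```python
# Sketch of a shared registry of extract-and-prepare pipelines, so each
# sandbox reuses one curated ingest step instead of duplicating it.
# Source names and cleaning rules are illustrative assumptions.

PIPELINES = {}

def register(source_name):
    """Decorator that registers a reusable extract-and-prepare step."""
    def wrapper(fn):
        PIPELINES[source_name] = fn
        return fn
    return wrapper

@register("crm_customers")
def prepare_crm_customers(raw_rows):
    # Shared cleaning: normalize names and drop rows without an id.
    return [
        {"id": r["id"], "name": r["name"].strip().title()}
        for r in raw_rows
        if r.get("id")
    ]

def populate_sandbox(source_name, raw_rows):
    """Any sandbox pulls from the registered pipeline rather than rebuilding it."""
    return PIPELINES[source_name](raw_rows)
```

Because every team calls `populate_sandbox` against the same registered step, corrections to the cleaning logic propagate to all sandboxes at once.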

Now consider the next diagram below, where an intermediary layer of data scientist hubs has been inserted into the landscape.

In this case a set of data scientist “hubs” has been put in place. The intention of such hubs is to enable central IT resources to create common artifacts for use by the various different data scientists. Unlike the more business-issue oriented sandboxes, these hubs are likely to be concept oriented – as in the example above, they contain all the likely data elements needed by data scientists for common areas such as Customer, Campaign, Product and so on. Some considerations regarding the implementation of these hubs include:

  • As they would be managed by IT, there is a requirement for sufficient levels of communication and understanding between the central IT and data scientist teams to allow IT to provide the specific data in the required format to data scientists on an ongoing and timely basis.
  • The data scientist function needs to be large enough to justify the creation of such a layer of data structures.
  • Where data scientist staff are new and/or inexperienced, with little knowledge of the business, such data scientist hubs can help them become more self-sufficient and productive in a shorter time.

There is also the possibility that such data scientist hubs are at least partly derived from the same set of models used to generate other artifacts across the data lake, thereby assisting with consistency between the data scientist function and the broader set of activities by the general business user population.

So the existence of such data scientist hubs can relieve data scientist teams of much of the potential data ingest activity and allow them to spend more time on analysis. There is also the added benefit that such hubs promote a degree of consistency – for example, different data scientist teams using Customer data from the same curated source, instead of teams using different, overlapping, conflicting or non-curated data with the potential for incorrect or confusing analysis and insights being provided to the business.

Charting the data lake blogs

This is the final edition in the current series of “Charting the Data Lake” blogs. This series has attempted to describe the various ways in which model artifacts such as the IBM Industry Models can assist with the development and maintenance of data lakes, how they can provide a common vocabulary for all the different aspects of a data lake, and how they can be used to provide consistent structure, where appropriate, to the various data lake components.

This is a constantly changing environment, so we hope to return to these blogs in the not-so-distant future to update this material as further changes occur in the use of models with Hadoop and data lakes.