Scale Up Data Integration with Data Models and Inferencing

Learn how data modeling can drive integration using automated inferencing

Big data is often defined as the ability to derive new insights from data that has scaled up along three axes known as the three v’s: volume, velocity, and variety. Recent industry surveys indicate that many organizations find variety more challenging than either volume or velocity.

The problem of integrating data that conforms to different models is nearly as old as punched cards. Today, the growing number of data sources becoming available—whether from legacy systems or from emerging sources such as the Internet of Things—means that large-scale data processing can benefit greatly from techniques that accommodate variety. For example, these techniques can enhance the use of tools such as the Apache Hadoop framework.

When doing data integration, programmers can always write code to convert data from one source to align better with data from another source. If, on the other hand, relationships can be defined between data from different sources using standards-based models, these models can drive the integration, often with no need for new transformation code.

Scaling data models with the data

Data modeling systems in which developers create diagrams and then generate Java code from them are nice, but as the code evolves over time, the model diagrams become little more than outdated system documentation. When a system uses standards-based, machine-readable models to implement integrations, the system’s use of the models can scale up with its use of the data, allowing it to handle an increasing variety of data sources. And this technique can be applied at very large scale on a Hadoop cluster.

The simplicity of the data model described by the World Wide Web Consortium (W3C) Resource Description Framework (RDF) standard [1] makes it well suited for representing instance data combined from disparate sources, including data drawn from relational databases, spreadsheets, and other non-RDF sources. Using a tool such as TopQuadrant TopBraid Composer, developers can use the W3C’s RDF Schema (RDFS) language [2] to create a data model that describes relationships between classes and properties—or tables and columns—from different data sources. Then, they can use commercial and open-source RDFS inferencing engines to automate the mapping of instance data to conform to the integration model.
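
As a minimal sketch of what such an integration model can look like—built here with the open-source rdflib library for Python, and with namespaces and property names invented purely for illustration—each source column is simply declared a subproperty of a canonical property:

```python
# Sketch of an RDFS integration model built with rdflib.
# The namespaces and property names are hypothetical examples.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

HR = Namespace("http://example.org/hr#")            # columns from one source database
PAYROLL = Namespace("http://example.org/payroll#")  # columns from a second source
CANON = Namespace("http://example.org/canonical#")  # the enterprise canonical model

model = Graph()
# Each source column is modeled as a subproperty of the canonical property.
model.add((HR.last_name, RDFS.subPropertyOf, CANON.familyName))
model.add((PAYROLL.LastName, RDFS.subPropertyOf, CANON.familyName))

# Serialize the model as Turtle so inferencing engines and scripts can read it.
print(model.serialize(format="turtle"))
```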

Automated inferencing can be used for complex tasks, but simple versions of it can make great contributions to mundane but common business problems—especially data integration. For example, if employee 943 stored in one database has a last_name value of “Smith” and this last_name column is modeled as a subproperty of an enterprise canonical data model’s familyName property, then the system can infer that employee 943 has a familyName value of “Smith” from the canonical model’s perspective. Later, if staff member X43B7 from a different database with different column names has a LastName value of “Jones” and this LastName property from that database is also modeled as a subproperty of the canonical model’s familyName property, the same tool can infer that staff member X43B7 from the second database has a familyName value of “Jones.” This automated inferencing integrates data from an additional source, not by writing additional transformation code to handle the new source, but by simply adding a new subproperty relationship to the canonical data model.
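
Here is one way that subproperty inference might look in Python with rdflib; the employee identifiers and namespaces mirror the hypothetical example above and are not drawn from any real system:

```python
# Sketch of RDFS subPropertyOf inferencing with rdflib; all names are
# hypothetical and mirror the employee example in the text.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

HR = Namespace("http://example.org/hr#")
PAYROLL = Namespace("http://example.org/payroll#")
CANON = Namespace("http://example.org/canonical#")
EMP = Namespace("http://example.org/employee/")

g = Graph()
# The integration model: both source columns map to the canonical property.
g.add((HR.last_name, RDFS.subPropertyOf, CANON.familyName))
g.add((PAYROLL.LastName, RDFS.subPropertyOf, CANON.familyName))
# Instance data as it arrived from the two sources.
g.add((EMP["943"], HR.last_name, Literal("Smith")))
g.add((EMP["X43B7"], PAYROLL.LastName, Literal("Jones")))

# Apply one RDFS rule: if p rdfs:subPropertyOf q and (s, p, o) holds,
# then infer (s, q, o).
inferred = []
for sub_prop, super_prop in g.subject_objects(RDFS.subPropertyOf):
    for s, o in g.subject_objects(sub_prop):
        inferred.append((s, super_prop, o))
for triple in inferred:
    g.add(triple)

# Both employees now have canonical familyName values.
for s, o in g.subject_objects(CANON.familyName):
    print(s, o)
```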

Driving integration

Tools are available to make this kind of integration possible at increasingly large scales. For example, a short Python script running on a Hadoop cluster can read these models, read the instance data from different sources, and then infer the canonical data model versions of that data for use by other applications [3].
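
For illustration only—this is not the script described in the blog post cited below, and the file name and namespaces are assumptions—a Hadoop Streaming mapper applying this approach might look roughly like this:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: read N-Triples instance data from
# stdin, apply the model's rdfs:subPropertyOf mappings, and emit inferred
# canonical triples. The model file name and namespaces are assumptions.
import sys
from rdflib import Graph
from rdflib.namespace import RDFS

# Load the integration model once per mapper task (the Turtle file can be
# shipped with the streaming job, e.g. via the -files option).
model = Graph()
model.parse("integration-model.ttl", format="turtle")
canonical_for = dict(model.subject_objects(RDFS.subPropertyOf))

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Each input line is assumed to be a single N-Triples statement.
    statement = Graph()
    statement.parse(data=line, format="nt")
    for s, p, o in statement:
        canonical = canonical_for.get(p)
        if canonical is not None:
            # Emit the inferred triple using the canonical property.
            print(f"{s.n3()} {canonical.n3()} {o.n3()} .")
```

With this kind of setup, accommodating a new data source means adding another subPropertyOf statement to the model file rather than editing the mapper code.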

The standards and technology to perform this conversion on individual machines have been around for years. It is great to see how well RDF-related standards and Hadoop can support each other to tackle both old and new problems efficiently and at scale.

Please share any thoughts or questions in the comments.

[1] Resource Description Framework (RDF), World Wide Web Consortium website.
[2] RDF Schema (RDFS) 1.1, World Wide Web Consortium website.
[3] See the blog post “Driving Hadoop Data Integration with Standards-Based Models Instead of Code” by Bob DuCharme, February 2015. It walks through the setup and execution of one of these conversions and then shows how, after a small addition to the model, the system accommodates a broader range of input data with no new Python code or any other scripting beyond the revision of the model.