Data preparation strategies for advanced predictive analytics
Data preparation is an integral step in predictive modeling. Indeed, some practitioners describe data preparation as the most time-consuming and crucial phase in the overall Cross Industry Standard Process for Data Mining (CRISP-DM). More than any other single factor, data quality affects accuracy when executing analytical models.
However, an input data set as retrieved from its source typically contains abnormalities of all kinds, making it unready for analytics. Data preparation transforms the data, enriching it to make it suitable for analytical processing and thus boosting the accuracy of the outcome achieved.
Approaching data preparation from an architectural standpoint
Data preparation can be done in many ways, and at IBM Insight 2015 I will be presenting a session that examines it from an architectural standpoint involving IBM offerings, particularly those dealing with predictive maintenance and quality. Depending on the anomalies in the data and on the overall approach to analytics, data preparation can be accomplished with either analytical tools or specialized extract, transform and load (ETL) tools. If analytical models are executed in the context of an end-to-end transactional system, the same runtime or middleware environment can even perform the data transformations, much as in an enterprise service bus (ESB) approach.
Each approach has its own pros and cons, as well as its own applicability criteria—for instance, analytical tools might not be able to handle large volumes. Our choice of an approach also depends on whether we are dealing with a true big data scenario involving large data sets or merely with conventional data sets of reasonable size and volume.
In a typical big data scenario, the veracity factor accounts for uncertainty in data quality. However, dealing with such a large data set may be a challenge for analytical tools not designed for the purpose. Indeed, we might even have to consider some kind of divide-and-conquer approach. For example, in situations involving Hadoop, MapReduce-style processing can prepare the data for subsequent analytics. In situations not involving Hadoop, we might consider other—proprietary—forms of the same approach. Regardless, though, we take batch-style dataset processing as a given.
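To make the divide-and-conquer idea concrete, here is a minimal sketch in plain Python rather than on Hadoop itself. It assumes the data set is already split into chunks of numeric readings, with `None` marking a missing value, and that filling gaps with the global mean is an acceptable repair; the function names and the sample readings are hypothetical. Each chunk is summarized independently (the map step), the partial summaries are merged (the reduce step), and only then is each chunk repaired.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk as (count, total) of its non-missing values."""
    present = [v for v in chunk if v is not None]
    return (len(present), sum(present))


def combine(a, b):
    """Reduce step: merge two partial (count, total) summaries."""
    return (a[0] + b[0], a[1] + b[1])


def prepare_in_chunks(chunks):
    """Fill missing readings with the global mean derived from per-chunk summaries."""
    count, total = reduce(combine, (map_chunk(c) for c in chunks), (0, 0.0))
    mean = total / count if count else None
    # Each chunk can be repaired independently once the global mean is known.
    return [[mean if v is None else v for v in chunk] for chunk in chunks]


# Hypothetical sensor readings, pre-split into three chunks.
chunks = [[10.0, None, 14.0], [12.0, 16.0], [None, 8.0]]
prepared = prepare_in_chunks(chunks)
```

Because each chunk is reduced to a small summary before anything is combined, no single step needs the whole data set in memory, which is the property that MapReduce-style processing relies on.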
Data preparation is also needed for real-time invocation, as well as for scenarios involving analytics on in-flight data, and it can be accomplished by means of either rules or specific computations operating on an incoming event or piece of data. Ingesting and processing data on the fly for scoring poses different challenges from those of batch data analytics and training. Specifically, streaming, or on-the-fly, analytics is characterized by the rate at which data arrives rather than by its volume, so data preparation needs to be applied only to each incoming event.
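As an illustration of rule-based preparation applied to a single incoming event, consider the following sketch. The event layout (`timestamp`, `temperature`, `unit`), the source timestamp pattern and the three rules are all assumptions made for the example; they are not taken from any particular IBM offering.

```python
from datetime import datetime, timezone


def prepare_event(event):
    """Apply simple preparation rules to one event; return None if it is unusable."""
    # Rule 1: discard events that carry no reading at all.
    if event.get("temperature") is None:
        return None

    prepared = dict(event)

    # Rule 2: normalize the assumed day/month/year pattern to ISO 8601 in UTC.
    ts = datetime.strptime(event["timestamp"], "%d/%m/%Y %H:%M:%S")
    prepared["timestamp"] = ts.replace(tzinfo=timezone.utc).isoformat()

    # Rule 3: convert Fahrenheit readings to Celsius so the model sees one unit.
    if event.get("unit") == "F":
        prepared["temperature"] = round((event["temperature"] - 32) * 5 / 9, 2)
        prepared["unit"] = "C"

    return prepared


# Hypothetical incoming event, prepared just before scoring.
event = {"timestamp": "25/10/2015 08:30:00", "temperature": 212.0, "unit": "F"}
scoring_input = prepare_event(event)
```

Each rule looks only at the event in hand, so the function can run inside whatever runtime receives the event, with no access to historical data.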
Resolving outliers and creating features are among the more complex steps of data preparation. Resolving outliers requires statistical processing rather than simple data transformations such as removing nulls and fixing timestamp patterns, and thus is ideally done as part of exploratory data analysis, which should be treated as part of data preparation when necessary. Such preparation cannot be done with simple transformation tools such as ETL or ESB mediations.
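The sketch below shows why outlier resolution needs statistics computed over the whole data set rather than a record-by-record transformation. It assumes one common convention, capping values that fall outside 1.5 times the interquartile range; the text does not prescribe a particular method, and the function name and sample readings are hypothetical.

```python
import statistics


def cap_outliers(values, k=1.5):
    """Cap values lying outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest fence."""
    # The quartiles depend on the entire sample, not on any single record.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical readings with one obviously anomalous value (95.0).
readings = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 12.0]
cleaned = cap_outliers(readings)
```

Whether to cap, remove or keep an outlier is a judgment made during exploratory data analysis; the code only automates the decision once it has been made.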
In my Insight 2015 presentation, I will share architectural perspectives on doing data preparation in a variety of ways, highlighting IBM Predictive Maintenance and Quality and focusing on the considerations that apply to each approach. In addition, I will discuss examples drawn from customer engagements. Register to attend IBM Insight 2015, 25–29 October in Las Vegas, and then use IBM Analytics to accelerate your career journey into advanced analytics and data science.