Ensembles to Boost Machine Learning Effectiveness

Ensemble-based crowdsourcing can identify best-fit models with confidence

Big Data Evangelist, IBM

Advanced analytics is all about using statistical analysis to find nonobvious patterns in the data. But, for the data scientist, it's never 100 percent obvious that any particular machine-learning model has found the precise pattern for which you're searching.

However, if multiple independent models—all using the same data set but trained on different samples, employing different algorithms, and calling out different variables—converge on a common statistical pattern, you can have greater confidence in what they're collectively revealing. Until the alternative models converge with sufficient confidence, the best-fit model for any given analytics problem domain is not entirely clear.

That confidence challenge is essentially the rationale for what’s called ensemble learning in statistical analysis. Ensemble learning is a well-established data science methodology that can be characterized as crowdsourcing for machines.1 Fundamentally, this approach requires a supervised learning algorithm known as an ensemble, which brokers convergence among the results of multiple independent models. The ensemble algorithm accomplishes this brokering by training on the result sets of the independent models and using averaging, bagging, boosting, voting, and other convergence functions to reduce variance among the patterns revealed in the various constituent models.2
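Of the convergence functions mentioned above, majority voting is the simplest to illustrate. The following is a minimal sketch in pure Python; the three constituent models are hypothetical stand-ins for independently trained models (different samples, different algorithms, different variables), not any particular library's API:

```python
from collections import Counter

# Hypothetical constituent models: each maps a feature vector [x0, x1]
# to a class label. In practice these would be independently trained
# models, each potentially using different algorithms and variables.
def model_a(x):
    return 1 if x[0] > 0.5 else 0

def model_b(x):
    return 1 if x[1] > 0.5 else 0

def model_c(x):
    return 1 if (x[0] + x[1]) / 2 > 0.5 else 0

def vote(models, x):
    """Majority-vote convergence function: the ensemble's prediction
    is the label that most of the constituent models agree on."""
    labels = [m(x) for m in models]
    return Counter(labels).most_common(1)[0][0]

models = [model_a, model_b, model_c]
print(vote(models, [0.9, 0.8]))  # all three models agree -> 1
print(vote(models, [0.9, 0.1]))  # models disagree; the majority decides -> 0
```

Averaging works analogously on numeric predictions, while bagging and boosting additionally control how the constituent models are trained (resampled data, reweighted errors) before their outputs are combined.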

Core operational responsibility

Ensemble-based crowdsourcing for machines has many practical applications. Next-best action—the heart of decision automation and recommendation engines—rides on the best-fit model.3 Quite often, so do real-world experimentation and A/B testing. Notably, Kaggle competitions have been won by ensembles of independent decision-tree models.4 And then there are the computational sciences—for example, physics, econometrics, and so on—in which ensemble methods support independent verification of findings across distinct models developed by different researchers using different algorithms and approaches.5

Building and tweaking the ensemble models that determine best fit is a core responsibility of operational data scientists. For every scenario being optimized through this approach, there should be two or more alternative independent models, with an inline ensemble model brokering among them. With ensemble-based machine learning, there is always a champion—best-fit—model for each application scenario in production. In addition, there are one or more challenger models ready to be promoted to production if the champion's predictive power decays, as determined by the controlling ensemble model.

To help determine the best-fit model at any time, the ensemble should continue to score the result sets of all these models—champion and challenger(s)—against continuous feeds of fresh information from applications, enterprise data warehouses, Apache Hadoop clusters, and other sources.

Crowdsourced-identified expertise

But in any scenario, the best-fit model may be in the head of a human expert who's evaluating the results of ensemble models and of the current champion models they've promoted to production status. That situation ties into the critical notion of the next-best expert.6 Where machine learning is concerned, crowdsourcing should incorporate some means for identifying the next-best expert—whether that expert is a human, machine, and/or specific machine-learning model—suited to the analytic challenge at hand. It all comes down to who or what can deliver the best and fastest available answers to the big data analytics challenges.

Please share any thoughts or questions in the comments.

1. "A Thumbnail History of Ensemble Methods," by Mike Bowles, Revolutions blog, March 2014.
2. "Ensemble Methods in Machine Learning," by Thomas G. Dietterich, Oregon State University, Corvallis, Oregon, fall 2007.
3. "Next-Best Action Rides the Best-Fit Model," by James Kobielus, The Big Data & Analytics Hub blog, November 2012.
4. "Crowdsourcing Big Data Creativity? Smarts Trump Big Data, Sophisticated Algorithms, and Domain Expertise (Part 2)," LinkedIn, Big Data Integration group, October 2013.
5. "Data Scientist Skill Sets? Robust Replication Across Independent Data and Scientists," LinkedIn, Big Data Integration group, July 2013.
6. "Next-Best Expert: Collaboration of People and Machines on Big Data and Analytics," by James Kobielus, The Big Data & Analytics Hub blog, January 2014.