Hadoop: The open refinery of 21st century data science
Some people might say that Apache Hadoop is showing its age. Now that Apache Spark has eclipsed Hadoop on the next-big-thing front, the Hadoop community may appear to be groping for relevance in this new era of in-memory, streaming, real-time big data analytics. But that perception could not be further from the truth.
As noted in a blog covering day three of the recent Hadoop Summit 2015 event in San Jose, California, the Hadoop market is thriving and innovation remains strong. Moreover, many Spark deployments depend on core Hadoop infrastructure for data discovery, preparation, enhancement, storage and governance. Essentially, Hadoop is the open source information refinery platform upon which the data science revolution depends.
Looking ahead to the Strata+Hadoop World event in New York, New York, the conference agenda and the expected turnout clearly show that Hadoop remains the foundational platform of modern data science. Many professional data scientists rely on Strata training sessions to refresh their skills in this rapidly changing discipline. Not surprisingly, Strata training sessions such as “Practical Data Science on Hadoop” presented by IBM colleagues Chris Fregly, Mokhtar Kandil, Brandon MacKenzie, John Rollins and Jacques Roy were sold out weeks in advance.
Data science platform
In the 21st century, Hadoop is the foundation of the open source analytics operating systems that Beth Smith, general manager of analytics platforms at IBM, discussed at the recent Spark Summit 2015 event. Data scientists may be doing more of their iterative data science modeling in Spark, but the bulk of the data science workflow is expected to continue to depend intimately on access to Hadoop clusters.
No matter what front-end data science workbench they employ, developers can continue to tap into unstructured information from Hadoop-based data lakes. Data scientists won’t begin to build their Spark models until the foundation data has been collected, staged, cleansed, enhanced and otherwise prepared in Hadoop-based information refineries. They won’t trust the quality of that data unless it’s being managed in Hadoop-based data governance platforms. And they’ll store much of that data in Hadoop-based archives in public, private and hybrid cloud-based storage.
Extract-transform-load (ETL) operations on unstructured data have long been a sweet-spot application for Hadoop. As discussed in a LinkedIn blog on unstructured ETL, Hadoop has always been as much a data integration technology as it is a platform for doing high-powered data science. Large-scale data munging—converting or mapping data from one form into another that is easier to consume—is something hardly ever seen in a Spark project unless a massively parallel Hadoop platform is cranking away back in the cloud.
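To make the munging step concrete, here is a minimal sketch of the kind of mapper a Hadoop Streaming job might run to turn raw, semi-structured log lines into clean tab-separated records. The log format, field names and output schema are illustrative assumptions, not taken from any specific deployment described in this article:

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper for unstructured ETL.

Assumes input lines are Apache-style access-log entries; the pattern
and output schema below are hypothetical examples of data munging.
"""
import re
import sys

# Hypothetical pattern for an Apache-style access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def munge(line):
    """Map one raw log line to a clean tab-separated record, or None."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None  # discard lines that don't parse
    method = m.group("req").split(" ", 1)[0] or "-"
    return "\t".join([m.group("ip"), m.group("ts"), method, m.group("status")])

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin, one record per line.
    for raw in sys.stdin:
        record = munge(raw.rstrip("\n"))
        if record is not None:
            print(record)
```

A job like this would typically be launched with the `hadoop jar hadoop-streaming.jar` wrapper, with the cleaned records landing back in HDFS for downstream Spark modeling.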
When deployed as a data integration layer, Hadoop platforms such as IBM InfoSphere BigInsights execute MapReduce models for unstructured ETL, while also managing the bilateral movement of data sets between source repositories and downstream databases and applications. In support of these functions, you can expect to see next-generation Hadoop solutions that offer the following features:
- Enhanced multistructured data quality through metadata management, data profiling capabilities and massively scalable, shared-nothing, in-memory and real-time data integration
- Support for defining policies and rules that pinpoint multistructured data of keen interest to data science application development projects
- An automated process for ensuring that multistructured data is being used correctly
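The data-profiling capability in the first item above can be illustrated with a short sketch: a pass over a batch of records that measures per-field completeness and flags fields falling below a threshold. The field names, threshold and record layout here are hypothetical, not drawn from BigInsights or any other product:

```python
"""Illustrative data-profiling pass of the kind a Hadoop data-quality
layer might run over a batch of ingested records."""
from collections import Counter

def profile(records, fields):
    """Return per-field completeness (fraction of non-missing values)."""
    missing = Counter()
    for rec in records:
        for f in fields:
            # Treat absent keys, empty strings and literal NULLs as missing.
            if rec.get(f) in (None, "", "NULL"):
                missing[f] += 1
    total = len(records)
    return {f: 1.0 - missing[f] / total for f in fields} if total else {}

# Usage: flag fields whose completeness falls below a (hypothetical) 90% bar.
batch = [
    {"ip": "10.0.0.1", "status": "200"},
    {"ip": "", "status": "404"},
    {"ip": "10.0.0.2", "status": None},
]
report = profile(batch, ["ip", "status"])
flagged = [f for f, c in report.items() if c < 0.9]
```

In a production refinery this kind of check would run at massive scale across the cluster, but the shape of the rule—measure, compare to policy, flag—is the same.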
Responsibility for managing these back-end, data-munging capabilities is expected to migrate to a new generation of Hadoop data engineers. Without these professionals’ steadfast attention to this critical platform and function, the data science workflow would grind to a halt, lacking the quality data that developers of Spark and other analytics models need to extract mind-blowing insights.
And on that note, be sure to see the upcoming keynote by Jeff Jonas at Strata+Hadoop World in New York, New York. Jonas immediately comes to mind when the discussion turns to visionaries and their understanding of how Hadoop-based data science is driving innovation in the 21st century.