Architecting Big Data Solutions, Part I: Hadoop and IBM Netezza
Big data is increasingly becoming part of the enterprise IT vernacular, and we are seeing it move rapidly through the hype cycle as a viable value-creation opportunity for enterprises. In a recent report, McKinsey estimated this value to be on the order of billions of dollars and deemed it the "next frontier for innovation, competition and productivity".
The unprecedented growth and availability of data across a diverse set of channels, and the competitive advantage that organizations gain from harnessing that data, are the key factors driving big data adoption.
A recent TDWI report on big data analytics summarized findings across 325 respondents representing diverse industries, geographies and company sizes. About 70% of the respondents considered big data an opportunity for their respective organizations.
About two thirds of the respondents reported using their current EDW for managing and operating big data for advanced analytics, and said they would prefer to continue using their EDW for all their big data needs in the future. About a quarter of the respondents indicated that they are currently using Hadoop in some capacity, and a third of them suggested that it would be a prominent component of their big data analytics platform in the future.
Based on these survey results, one might conclude that organizations must make an either-or choice between their RDBMS-based EDWs and Hadoop for managing and performing big data analytics. That is not necessarily the case. In practice, we find that organizations are expanding their EDWs to accommodate Hadoop for a subset of the workload.
This was corroborated by a recent Ventana research study indicating that Hadoop is generally additive and is "supplementing other established technologies, with RDBMSs still the dominant technology being used or planned to be used by more than nine out of ten organizations". We have also seen evidence from Yahoo and Facebook, two of the biggest proponents of Hadoop, of their use of relational data warehouses in conjunction with their Hadoop clusters for big data analytics.
Hadoop's ability to run on commodity servers, store a broad range of data types, process analytic queries via MapReduce and scale predictably with increasing data volumes makes it very attractive for big data analytics. RDBMS-based EDW solutions, such as Netezza appliances, enable low-latency access to high volumes of data, provide data retrieval via SQL, integrate with a wide variety of enterprise BI and ETL tools, and are optimized for price/performance across a diverse set of workloads. Organizations that architect their big data platforms by integrating the two technologies can take advantage of the best of both worlds. Some typical examples of where one would use Hadoop within an EDW context are –
- Exploratory analysis – Enterprises regularly encounter new sources of data that need to be analyzed. Say, for example, your marketing department launched a new multi-channel campaign and wants to integrate user responses from Facebook and Twitter with other data sources it may have. If you haven’t used the Facebook and Twitter APIs or are not familiar with their data feed structures, it might take some experimentation to figure out what to extract from those feeds and how to integrate it with other sources. Hadoop’s ability to process data feeds for which a schema has not yet been defined makes it an excellent tool for this purpose. If you want to explore relationships within data, especially in an environment where the schema is constantly evolving, Hadoop provides a mechanism to explore the data until a formal, repeatable ETL process is defined.
- Queryable archive – Big data analytics tends to bring large volumes of data under its purview. Often, a significant percentage of this data is not accessed on a regular basis. It may contain historical information or granular detail that has subsequently been summarized within the EDW. Keeping all of this data on infrastructure optimized primarily for query performance may not be economically viable. One may instead want to store the less frequently accessed information on infrastructure optimized for price per terabyte of storage and move it to the high-performance infrastructure on demand. Hadoop’s fault-tolerant distributed storage system, which runs on commodity hardware, can serve as a repository for that information. Unlike tape-based storage systems, which have no computational capability, Hadoop provides a mechanism to access and analyze the data in place. Since moving computation is cheaper than moving data, Hadoop’s architecture is well suited to serve as a queryable archive for big data.
- Unstructured data analysis – Recent studies suggest that the amount of unstructured data enterprises must analyze is growing rapidly and in some situations could soon outpace the amount of structured data. Common examples of unstructured data analysis are gleaning user sentiment from a company’s Twitter feed or extracting insights embedded in customers’ phone conversations with support personnel. RDBMS-based data warehouses provide limited capabilities for storing the complex data types that unstructured data represents, and performing computations on unstructured data via SQL can be quite cumbersome. Hadoop’s ability to store data in any format and analyze it using a procedural programming paradigm, such as MapReduce, makes it well suited for storing, managing and processing unstructured data. In an EDW context, one can use Hadoop to pre-process unstructured data, extract key features and metadata, and load the results into an RDBMS data warehouse for further analysis.
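To make the last use case concrete, here is a minimal sketch of the kind of MapReduce-style pre-processing described above: a mapper extracts hashtags from raw, schema-less tweet JSON, and a reducer aggregates them into structured (hashtag, count) rows that could then be bulk-loaded into an RDBMS warehouse. The feed format and field names are illustrative assumptions, not an actual API contract; on a real cluster this mapper/reducer pair would run under Hadoop Streaming rather than in a single process.

```python
# Illustrative sketch only: a MapReduce-style pipeline that turns raw,
# schema-less tweet JSON into structured (hashtag, count) rows suitable
# for loading into an RDBMS data warehouse. The input format is assumed.
import json
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")

def map_tweet(line):
    """Map step: emit (hashtag, 1) pairs from one raw JSON record."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return  # schema-less feeds often contain malformed records; drop them
    for tag in HASHTAG.findall(tweet.get("text", "")):
        yield tag.lower(), 1

def reduce_counts(pairs):
    """Reduce step: aggregate mapped pairs into per-hashtag totals."""
    totals = defaultdict(int)
    for tag, n in pairs:
        totals[tag] += n
    return dict(totals)

if __name__ == "__main__":
    raw_feed = [
        '{"text": "Loving the new release! #bigdata #hadoop"}',
        '{"text": "Scaling out with #Hadoop on commodity servers"}',
        "not valid json",  # bad record, silently skipped by the mapper
    ]
    pairs = (pair for line in raw_feed for pair in map_tweet(line))
    print(reduce_counts(pairs))  # → {'bigdata': 1, 'hadoop': 2}
```

The same shape applies to the other use cases: because the mapper tolerates malformed records and imposes no upfront schema, one can iterate on the extraction logic during exploration, then promote the stable version into a formal ETL step.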
We discussed some of these use cases in more detail in a previous blog post. As the big data space matures, multiple technology vendors are making pertinent announcements. Many RDBMS-based data warehousing vendors, including Netezza, offer connectivity solutions to Hadoop systems, and multiple Hadoop distributions and solutions are emerging in the marketplace. However, assuming that a single point solution will address all aspects of big data analytics is a fallacy.
Big data analytics is a computational discipline, and one needs to skillfully architect multiple technologies to meet its broad objectives. It is disruptive in nature and poses architectural challenges to IT organizations similar in scale to those posed by SOA in the late ’90s and cloud computing over the last decade. Organizations that overcome those challenges and use the right set of technologies for big data analytics will win.
Learn more at HadoopWorld 2011 when Krishnan speaks with Edmunds.com during the Hadoop and Netezza Deployment Models session http://bit.ly/r0YNqT