Two things before I begin:
- I’ll begin this posting with a call for inputs. Below I will list a few of the most common Hadoop/Netezza co-existence deployment patterns we have seen to date. But I would like to hear from others. As you see the continuing deployment of Hadoop in the enterprise and as the Second Wave of TwinFin™ comes on with the advanced analytics capabilities of i-Class, how do you see the evolving deployment patterns happening in your environment?
- A special hat-tip to Krishnan Parasuraman, Netezza’s Chief Architect for our Digital Media group, for his excellent help in aiding and abetting this post! I have used his guidance gratefully and (with his permission) stolen freely from some of his inputs.
You may have noticed a partnership announcement made by Cloudera and Netezza late last week. Together with Cloudera, Netezza will open up data movement and transformation between Cloudera’s Distribution for Hadoop and the Netezza family of appliances, enabling applications and data flows that integrate the two systems. We expect that our partnership with Cloudera, together with the Hadoop support in Netezza’s i-Class™ set of advanced analytics capabilities included in the upcoming 6.0 software release, will lead to some very innovative and expansive applications for our customers and for both companies.
Even today, Netezza customers are doing some very interesting things with deployment of Hadoop and our TwinFin data warehouse appliance. Far from being the “Hadoop v. SQL” battle that some people might like to make the current market out to be, we have instead noticed a growing number of “co-existence” deployment strategies and design patterns already at work with our customers – particularly among customers in the “Digital Media” vertical market.
These types of strategies can play to the strengths of both technologies and roughly break down into two categories: 1) the use of a Hadoop Cluster for data ingestion, which I’ll write about in further detail today; and 2) using a Hadoop Cluster for long-term data retention, or as a “queryable archive,” for which I’ll go into further detail in a post later this week.
Using a Hadoop Cluster for Raw Data Ingestion
The use of a Hadoop Cluster as the engine for data ingestion is the most common “co-existence” pattern we see in our customers’ mutual deployments of Hadoop and Netezza. The deployment pattern typically arises when the customer has hit specific performance and processing throughput scalability limitations with their existing Data Integration or ETL implementation.
Raw weblog data is the primary data source for most Digital Media analytics and reporting requirements. Weblogs are data-rich (e.g., page views, impressions, click-throughs and demographics collected from application servers). They are typically semi-structured, and are collected and stored in flat files.
There are some critical facts about weblogs that present real performance challenges in processing them:
- sheer volume: millions of rows of weblog data collected throughout the day and loaded daily into the data warehouse;
- complex query processing: parsing and decoding encoded character strings requires text-processing, pattern-matching and tokenizing capabilities within the ETL process;
- non-conformed dimensions: page views or impression data are defined and represented differently by various systems, so fitting them into conformed dimensions is another very common data ingestion and processing challenge.
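To make the parsing challenge concrete, here is a minimal sketch of the kind of tokenizing and decoding work involved. The tab-delimited record layout and field names below are illustrative assumptions, not any particular customer's log format:

```python
from urllib.parse import parse_qs, unquote_plus

def parse_weblog_line(line):
    """Split a tab-delimited weblog record and decode its URL and query string.

    Assumes a hypothetical layout: timestamp, page URL, referrer, user agent.
    """
    timestamp, url, referrer, user_agent = line.rstrip("\n").split("\t")
    path, _, query = url.partition("?")
    # parse_qs returns lists per key; keep the first value for simplicity
    params = {k: v[0] for k, v in parse_qs(query).items()}
    return {
        "timestamp": timestamp,
        "page": unquote_plus(path),
        "referrer": referrer,
        "user_agent": user_agent,
        "params": params,
    }

record = parse_weblog_line(
    "2010-09-20T12:00:00\t/landing%20page?campaign=fall&id=42\thttp://example.com\tMozilla/5.0"
)
# record["page"] -> "/landing page"; record["params"] -> {"campaign": "fall", "id": "42"}
```

In a real deployment this per-line logic would run inside the map phase of a Hadoop job so it scales horizontally with log volume.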
There are two common variants of this pattern – one dealing with semi-structured data (e.g., weblogs) and one with unstructured data (e.g., text) – and customers will often have versions of both variants in operation simultaneously.
Semi-structured data ingest via Hadoop
Semi-structured data is parsed (and possibly aggregated as well) in the Hadoop Cluster and then loaded into a TwinFin where the performance and workload scaling of the appliance is important for deeper analysis, higher throughput and faster reporting.
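The parse-and-aggregate step can be sketched as a Hadoop Streaming-style mapper and reducer. This is a hedged illustration, not any customer's pipeline: the page-view metric and the tab-separated field positions are assumptions:

```python
from itertools import groupby

def mapper(lines):
    """Emit (page, 1) for each record; assumes the page is the 2nd tab-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[1], 1

def reducer(pairs):
    """Sum counts per page key.

    Under Hadoop Streaming the framework sorts mapper output by key before the
    reduce phase; the local sort here stands in for that shuffle step.
    """
    for page, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield page, sum(count for _, count in group)

# Chained locally for illustration; in a real deployment each function would be
# its own streaming script reading from stdin and writing to stdout.
sample = [
    "2010-09-20T12:00:01\t/home\t-",
    "2010-09-20T12:00:02\t/products\t-",
    "2010-09-20T12:00:03\t/home\t-",
]
page_views = dict(reducer(mapper(sample)))
# page_views -> {"/home": 2, "/products": 1}
```

The reducer's output – far smaller than the raw logs – is what gets loaded into TwinFin for deeper analysis and reporting.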
Unstructured data ingest via Hadoop
Unstructured data in this pattern is contextualized (classified, mined, keyworded and indexed) in Hadoop and then moved into a Netezza TwinFin appliance for the low-latency, high-performance analytics used to drive business decisions.
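In its simplest form, the contextualization step might look like the keyword-indexing sketch below. The stopword list, length threshold and function name are placeholders; a production job would use a fuller lexicon and run as a MapReduce job across the whole document corpus:

```python
import re
from collections import Counter

# Placeholder stopword list; a real mining job would use a much fuller lexicon.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}

def keyword_index(doc_id, text, top_n=5):
    """Return a document id and its most frequent non-stopword terms.

    A stand-in for the classification/mining/indexing step described above.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return doc_id, [word for word, _ in counts.most_common(top_n)]

doc, keywords = keyword_index(
    "doc-17",
    "Appliance buyers compare appliance performance; performance drives appliance choice.",
)
# "appliance" (3 occurrences) tops the list, followed by "performance" (2)
```

The resulting (document, keywords) pairs are the kind of structured output that can then be loaded into TwinFin for low-latency analytics.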
A Hadoop Cluster provides a scalable ingestion mechanism that is well suited to addressing the challenges described above. The Cluster can be incrementally scaled to handle ingesting the massive volumes of weblog data, and it can support text processing and other complex data processing through programming languages such as Java or Python. [Note that with the coming i-Class set of analytics functionality, the programmability and some of the complex data processing may also be possible on the TwinFin, depending on a customer’s application needs or preferences.]
Following the data ingest steps, processed weblog information is brought into TwinFin as atomic event information or as summarized tables, depending on the size of the appliance and analytic maturity & scale of the organization where it is deployed. A typical deployment might look like the following diagram:
Some of our customers use an alternate, far less common, variant of the above co-existence pattern: an external elastic MapReduce cloud (such as the Amazon cloud) for data ingestion.
In cases where a customer has its application servers in Amazon’s EC2 environment, it may also choose to use Amazon’s S3 web services for retaining weblog data. In that case, Amazon would provide the elastic MapReduce infrastructure for the data ingest process into the TwinFin appliance. This alternative deployment scenario would look something like the following:
The bottom line is that the different strengths of TwinFin and Hadoop lend themselves to complementary deployments – and some of our customers have already discovered innovative ways to leverage them together to maximize the value of both their investments.
In my next post, I’ll discuss the second pattern we’re noticing: one in which Netezza customers are using the Hadoop Cluster for long-term data retention.