Hadoop is fundamental to the future of big data. Users are adopting Hadoop for strategic roles in their current data warehousing architectures, such as extract/transform/load (ETL), data staging and preprocessing of unstructured content. Hadoop is also a key technology in next-generation massively parallel data warehouses in the cloud, which will complement both today’s data warehousing technologies and low-latency stream-computing platforms.
On Wednesday, April 10, I participated in an IBM Twitter chat with analysts, influencers, thought leaders, fellow IBMers and others on this very hot topic. The event took place from 12-1pm EDT and used the hashtag #bigdatamgmt.
Here are highlights from the discussion. Note that I’ve edited the questions that the moderator (the all-knowing @IBMBigData) asked; correlated and edited participants’ tweeted responses for legibility; smoothed out some of the most mystifying Twitterese; and concatenated some participants’ individual tweets through ellipsis marks in order to illuminate the whole thought they’d chopped up into 140-character bon mots.
Here now are the high points.
What are the core use cases for Hadoop?
We covered a wide range of use cases in a crisp cluster of detailed tweets.
I spelled out the chief Hadoop use cases: “#bigdata refining/staging, exploration/sandboxing, & unstructured analytics....#Hadoop is for any massively scalable advanced analytics & data integration over unstructured content with in-db execution....#hadoop supports the full range of advanced analytics, from structured/data mining to unstructured/text analytics....Many orgs using #hadoop for hi-performance unstructured ETL. Some now using for next-gen BI.”
Fellow IBMer Leon Katsnelson (@katsnelson) noted that “one thing that we see a lot of interest in Hadoop is for data warehouse augmentation....also see a lot of 360 view of the customer type of usecases..” IBMer Christy Maver (@cdmaver) observed that Hadoop is used for “data warehouse augmentation....can mean archiving cold data, Hadoop as landing zone, exploratory analysis.” She also said that a key Hadoop use case is “combining big data with enterprise data to increase operational efficiency.”
Jeff Kelly (@jeffreyfkelly), a technology market analyst covering big data and business analytics for The Wikibon Project, said Hadoop use cases include “deep, historical analysis of large volumes of multi-structured data....Not as sexy as analytics, but Hadoop used for data warehouse offloading and archiving “cool” or “cold” data.” He also noted that Hadoop is used as a “data ‘lake,’ ETL layer to serve as source for downstream data marts and applications.”
Jeff’s colleague John Furrier (@furrier) said Hadoop is optimized for “batch jobs but moving fast to custom fit real time streaming data processing...cio’s i talk to are doing POC with hadoop for use cases never b4 imagined - i call it “loose data” - telematics data.....big data is about “fast and loose” data.. this is the mobile real time world..value is the competitive adv for firms who get it.”
Richard R. Lee (@InfoMgmtExec), an information governance and risk management professional, said “I like Hadoop for integrating & analyzing streams of disparate data types varying in structure to have that “composite view”....Hadoop helps many EDW’s to have true 360 degree view by adding that unstructured component into structured world.”
Will Hadoop complement or replace existing legacy data warehousing and BI tech?
We all agreed that Hadoop primarily complements existing data warehousing and BI investments.
Jeff Kelly said “#hadoop very much complementary to DW and business intelligence - nobody’s (or very few) ditching RDBMS completely....#hadoop actually helps you get more return on your EDW investment by bringing in new data sources....RDBMS entrenched in the enterprise, not going anywhere soon.”
Richard R. Lee said “Hadoop is part of the evolution of the traditional EDW & BI. Never meant to replace IMO. Fully complementary.... This is not your Daddy’s EDW...A logical evolution.”
I discussed the extent to which Hadoop and EDW will co-evolve, while noting the use cases for which EDW is still preferred. “Hadoop is complementing today’s EDW, serving as ETL and sandboxing layer. Hadoop gradually moving into EDW use cases, eg, BI....Over time, EDW natively incorporating support for more #Hadoop techs, especially MapReduce and HDFS.....The use-case overlaps between #Hadoop and EDW are extensive and growing. But Hdp not yet suitable for governance/DQ & MDM.”
Leon Katsnelson used an historical analogy to discuss how Hadoop and EDW will co-exist for the foreseeable future: “rise of the automobile and the airplane did not kill the train. We are going to see data warehouses flourish with Hadoop....lots of things overlap it does not mean they replace each other. data warehouses and hadoop will be used together.”
What complementary tools, apps, skillsets help get the most value from Hadoop?
I discussed what you need to maximize the return on your Hadoop investment: “#Hadoop requires complementary tools/skills in modeling, visualization, discovery, integration, performance optimization....requires complementary connectors to EDWs, in-memory dbs and other #bigdata platforms....requires complementary model/algorithm libraries--MapReduce etc.--to enable fast modeling & development...requires data science skillsets in its developers. And, as range of MapReduce models grows, needs model governance....needs a complementary SQL access/abstraction layer to support easy development for today’s DB app developers.”
Richard R. Lee stressed that with “so many tools & platforms to choose from [you should] be pragmatic with focus on Production, not just Experimentation.”
Jeff Kelly said you should place priority on “any tools that make it easier (via GUI, drag&drop, etc.) to deploy, administer, manage #Hadoop.” In terms of skill requirements, he said “beyond tech skills u need willingness to fail, curiosity, perseverance to get most from #hadoop analytics....as much as is required but don’t want to stifle innovation in early stages.”
Leon Katsnelson said “I find that data people (eg. DBAs) have a hard time understanding Hadoop because it is more app developer centric....the reality of Hadoop so far has been that it was usable by the select few. Things like SQL for Hadoop are changing this.” He also pointed to an IBM offering that addresses this need: “we have over 75K people learning Hadoop at http://BigDataUniversity.com.”
Christy Maver said “lack of skills almost as big an issue as perceived lack of skills. Entirely new skillset not required for Hadoop.....As technology evolves (SQL for Hadoop), it invites broader audience to play with Hadoop, using existing skills.”
IBM DB2 (@IBM_DB2) said existing DBA skills are an important foundation. “Many of the skills our database experts currently have will be useful to get gold from data.”
How much #governance should you apply to data in Hadoop clusters?
John Furrier boldly asserted that “governance is kryptonite for big data at this stage of the game..let the flowers bloom..creativity & coding hence POCs.”
Jeff Kelly said “as #hadoop moves into production, must align with core governance policies.”
Leon Katsnelson stated that “Hadoop comes from environment where security and governance were not important. Reason to use enterprise-ready Hadoop distro....if your data demands protection and governance than you need it to demand it from Hadoop.”
Considering the predominant Hadoop deployments in big-data staging and sandboxing tiers, I said that “not much gov needed--YET--for #Hadoop data, which NOT YET used for reference data. Social, location, event, etc is gov-lite.... Governance not required for, say, #hadoop customer sentiment data, which is ephemeral, and is aggregated 4 patterns/trends.....EDW is where system-of-record (governance-heavy) data still largely resides. #hadoop is where non-SoR data explored/prepped....But #Hadoop can process structured/SoR data--customer, finance, etc. If it does, the usual governance/security applies.”
IBMer David Pittman (@TheSocialPitt) echoed that sentiment. “Hadoop or no, governance still needed for data trust and privacy concerns.”
Richard R. Lee said that in the broader context, data governance should be big-data-platform agnostic. “Governance requirements do not change based upon structure of information. Must be Trusted and Managed over long lifecycle....Governance key to Analytics success. Information feeds Analytics process, regardless of source. Must Govern consistently....Governance is a discipline and cultural belief set. Technology underpins it. Big Data must be part of this Community.”
Taking the discussion up to an even broader perspective beyond data governance, I stated that “governance of #Hadoop MapReduce models (version, access etc.) important for #datascience center o excellence best practice.”
How do I ensure Hadoop compliance with existing security, regulatory policies?
I placed the emphasis on top-down planning and design of Hadoop deployments for security and compliance. “#Hadoop compliance with security/reg policies? Start by involving chief info security & compliance officers in planning....Ensuring #Hadoop compliance with security/regs demands regular audits, which require auditing/assessment tools....#Hadoop compliance requires assess degree to which data in clusters more or less security/privacy-sensitive than EDW data.”
Jeff Kelly said the bottom line is “if you can’t measure/monitor #Hadoop compliance w existing security controls, you’re not in compliance....also once you begin to merge and analyze disparate data sets, resulting insights/data may be very sensitive.”
Richard R. Lee said the downside of Hadoop non-compliance might be considerable. “Privacy, Security & Compliance part of Governance Community. Hadoop must play well in this sandbox in order to add value....The specter of a Privacy or Compliance breach based upon Hadoop or other Streaming solutions will stifle all progress.”
Leon Katsnelson said that Hadoop compliance-risk mitigation might require that you “choose data sets that don’t demand strict compliance.” But he stressed the downside of overzealous compliance. “Blind following of the policy based approach DOES kill innovation. Critical thinking must prevail....compliance is often used as an excuse to block change. Must not allow this to happen.”
What are greatest challenges in managing your Hadoop cluster?
David Pittman stated that “a very big challenge is lack of know-how to manage #hadoop.”
Leon Katsnelson focused on the Hadoop skills gap as a sticky issue for most users. “One of the big challenges I see is amount of resources that are needed especially in the face of uncertain outcome....finding people to manage Hadoop is the biggest challenge ....judging by the number of recruiters calling, there aren’t enough people available to run Hadoop....@IBMbigdata in many customer calls I get to meet the one and only Hadoop person the company has.”
Christy Maver observed that this skills shortage is a big risk factor for users. “Challenge for many is having only one or two #Hadoop experts. What happens when they jump ship?”
I provided a detailed discussion of the chief Hadoop cluster management challenges. “Greatest challenges in managing #hadoop cluster include managing software, hardware, databases, & data with unified tooling.... Big challenge is in managing “hybrid” #bigdata deploys that involve #Hadoop plus, say, EDW, in-mem, & NoSQL, in unified way....Big issue is enhance skills o DBAs, or hire outside? If outside, skills in short supply. Expensive... Tuning queries on your #Hadoop clusters is a challenge. You may need to select an optimal db layer--Hbase, say, vs. HDFS.”
I also noted that “#Hadoop complexity can be daunting mgt challenge” and that “‘appliantization’ of Hadoop helps streamline to crisp core Apache distro.” With that in mind, I stated: “To extent U have #hadoop cluster built from expert integrated systems with comprehensive mgmt tooling, you’re on right path.”
And I pointed to the challenges of managing “hybrid” big-data infrastructures that involve Hadoop and other platforms. “Hybrid #bigdata deploys require common mgt tools, common metadata, common virtualization layer, etc. Tall order to fill.”
Jeff Kelly noted that “among biggest #Hadoop challenge is workload mgt, ensuring jobs running optimally with tool set not that easy to use.” Also, he said “identifying bottlenecks in a sprawling #hadoop cluster is a challenge.”
Richard R. Lee said the biggest challenges are “‘moving away from Batch to near Real Time’ and ‘how to Production-ize easily in traditional IT?’”
How do I ensure Hadoop doesn’t become just another data silo?
The urgency to avoid Hadoop siloes is not significant yet, said Leon Katsnelson. “Silo is not a concern at this time. Will be in the future but now it is all about finding good usecases.”
I stated that you can “ensure #Hadoop doesn’t become another data silo by standardizing on core vendor platform & evolving from a nucleus deploy...Must understand the fit-for-purpose profiles of #Hadoop & other #bigdata platforms to deploy each optimally, avoid siloing”
Dave Vellante (@dvellante), CEO and co-founder of @Wikibon, said avoiding Hadoop siloes involves starting with “a so-called data architecture that accommodates different data types, internal, external, etc....it’s cool to experiment but ultimately some serious planning needs to be done.” However, he said he is “fearful that bespoke biz initiatives will promulgate silo-ism, not solve it.”
Jeff Kelly said enterprise Hadoop technologists should “bring business into the conversation early, focus on solving a business problem w #hadoop - not some backroom experiment.” He stated that they “must develop a flexible architecture, ability to connect data sources as biz requirements demand.” He also noted that “it’s also a culture issue - need to get away from data ‘hoarding’ by departments, groups, even individuals.”
Richard R. Lee said “Hadoop must be part of ‘EDW Landscape’. Not a stand-alone repository or it will lose relevance. Make it Mainstream!....I counsel clients to have long-term Information Architecture Strategy, focused on Business Outcomes from all data sources.”
How should I bring the business side into conversations about Hadoop?
Jeff Kelly stated that the “best way to get the biz attention - show them how ur competitors are using #hadoop to beat you....need to make #BigData & #hadoop benefits real to the biz - highlight how it will drive revenue, lower costs, etc. talk $$$.”
Richard R. Lee said you should involve the business stakeholders “early & often” in Hadoop initiatives. “Business Leaders know that Information is their lifeline and catalyst for success. Hadoop supports ‘3rd leg of the stool.’”
I noted that you should focus user attention on the fact that Hadoop is just one of many technological approaches for realizing business outcomes, and that other big-data approaches might also be suitable. “Users need to know many use cases for #hadoop, much overlap with other #bigdata platforms. Many ways 2 skin #bigdata cat.”
Christy Maver echoed that, saying “by focusing on the outcomes and clearly demonstrating use cases. Technology should be the afterthought.”
Leon Katsnelson said that, to engage users’ attention, IT should “reduce focus on optimizing expense and increase focus on top line revenue. Hadoop is like a cattle prod for that.”
What types of analytics do you expect to use from Hadoop?
I stated that “#Hadoop analytics are any on unstructured data: social, machine, geospatial, event, clickstream, etc. Tons of NLP needed.... #Hadoop is “advanced” analytics, meaning predictive, statistical, machine learning, etc.” In the broader big-data scheme of things, said Richard R. Lee, Hadoop’s chief value is in “adding [the] sorely lacking unstructured dimension.”
Continue the discussion & check out these resources
Clearly, there are plenty of other resources you can and should look at on Hadoop.
- Here’s Nancy Kopp’s article: “If Hadoop Was Easy, Everyone Would Be Doing It.”
- Here are Leon Katsnelson’s articles “Hadoop is better with Big SQL from IBM” and “IBM Big SQL Technology Preview Program.”
- Here are my InformationWeek point-counterpoint on “Big Data Debate: Will Hadoop Become Dominant Platform?”; my IBM Data Management articles “True Hadoop Standards are Essential for Sustaining Industry Momentum” (part 1 and part 2), “Business Intelligence in the Hadoop Era,” “Hadoop: Nucleus of the Next-Generation Big Data Warehouse,” and “Hadoop Cluster Management”; and my IBM Big Data Hub blog: “Hadoop Myths Debunked.”
- Here’s a presentation about IBM Big SQL.
- Here’s Susan Visser’s article: “Introducing the IBM Big SQL Technology Preview.”
- Here’s the link to IBM’s Hadoop training institute.
- And, last but not least, please join the April 30 IBM broadcast: “Big Data at the Speed of Business”
Please engage with us and let’s continue this exciting discussion.