Recap of IBM Twitter chat: Storing Big Data
Big data is intimately bound up with storage infrastructure. Tuning your big data fabric for disparate workloads can, in a worst-case scenario, challenge your database administrators with a complex set of tasks that must be repeated every time new sources and data types are added, new queries and data analytic jobs are developed, new performance and availability requirements are imposed, and so forth.
On Wednesday, April 24, I participated in an IBM Twitter chat with analysts, influencers, thought leaders, fellow IBM-ers, and others on the topic of how you should be storing big data and managing those storage resources. The event took place from 12-1pm EDT and used the hashtag #bigdatamgmt.
Here are highlights from the discussion. Note that I've edited the questions that the moderator (the all-knowing @IBMBigData) asked; correlated and edited participants' tweeted responses for legibility; smoothed out some of the incomprehensible verbal junk that Twitter practically forces us all to use; and concatenated some people's individual tweets through ellipsis marks in order to give the semblance of coherent thoughts emanating from our caffeine-crazed crania. I've left some of the hashtags in the discussion, just to give you a little bit of that Twitter accent. I haven't spelled out all the industry acronyms that we threw into the tweets; the reader should simply Google for those.
Here now are the highlights.
Can one storage technology suffice for all big data?
None of us responded to this question in the blanket affirmative.
I tweeted that "one storage tech DOES NOT suffice for multistructured. Each data type has distinct storage, compress, retrieval reqs....Structured data in batch environments usually uses HDD. More real-time requirements might use SSD. Streaming uses cache."
David Floyer (@dfloyer), a data-center professional, seconded that sentiment. He tweeted that "the types, origin and value of big data are so diverse that one storage technology is unlikely."
Fellow IBM-er David Pittman (@TheSocialPitt) noted the heterogeneity of storage resources in today's IT environment: "There's usually a mix of storage tech in an org - HD, SSD, cloud."
Jeff Kelly (@jeffreyfkelly), technology market analyst covering big data for The Wikibon Project, said big-data storage will require old and new approaches to coexist. "In an ideal world a single storage platform could serve as a foundation of big data deployments - but not an ideal world!...Likely need a mix of storage technologies, incorporating legacy storage with new forms like HDFS."
Richard Lee (@InfoMgmtExec)--executive consultant in governance, risk, compliance, advanced analytics, and business informatics--alluded to a "hierarchy" of storage approaches for complex big-data requirements. "Storage has long been both bottleneck and enabler for Info solutions. Now have a great hierarchy to work with.... DB's ala BLU can take full advantage of new storage hierarchy allowing for real-time Predictive Models in memory."
Avadhoot Patwardhan (@AdvaitaAvi)--whose interests intersect with mine on the big data, social media, and "inner science" front--asked a great question: "Is there a story for the Hadoop Storage Stack?" Yours truly responded: "Hadoop Storage stack? Hadoop has diff underlying stores (HDFS, Hbase, etc.). Different mixes of HDD & SSD optimal."
What is the optimal mix of rotating vs. solid-state storage for big data?
We all recognized that solid-state drive (SSD) storage is coming into big-data environments very rapidly, pushing traditional rotating media, especially hard-disk drives (HDD), ever further to the periphery.
I stated: "Big data thrives on 'fit-for-purpose' storage deployed differentially by functionally differentiated tiers....In a 3-tier big data arch, should put SSD in front-end query/access for high-perf. Should put hi-cap HDD in hub and staging....Rotating disk best for lower-speed, batch I/O. SSD best for real-time, interactive, fast query & exploration. Mix/match 'em."
Richard Lee focused on the need for a shifting hybrid approach to deploying mixed-storage resources: "As a consultant, I would say 'it depends' i.e. Types of data sources, Modeling schema, Analytics workloads, etc....Given sophistication of storage arrays, databases & integration stacks one rarely accesses rotating disk directly....SSD arrays will slowly replace conventional arrays as price & performance improve. Still a hybrid world out there."
Jeff Kelly noted that the optimal blend of storage technologies depends in part on the access, performance, and cost-efficiency requirements of disparate data. "Disk for cool and cold data, SSD for hot data that needs to be accessed frequently ....hopefully with automated and intelligent management to determine the 'temperature' of various data."
David Floyer went even deeper into the topic of which storage technology is optimal at each step in the data life cycle: "Initial landing for all active data should be flash storage, & metadata held in flash. Migrate passive data to magnetic tub....Moving data very expensive; just move data down from flash to magnetic tub when probability of use is very low."
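To make the "temperature" idea concrete, here is a minimal sketch in Python of how an automated placement rule might work. The thresholds, the data set names, and the use of a 30-day read count as the heat signal are all my own assumptions for illustration; none of them came up in the chat, and a real tiering engine would weigh far more than read frequency.

```python
from dataclasses import dataclass

# Hypothetical thresholds (reads per day) separating hot, cool, and cold data.
HOT_READS_PER_DAY = 100.0
COLD_READS_PER_DAY = 1.0


@dataclass
class Dataset:
    name: str
    reads_last_30_days: int


def temperature(ds: Dataset) -> str:
    """Classify a data set as 'hot', 'cool', or 'cold' from its recent read rate."""
    reads_per_day = ds.reads_last_30_days / 30
    if reads_per_day >= HOT_READS_PER_DAY:
        return "hot"
    if reads_per_day >= COLD_READS_PER_DAY:
        return "cool"
    return "cold"


def storage_tier(ds: Dataset) -> str:
    """Map temperature to a tier: SSD for hot data, HDD for cool and cold data."""
    return "SSD" if temperature(ds) == "hot" else "HDD"


# Illustrative data sets with made-up access counts.
datasets = [
    Dataset("clickstream_today", 90_000),
    Dataset("monthly_sales", 300),
    Dataset("logs_2009", 3),
]
placement = {ds.name: (temperature(ds), storage_tier(ds)) for ds in datasets}
```

With these made-up numbers, only the clickstream data lands on SSD; the other two stay on rotating disk. Note that the rule only moves data down a tier when its read rate drops, which is in the spirit of the "moving data is expensive" point above.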
What are the benefits of using Hadoop as a new data archive?
We all agreed that Hadoop has many advantages as an archiving platform for older, colder data.
I said: "Hadoop is an ideal archiving tech, due 2 peta-scale, file-oriented storage, SQL-oriented query/access, & deep analytics....Archiving thrives on cost-effective voluminous storage plus rich search & retrieval tools. That's Hadoop's sweet spot."
Jeff Kelly said "Hadoop allows you to inexpensively store historical data but also access it when needed (relatively) quickly."
Avadhoot Patwardhan said "Hadoop is a powerful, economical and active archive. Thus, Hadoop sits at both ends of the large scale data lifecycle ....Archiving older data to cheaper storage while keeping the more recent data on the main cluster."
Michael Martin said Hadoop has "built-in data archive capabilities to analyze historical data and conduct real-time data analysis....Provides a path to move cold data into an active archive, which allows 4 historical data analysis...Delete data debris even w hadoop as archive, data of no value is still NO VALUE why pay anything for it?"
How can you best control storage costs for big data?
We discussed a wide range of big-data storage cost-control techniques.
I said: "Controlling storage costs involves cheap stor devices, max compression, stor-efficient dbms, multitemp mgmt, smart retention....Selectively profiling data before loading into big data storage is essential. Don't turn your Hadoop into costly Ha-dump....Storage-efficient database platforms, e.g., columnar, are key to big data storage cost control. Enable efficient compression....Actually, best cost control for big data storage is max consol of redundant storage resources, minimize data move, & max in-db analytics."
Michael Martin discussed retention policies that help control big-data storage costs: "Delete data debris even w hadoop as archive data of no value is still NO VALUE why pay anything 4 it?....Defensible disposal helps organizations reduce run rate storage costs while also reducing risk....Value-based Archiving: reduce storage costs, a proven way for CIOs to make the most of limited budgets....Only store data with business value or what you are required to by law or legal holds, delete the rest."
David Floyer discussed other cost-control strategies: "Key strategy is to minimize the movement of data; organized data centers and cloud services to minimize data movement.... Databases need to create metadata, compress/de-duplicate data in flash, & understand data location, & minimize data moving....Keep it simple; active data in flash, metadata in flash, all else in geographic distributed erasure encoded magnetic tubs."
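The retention rule that Michael Martin describes (keep what has business value or what law and legal holds require, delete the rest) can be sketched as a simple predicate. This is only an illustration in Python: the Record fields and the should_keep function are hypothetical names of my own, and real defensible-disposal tooling involves policy review well beyond a three-line check.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    name: str
    has_business_value: bool
    retain_until: Optional[date]  # hypothetical regulatory retention deadline, if any
    on_legal_hold: bool


def should_keep(record: Record, today: date) -> bool:
    """Keep data under legal hold, under a live retention rule, or with business value."""
    if record.on_legal_hold:
        return True
    if record.retain_until is not None and today <= record.retain_until:
        return True
    return record.has_business_value


# Illustrative records; every value here is made up.
today = date(2013, 4, 24)
records = [
    Record("old_clickstream", False, None, False),
    Record("tax_filings", False, date(2020, 1, 1), False),
    Record("disputed_emails", False, None, True),
    Record("customer_master", True, None, False),
]
to_delete = [r.name for r in records if not should_keep(r, today)]
```

In this toy example only the old clickstream data is flagged for disposal; everything else is protected by a hold, a retention deadline, or its business value.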
What is the most storage-efficient database approach for big data?
I stated that the "most storage-efficient database architecture is whatever enables max compression. Often, that's columnar (e.g., BLU)...In addition to columnar, there are also storage-efficiency advantages to key-value, inverted indexing, & tokenized storage."
Richard Lee said "best DB is the one that uses all the storage tools available; Column vs Row, Flash/SSD, Adaptive Compression, etc." He confessed to being "biased towards DB2 as the true UDB," but added that "architecture continues to evolve in all dimensions with no real limits.....There are many stand-alone best of breed DB types now, but few integrate seamlessly around common repositories."
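Why does columnar storage tend to compress so well? A rough way to see it is to run-length encode the same small table in row order and in column order. The sketch below is plain Python with a made-up table; it is not how BLU or any other product implements compression, just an illustration of the general principle that storing like values together produces longer runs.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


# Hypothetical table of (region, status) rows, sorted by region.
rows = [
    ("east", "open"), ("east", "open"), ("east", "closed"),
    ("west", "open"), ("west", "open"), ("west", "open"),
]

# Row layout: values from different columns interleave, so runs stay short.
row_layout = [value for row in rows for value in row]
row_runs = run_length_encode(row_layout)

# Columnar layout: each column is stored and encoded on its own.
columns = list(zip(*rows))
column_runs = [run_length_encode(column) for column in columns]

row_run_count = len(row_runs)
column_run_count = sum(len(runs) for runs in column_runs)
```

For this table the row layout yields 12 runs (no compression at all), while the two columns together yield 5. Real data is messier, but the same effect is why columnar engines can lean so heavily on compression.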
When will the tipping point come for all-SSD big-data environments?
I stated: "I've seen TCO numbers on SSD vs HDD that indicate tipping point has already come. SSD is more cost-effective over lifecycle...In terms of acquisition cost--SSD vs. HDD--on a per-TB basis, I predict the tipping point toward SSD will be in 2015." To support my prediction, I called everybody's attention to the following recent article: "SSD Flash Storage At Tipping Point: IBM" (http://bit.ly/17ivsMG ).
Jeff Kelly said "we're getting there faster than you think - for data-intensive companies, already going all SSD."
Richard Lee said "Tipping point is driven by 'the knee of adoption.' Once price vs. capacity vs. performance is close to HDD then 'game over.' Consumer uses Flash/SSD have led the way for use in Analytics & Big Data. Normally other way around. Classic Disruption."
When should you store big data in public cloud vs. on-premise?
I said you should "store big data in public cloud when doing so is cheaper, more manageable, & just as robust/secure as storing in-house. ...If data already in public cloud, often makes sense to leave there & move analytics to data....One advantage of public cloud storage is variable costs. For temporary/surge/project-centric storage."
Michael Martin noted that "cloud is widely understood to include 'private' deployments in addition to or in lieu of public cloud." He said "much of your source transactional data is already in a public cloud.....Public cloud offering MIGHT b only option if U need petabyte-scale, multistructured, big data capability."
Richard Lee responded: "When it makes sense! Information Architects should have full palette of options available. Hosting location is one of them.....Public Cloud options match Hosted ones on virtually all metrics except cost & agility. Major advantages to the Cloud here."
David Vellante (@dvellante), CEO and co-founder of The Wikibon Project, is a bit less sanguine about cloud storage. He tweeted: "When you don't care so much about losing access to it!....My summary of Public Cloud SLAs: 'We'll do our best but if we fail email us and we'll get back to you within 24 hrs (maybe)'....[but] having compute & storage resources proximate to the data is a good idea...[however] warn users that renting is almost always more expensive than owning in the long run."
Jeff Kelly tweeted "when you're bringing in large volumes of 3rd party data...yea, pub cloud security gets bad rap - often more secure than enterprise data centers....but still questions about public cloud SLAs and more granular security controls you may need."
How will data retention practices evolve in era of big data?
I stated that retention "practices evolving 2 recognize much big data is 'breadcrumbs' (clickstreams, tweets, etc.) that don't need long-term retention....Data retention practices evolving 2 recognize public cloud storage is most cost-efficient retention 4 largest data sets....Data retention best practices must evolve to selectively archive most storage-hogging data in cheapest cloud storage available."
Richard Lee said "eDiscovery, etc. drive Retention Schedules. big data cannot be ignorant of these requirements. Use Cases must support....Info. Gov. must have seat at table with Analytics Solutions Architects and vet Use Cases to understand Retention issues....Users of big data must understand the 'costs of persisting data to support Retention regs' before designing solutions."
Jeff Kelly said "CIOs and legal depts going to have some interesting chats now that it's affordable to basically store everything."
IBM big data summed up the consensus as "#InfoGov must be involved with analytics, biz opps, other to ensure retention is cost-effective, secure."
What are the challenges of using Hadoop as a storage layer for "cold" data?
I responded: "'Cold' data is least frequently accessed biz data that still must be retained. That's archive. Challenge is cost control....The question is essentially, what are challenges of using HDFS for cold storage. Key challenge is compression."
Avadhoot Patwardhan pointed to the challenge associated with the fact that "Hadoop clusters utilized for 'cold' are typically I/O intensive."
Richard Lee said "all Information Architects should develop 'heat map' of entire Analytics environment to meet changing agility requirements."
Michael Martin said "Because Hadoop is linearly scalable, you will increase your storage whenever you add a node." Another challenge is that "the storage layer, the Hadoop Distributed File System, only supports a single writer...Random access to the data is not really possible in an efficient manner either."
Jeff Kelly said it's "still not trivial to explore data in Hadoop, need java/Map Reduce skills - but still a lot better than tape!"
IBM big data stated the consensus: "Hadoop as storage layer for 'cold data' - doable but need the skills, and an eye towards compression!"
Continue the discussion & check out these resources
Here's the full Storify transcript of this tweetchat: http://storify.com/IBMbigdata/storing-big-data-bigdatamgmt-chat.
Here's an article on IBM offerings for storing big data in the cloud: ow.ly/knMbM.
Here's an article on IBM offerings for value-based archiving and defensible disposal: ow.ly/knMD5.
Here's an article on IBM's investment in Flash storage and solid-state drive technology: ow.ly/knN3o.
Here's an article for using the forthcoming IBM PureData System for Hadoop for moving cold data into an active archive for historical data analysis: ow.ly/knNJM.
Here's an article on IBM smarter storage offerings: ow.ly/knPNi.
Here's an IBM whitepaper on smarter data placement and access: ow.ly/knQaq.
Here's an article in which IBM discusses the industry tipping point at which SSD and flash storage become more cost-effective than rotating media: http://bit.ly/17ivsMG .
Here's an IBM webpage that discusses intelligent archiving of enterprise application data and addresses data retention: ow.ly/knTii.
Here's an IBM presentation deck on data protection and retention: ow.ly/knTAx.
Here's a webpage that discusses some storage limitations of Hadoop Distributed File System: ow.ly/knUwT.
Here's an article on why storing data in public cloud may be more secure than on-premises: http://bit.ly/17dZWzp.
Here's a Wikibon note on storing big data in the cloud: http://wikibon.org/wiki/v/Weighing_the_Costs_and_Benefits_of_Big_Data_in_the_Cloud
Keep adding your thoughts and ideas, and register for the following broadcast on Tuesday, April 30 for more: "Big Data at the Speed of Business"
Please engage us and let's continue this exciting discussion.