How to Avoid Drowning in Big Data

Client Technical Manager, Analytics Solution Center, IBM

In the 1980s, John Naisbitt wrote, “We have for the first time an economy based on a key resource [information] that is not only renewable, but self-generating. Running out of it is not a problem, but drowning in it is.”[i] Little did Naisbitt know how much information we’d be creating 30 years later. By some estimates we are generating over 1 zettabyte (1 x 10^21 bytes) per year.[ii] How do you avoid drowning in all that data and still gain insights? That is the realm of Big Data solutions.

The IBM Analytics Solution Center's June seminar was on Big Data. We started off talking about the ‘big data conundrum’: the volume of data is growing so rapidly that the fraction of data an enterprise can analyze is decreasing. Because of this gap, we’re getting ‘dumber’ about our organizations and jobs over time. This is driving the need for improved analytics and platform technology that can help us process this large volume of data.

What do customers want to do with big data? Popular requests we’ve heard include: IT log analytics, RFID tracking and analytics, fraud detection and modeling, risk modeling, a 360° view of a person/place/thing, call center record analysis, and fusion of multiple unstructured objects (e.g., pictures, audio). Since we now collect so much data, the possibilities are limited only by your imagination, and by our ability to extract insights from the data.

To process these large volumes of data, special systems and applications are being deployed. Many of these are based on the Apache Hadoop middleware, which supports a distributed file system and processing environment for scalability, flexibility, and fault tolerance. IBM’s big data platform includes offerings based on Apache Hadoop with enhancements to improve workload optimization, security, and cluster hardening. The IBM offering (BigInsights) also comes packaged with advanced analytical capabilities for data visualization, text analysis, and support for machine learning analytics. One interesting item was the announcement that the enhancements would be packaged to allow them to work with other Hadoop distributions, such as Cloudera's. Another offering discussed in the seminar was the stream computing offering, designed to efficiently process “data in motion,” such as stock ticker streams and social media feeds.

One of the biggest challenges, given the huge volume of information, is finding the right information. Governments, utilities, and financial companies have this problem in particular because of the huge volumes they deal with. A recent IBM acquisition, Vivisimo, has developed a next-generation search engine to provide search across multiple big data and traditional platforms. Vivisimo provides a scalable search application framework that can perform a federated search across many different data sources, including the web, social media, content stores, and more traditional structured database systems. One feature that may be particularly appealing to government agencies and corporate environments is its ability to map the individual access permissions of each data item, authenticate users against each target system, and limit results to the information a user would be entitled to view if they were logged directly into the target system.

Vivisimo also offers a clever search tool that provides easy navigation and discovery, using both structured metadata (faceted search) and keywords that the program dynamically discovers by analyzing unstructured content. In addition, it provides an agile development layer that allows users to quickly create applications and dashboards to discover, navigate, and visualize information.
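To make the permission-aware federated search idea concrete, here is a minimal Python sketch. It is not Vivisimo's API: the `Item`, `Source`, and `federated_search` names are hypothetical, and the sketch assumes each data item carries the set of users allowed to see it. Each source filters its own results by the requesting user, so the merged list only contains what that user could have seen by logging into the source directly.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A searchable record (hypothetical structure for illustration)."""
    title: str
    text: str
    allowed_users: frozenset  # users entitled to view this item


@dataclass
class Source:
    """One target system, e.g. a content store or database (hypothetical)."""
    name: str
    items: list

    def search(self, user, keyword):
        # Enforce the source's own permissions before matching the keyword.
        kw = keyword.lower()
        return [
            item for item in self.items
            if user in item.allowed_users and kw in item.text.lower()
        ]


def federated_search(sources, user, keyword):
    """Query every source and merge the hits the user is entitled to see."""
    results = []
    for source in sources:
        for item in source.search(user, keyword):
            results.append((source.name, item.title))
    return results


sources = [
    Source("wiki", [Item("Outage report", "Network outage in region 4",
                         frozenset({"alice", "bob"}))]),
    Source("crm", [Item("Ticket 17", "Customer reported outage",
                        frozenset({"alice"}))]),
]

print(federated_search(sources, "alice", "outage"))
print(federated_search(sources, "bob", "outage"))
```

In this toy setup the same query returns different results for different users, because the permission check happens per item inside each source rather than once at the front door.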

The seminar also featured a customer case study of using big data for cybersecurity mission operations. IP traffic is growing at a 29% CAGR, and with it, the cyber-threats the customer is facing. Unfortunately, the customer’s headcount isn’t growing, so more automated ways are needed to detect and respond to threats. For this application, timeliness is key: threats must be dealt with in real time. To identify potential threats, they want to be able to compare current threat and traffic data to norms from the recent past and from similar periods further back. Their solution utilizes the Netezza data warehouse appliance for near-real-time data and IBM BigInsights for long-term storage. The solution eliminates as many mundane “data retrieval” tasks as possible for the analysts, and provides them with those datasets that have a high probability of being “interesting.” In this way, the solution helps the analysts deal with the extreme data volumes, and yet remains flexible to the changing threat environment.
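As an illustration of comparing current traffic to past norms, here is a small Python sketch. The seminar did not describe the customer's actual method, so everything here is an assumption: the z-score test, the threshold of 3 standard deviations, and the function names `zscore` and `is_interesting` are all hypothetical. The idea is simply to flag a measurement when it deviates strongly from either the recent past or the same period in earlier cycles.

```python
from statistics import mean, stdev


def zscore(current, history):
    """Standard deviations between `current` and the mean of `history`."""
    if len(history) < 2:
        return 0.0  # not enough history to estimate a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is an extreme deviation.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma


def is_interesting(current, recent, same_period, threshold=3.0):
    """Flag `current` if it deviates from either baseline (assumed rule)."""
    return (abs(zscore(current, recent)) >= threshold
            or abs(zscore(current, same_period)) >= threshold)


# Hypothetical connection counts per minute.
recent = [100, 102, 98, 101, 99]        # the last few minutes
same_period = [100, 105, 95, 100, 102]  # same minute on previous days

print(is_interesting(101, recent, same_period))
print(is_interesting(500, recent, same_period))
```

Only the flagged measurements would be handed to an analyst, which is one way to read the “high probability of being interesting” filter described above. Whether the two baselines should be combined with "or" or "and" is a tuning decision that depends on how many false alarms the analysts can absorb.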

Do you have an opportunity to use massive amounts of data to accomplish a business or mission objective that couldn’t be achieved when we were limited to small volumes of data? Do you have an innovative solution? We’d like to hear your stories about big data.

For more on the Big Data seminar, see our ASC website under past events.

[i] Naisbitt, John, Megatrends: Ten New Directions Transforming Our Lives, New York: Warner Communications Company, 1982, pp. 23-24.

[ii] IDC Digital Universe Study, 2011.