Taming Big Data

The realm of huge information flows is governed by new rules. What changes in the multi-petabyte world? And how will big data change what you do?

There’s a rush of information terraforming the IT world. It flows from the data generated by 4.3 billion cell phones and 2 billion Internet users worldwide, and joins the roiling torrent of 30 billion RFID tags and hundreds of satellites incessantly sending more signals with each passing second. Now, nobody ever has to deal with all the world’s data all at once. But when the whole pie grows, everyone’s slices get larger. When you start measuring the pie in zettabytes, even a small piece starts to get pretty filling. Here’s a sobering statistic: Twitter alone adds 12 terabytes of data every day—all text, and all added at a maximum of 140 characters at a time.

Dealing with data at this scale is a new frontier, and lots of different folks are approaching it in lots of different ways. But there’s a growing sense that we’re seeing the birth of a data challenge that’s like nothing that’s gone before. Some are calling it big data.

Big data: The three V’s

When they hear the term big data, most people immediately think of large data sets; when data volumes get into the multi-terabyte and multi-petabyte range, they require different treatment. Algorithms that work fine with smaller amounts of data are often not fast or efficient enough to process larger data sets, and there’s no such thing as infinite capacity, even with storage media and management advances.

But volume is only the first dimension of the big data challenge; the other two are velocity and variety. Velocity refers to the speed requirement for collecting, processing, and using the data. Many analytical algorithms can process vast quantities of information—if you let the job run overnight. But if there’s a real-time need (such as national security or the health of a child), overnight isn’t good enough anymore.

Variety signifies the increasing array of data types—audio, video, and image data, as well as the mixing of information collected from sources as diverse as retail transactions, text messages, and genetic codes. Traditional analytics and database methods are excellent at handling data that can easily be represented in rows and columns and manipulated by commands such as select and join. But many of the artifacts that describe our world can neither be shoehorned into rows and columns, nor easily analyzed by software that depends on performing a series of selects, joins, or other relational commands.

When you add volume, variety, and velocity together, you get data that doesn’t play nice. And as a result, dealing with big data demands a level of database agility and changeability that is difficult or impossible to achieve using today’s techniques alone. “In a traditional database, design is everything,” says Tom Deutsch, IBM Information Management program director. “It’s all about structure. If the data changes or if what you want to know changes—or if you want to combine the data with information from another stream or warehouse—you have to change the whole structure of the warehouse. With big data, you’re often dealing with evolving needs—and lots of sources of data, only some of which you produce yourself—and you want to be able to change the job you’re running, not the database design.”


Learning from extremes

Because traditional database managers and warehouses alone are often inadequate when dealing with big data, many organizations are adapting their systems to cope with a world of “badly behaved” data. These solutions vary according to the precise nature of the problems they attempt to solve—some are coping with high-velocity, high-volume information, while others must process enormous volumes of high-variability information. But it’s also possible to discern some common strategies and techniques that either reduce the magnitude of the information that needs to be stored or processed, or process it using newer, high-powered techniques that can handle the new, heavy-duty needs.

One company that’s coping with all three V’s is TerraEchos, a leading provider of covert intelligence and surveillance sensor systems that uses streaming data to monitor high-security facilities, national borders, and oil pipelines. The TerraEchos Adelos S4 sensor knowledge system combines acoustical readings from miles of buried fiber-optic sensor arrays with data coming from diverse sensor sources such as security cameras and satellites. This enormous volume of high-variability, high-velocity data—sometimes terabytes in just a few hours—must be collected, combined with information coming from other streams, and analyzed at breakneck speeds to look for intruders, detect seismic events, or find equipment breaks.

“We’re faced with analyzing data as it passes by on a high-speed conveyor belt. We don’t have the luxury of structuring it and putting it into a database first, because we want to be able to classify it within 2 to 3 seconds,” says TerraEchos CEO Alex Philp. “With digital signal processors sampling at a rate of 12,000 readings per second—and potentially thousands of different data streams—we have to use a totally different approach so that we can respond quickly.”

For TerraEchos, the first casualty of this nearly overwhelming data onslaught is the “extract-transform-load” paradigm that has dominated data processing for decades: extracting data from its source, performing numerous time-consuming operations to transform it so that it fits neatly into a row-and-column format in a predetermined schema, and finally, loading it into a data warehouse. Increasingly, companies are transforming—and analyzing—incoming information as it arrives. If it meets certain conditions—for instance, if the audio stream shows a pattern that sounds like a vehicle approaching—it’s immediately flagged for more analysis and often triggers other data-collection or data-storage efforts.

“We are constantly analyzing just a few seconds of data at a time,” says Philp. “If we find something, we can trigger processes that look for the corresponding video stream or look for something interesting and, if necessary, quickly save just a few frames of the video surveillance camera data for that particular area. It’s still a massive amount of streaming data, but that really cuts down on what we have to process and store.”
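Philp’s windowed approach can be sketched as a simple sliding-window filter: examine only the last few seconds of signal, and trigger further collection when something crosses a threshold. This is a hypothetical illustration, not TerraEchos’ actual classifier; the 12,000-samples-per-second rate comes from the article, but the window length, energy metric, and threshold are all assumptions.

```python
from collections import deque

SAMPLE_RATE = 12_000   # readings per second, per the article
WINDOW_SECONDS = 3     # "classify it within 2 to 3 seconds"

def stream_filter(samples, window=SAMPLE_RATE * WINDOW_SECONDS,
                  threshold=0.8):
    """Yield (sample_index, window_energy) each time the average
    energy of the last few seconds of signal crosses the threshold --
    the moment at which further data collection would be triggered."""
    buf = deque(maxlen=window)   # holds only the most recent window
    for i, s in enumerate(samples):
        buf.append(s * s)        # instantaneous energy of the sample
        if len(buf) == buf.maxlen:
            energy = sum(buf) / len(buf)
            if energy > threshold:
                yield i, energy
                buf.clear()      # re-arm after raising an alert
```

Because the buffer is bounded, memory use stays constant no matter how long the stream runs, which is the essential property of analyzing data “as it passes by” rather than storing it first.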

Filter first, ask questions immediately

To process the incoming torrent, TerraEchos uses analytics designed specifically for the types of data streams the company handles. The company has incorporated IBM InfoSphere Streams into its own Adelos S4 sensor knowledge system. IBM InfoSphere Streams parses incoming data and distributes the computational work across a myriad of processors, and its analytics packages are built for specific types of data, such as audio and video. For example, some of the processing involves rigorous statistical analysis of the incoming waveforms to determine the probable nature of possible threats.

The trend toward specialized analytics that are tailor-made for special types of data is already accelerating. For example, analytics with algorithms for textual understanding are already being used to pore through the vast streams of tweets and e-mails produced each day to look for such things as terrorism threats and shifts in the way that a product is perceived.

The TerraEchos system combines tailored analytics—in this case, from IBM InfoSphere Streams—with advancements in parallel-processing hardware to perform millions of simultaneous, rapid calculations on the binary acoustic data coming from thousands of sensors.

Many experts say that these techniques—filtering and analyzing data on the fly, using tailored analytics that understand how to process a variety of data in its “native” format, and bringing huge arrays of parallel processors to bear on incoming data—will soon dominate the data-processing landscape, as IT tries to cope with the special problems of high-volume, high-variety data moving at astounding speeds.

Five skill upgrades for big data opportunities

The big picture: Companies will probably spend less time and money defining, scrubbing, and managing the structure of data and data warehouses. Instead, they’ll spend more time figuring out how to capture, verify, and use data quickly, so these are the skills to master.

“Today, DBAs and other IT people spend a lot of time creating cubes and stuffing data into them,” says IBM’s Roger Rea, product manager for IBM InfoSphere Streams. “That’s going to change. In the future, instead of reading data, transforming it, and then loading, you’ll just load it as fast as you can and transform it as you do your queries. This new approach is more agile, but it means a shift in the way we think about data. It’s very different from managing according to the traditional relational model.”
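Rea’s “load it as fast as you can and transform it as you do your queries” idea is often described as schema-on-read: raw records are appended with no upfront modeling, and structure is imposed only when a question is asked. A minimal sketch, with made-up event records and field names:

```python
import json
from io import StringIO

# Ingest: raw events are appended as-is -- no upfront schema, no cubes.
# Note that records may have different fields; nothing is rejected.
raw_log = StringIO()
for event in ('{"user": "ann", "action": "buy", "amount": "19.99"}',
              '{"user": "bob", "action": "view"}',
              '{"user": "ann", "action": "buy", "amount": "5.50"}'):
    raw_log.write(event + "\n")

# Query: the transform (parse, filter, type conversion) happens here.
def revenue_by_user(log):
    totals = {}
    for line in log.getvalue().splitlines():
        rec = json.loads(line)               # "transform as you query"
        if rec.get("action") == "buy":
            user = rec["user"]
            totals[user] = totals.get(user, 0.0) + float(rec["amount"])
    return totals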

What can you do to be ready to seize new opportunities? Consider the following skills upgrades.

  • Learn to use new big data analytics. Some experts predict that data-mining software such as BigSheets—a spreadsheet-like interface used in IBM InfoSphere BigInsights—will make big data analytics more accessible to IT professionals and business analysts. Getting familiar with these tools and what they can do will probably benefit workers in a variety of IT disciplines.
  • Develop fluency in Java programming and related scripting tools. Many of the programs used to handle big data—such as Hadoop and MapReduce—are Java-based, so learning how to program in Java is an important skill. If you already know Java, you can probably start working through online tutorials or books on Hadoop.
  • Learn marketing and business fundamentals, with a focus on how to use new data sources. Already, affinity programs are exploring the complex factors that influence customer loyalty by mining such diverse sources as customer call-center data and Twitter feeds. Understanding how to use different sources of data and to apply them to such business problems will become more important for a variety of positions, from marketing to IT.
  • Develop a basic understanding of statistics. At the core of analytical software are the fundamentals of statistics. Knowing the basics of populations, sampling, and statistical significance will help you to understand what’s possible, and to better understand and interpret what the results mean. Your best bet is a marketing or business operations statistics course, where the material is more likely to be immediately applicable.
  • Learn how to combine data from different sources—especially public ones. Much of the power of large data sets comes in combining proprietary information (such as sales data collected by companies) with publicly available data sources (such as map information or government data). Just knowing what data is available can often spark new ideas for profitable ways to combine that information.
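The last point, joining proprietary data with public sources, can be illustrated with a toy example; all the figures and region names below are invented:

```python
# Proprietary data: a company's own sales totals by region (hypothetical).
sales = {"north": 120_000, "south": 95_000, "west": 40_000}

# Public data: census-style population figures for the same regions
# (numbers are made up for illustration).
population = {"north": 800_000, "south": 500_000, "east": 300_000}

# Joining the two yields a metric neither source holds alone:
# sales per capita, computed only where both sources overlap.
per_capita = {
    region: sales[region] / population[region]
    for region in sales.keys() & population.keys()
}
```

The intersection on keys makes the join explicit: regions present in only one source (“west”, “east”) drop out, which is usually the first data-quality question to answer when mixing sources.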

New technology for analyzing big data at rest

Although better ways of handling streaming information “in motion” are a large part of solving many big data challenges, just processing extremely large amounts of data at rest can be tough, if there’s enough of it—especially if it’s high-variety data. One approach to handling this broad set of problems efficiently is through massively parallel computations on relatively inexpensive hardware. For example, IBM InfoSphere BigInsights analytics software starts with open-source project Apache Hadoop, but substitutes its own file system and adds other proprietary technology.

Hadoop is a Java-based framework that supports data-intensive distributed applications, enabling them to work with thousands of processor nodes and petabytes of data. Optimized for the sequential reading of large files, it automatically manages data replication and recovery: if a particular processor node fails, replicated copies of the data allow processing to continue without interruption or loss of the rest of the computation. The result is a fault-tolerant system that can still sort a terabyte of data very quickly.

To achieve speed and scalability, Hadoop relies on MapReduce, a simple but powerful framework for parallel computation. In the Map phase, MapReduce breaks a problem into millions of parallel computations, each producing a stream of key-value pairs as output. It then shuffles the map output, grouping the pairs by key, and runs another round of parallel computations on the redistributed data, writing the results to the file system in the Reduce phase. For example, to process huge volumes of sales transaction data and determine how much of each product was sold, Hadoop would run a Map task over each block of the transaction file, emitting a product-and-quantity pair for each line item; the framework would group those pairs by product, and the Reduce phase would sum each product’s quantities to produce the answer.
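The product-count example can be simulated in miniature with the same three steps. This is a plain-Python sketch of the MapReduce pattern, not actual Hadoop code; in a real cluster the map tasks and the per-key reduces would each run on different machines.

```python
from collections import defaultdict

# Each "block" stands in for one chunk of the transaction file,
# with lines of the form "product,quantity".
blocks = [
    ["widget,3", "gadget,1", "widget,2"],
    ["gadget,4", "sprocket,7"],
]

# Map: each block is processed independently, emitting key-value pairs.
def map_phase(block):
    for line in block:
        product, qty = line.split(",")
        yield product, int(qty)

# Shuffle: group all emitted values by key across every mapper.
grouped = defaultdict(list)
for block in blocks:
    for product, qty in map_phase(block):
        grouped[product].append(qty)

# Reduce: one computation per key over its grouped values.
totals = {product: sum(qtys) for product, qtys in grouped.items()}
```

Because each map call sees only its own block and each reduce sees only one key’s values, every step can run in parallel, which is what lets the same two-step pattern scale from this toy to petabytes.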

Because this technology is so simple to understand and use, relying on just two steps, Map and Reduce, Hadoop-based systems have been applied to a wide variety of problems, particularly in social media.

Informing stream analysis with warehouse data

Some observers predict that the data warehouse will go the way of the rotary phone dial, but rumors of the death of the data warehouse are greatly exaggerated. Data warehouses will continue to play a big role in many enterprises, says IBM’s Deutsch. But they’ll increasingly be used with other software to “tease out” relationships in data that can then be used to handle incoming stream data on the fly.

“It’s hard to know what to look for in a stream of data if you haven’t already analyzed some historical data to look for patterns,” says Deutsch. “But warehouse data can help you find those patterns.”

For example, Deutsch says that when University of Ontario Institute of Technology researchers first used stream-monitoring software on data captured from hospital neonatal wards, they were looking for patterns in unstructured data that might predict infant decline or recovery. They started by analyzing information from each infant, including audio recordings, heart rate, and other indicators, and eventually teased out a correlation between patterns in the audio recordings of the babies’ cries and the onset of newborn distress a few hours later.

These discoveries are being used to monitor new stream data to flag the change in cries and provide early warnings to doctors and nurses of impending problems. The ability to analyze huge amounts of high-variety warehouse data led to insights that have changed how new incoming streams are monitored.
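That two-step workflow, mining historical warehouse data for a pattern and then applying it to the live stream, might be sketched as follows. The readings and the three-sigma rule here are hypothetical stand-ins for illustration, not the researchers’ actual method.

```python
import statistics

# Step 1 (warehouse): mine historical readings to learn what "normal"
# looks like. These values are invented stand-ins for a derived
# feature such as cry pitch or heart-rate variability.
historical = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1]
mean = statistics.mean(historical)
stdev = statistics.stdev(historical)

# Step 2 (stream): flag incoming readings that deviate sharply from
# the pattern learned offline -- the early-warning signal.
def flag_anomalies(stream, sigmas=3.0):
    return [x for x in stream if abs(x - mean) > sigmas * stdev]
```

The expensive analysis happens once, offline, over the historical data; the per-reading check on the stream is then cheap enough to run in real time.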

Bringing analytics to a wider class of users

As data sets get bigger and the time allotted to their processing shrinks, look for ever more innovative technology to help organizations glean the insights they’ll need to face an increasingly data-driven future.

Just changing the way one views data can go a long way. “A lot of people don’t really think of unstructured data—such as video, audio, and images—as holding important information, but it does,” Deutsch says. “It’s really important to realize that this data can be just as valuable as the transactional data we’ve been collecting for years, and we have to look for new ways to put that information to work.”

One thing is clear—new ways of handling big data are accelerating almost as quickly as the flow of information that’s driving them. As TerraEchos’ Philp puts it, “I feel as if I’ve got a front row seat at the revolution.”