Embracing Big Data From the Warehouse
A pragmatic approach to the big data journey
Internet behemoths Google and Facebook have created value by managing and analyzing big data, which has prompted CIOs to ask whether these new technologies might deliver positive results within their own enterprises. Aggressive growth rates projected by industry analysts encourage such thinking. Wikibon projects that the big data market will grow from USD5 billion in 2012 to more than USD30 billion by 2015 and to USD53.4 billion in 2017 (the report is available free of charge). IDC is more conservative, estimating that the market will reach USD16.9 billion in 2015. Many IBM clients start their big data journeys with their largest managed collection of structured data: their data warehouses. This article reviews how organizations can create real business value by combining technologies that fall under the umbrella of big data management.
Deploying big data management technologies
Technologies deployed to manage big data include:
- Enterprise data warehouses
- Data warehouse appliances
- Apache Hadoop clusters
- Log analysis and complex event processing on streaming data
- Software to move and integrate data between individual management nodes
A team of IBM authors provides deep insight into the newer of these technologies in a free IBM eBook, Understanding Big Data.
Bringing big data under management is the first step. Organizations create value when they use algorithms to analyze data and develop an understanding of cause and effect. However, these analytic algorithms are computationally demanding and are most compatible with data management technologies running on massively parallel processing hardware architectures.
Modernizing a data warehouse
Many first-generation data warehouses are deployed on computing architectures that are poorly matched to the demands of analyzing big data. Hundreds of organizations now exploit big data after modernizing their warehouses by replacing older database management systems running on symmetric multiprocessing hardware with an IBM Netezza data warehouse appliance.
In the world of mobile telephony, maintaining high-quality service across the network is fundamental to customer satisfaction, because unhappy customers will leave for a competitor. T-Mobile’s first-generation warehouse could not aggregate data at sufficient scale for the company to understand events on its entire network. At 40 TB, the Oracle-based warehouse was beyond its limit, denying the company a deep understanding of the quality of its service. Modernizing its warehouse with Netezza freed T-Mobile to load 17 billion network records each day, and analyzing this data provides detailed insight into quality of service and customer satisfaction. The warehouse currently manages 2 PB of data and provides analytics support to 1,300 business users. It has been so successful that adoption has spread beyond its original user base in network operations to revenue assurance, billing, marketing, and customer care. Christine Twiford, Manager, Network Technology Solutions at T-Mobile, discusses the company’s big data journey in a video.
Augmenting a data warehouse with an appliance
Banks and other finance companies must exercise tight control of their computing assets to comply with industry regulations. Given the distributed and networked configurations of modern computer systems, meeting regulatory demands can be a challenge, and understanding data residing on each machine complicates this already-difficult task. As in other industries, banks have turned to computing to improve inventory control effectiveness and reduce costs, but their efforts have often been piecemeal. For example, one bank previously relied on more than 40 systems that together managed hundreds of terabytes of data. This “system of systems” was cumbersome and ineffective. Answering a seemingly simple question about one computer's configuration and the data under its management could take weeks of work.
A rethinking of data management strategy prompted this bank to create a new infrastructure master data hub that teams IBM DB2 with a Netezza appliance. By taking advantage of the massive scalability offered by DB2, the bank consolidates those hundreds of terabytes previously distributed across more than 40 systems into a single, integrated database with a common data model. Beyond data consolidation, DB2 also serves as the operational data store (ODS), answering short queries with a very high arrival rate. Integration software moves data quickly from the ODS to Netezza for reporting and advanced analytics. Now the bank's data configurations are under the control of an overarching governance model while its business has a near-real-time view of its asset inventory.
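The division of labor described above, with short keyed lookups on the operational data store and aggregate reporting on the analytics system, can be illustrated with a minimal sketch. It assumes two in-memory SQLite databases as stand-ins for DB2 and Netezza; the `asset` table, its columns, and the function names are hypothetical and are not taken from the bank's actual data model or integration software.

```python
import sqlite3

# Hypothetical schema: one row per managed machine. Both stores are in-memory
# SQLite databases standing in for the ODS and the analytics appliance.
def make_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asset (host TEXT PRIMARY KEY, site TEXT, data_tb REAL)"
    )
    return conn

def lookup_asset(ods, host):
    """Short operational query: fetch one machine's record by key."""
    return ods.execute(
        "SELECT host, site, data_tb FROM asset WHERE host = ?", (host,)
    ).fetchone()

def sync_to_warehouse(ods, warehouse):
    """Batch-copy the current ODS contents into the analytics store."""
    rows = ods.execute("SELECT host, site, data_tb FROM asset").fetchall()
    with warehouse:  # commits on success, rolls back on error
        warehouse.execute("DELETE FROM asset")
        warehouse.executemany("INSERT INTO asset VALUES (?, ?, ?)", rows)
    return len(rows)

def data_by_site(warehouse):
    """Analytic query: total data under management per site."""
    return warehouse.execute(
        "SELECT site, SUM(data_tb) FROM asset GROUP BY site ORDER BY site"
    ).fetchall()

ods, warehouse = make_store(), make_store()
with ods:
    ods.executemany(
        "INSERT INTO asset VALUES (?, ?, ?)",
        [("db-01", "london", 12.5), ("db-02", "london", 7.5), ("app-01", "nyc", 3.0)],
    )
copied = sync_to_warehouse(ods, warehouse)
print(lookup_asset(ods, "db-02"))
print(data_by_site(warehouse))
```

A full refresh keeps the sketch short; a production integration layer would more likely move only changed rows to keep the analytics view close to real time.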
Using Hadoop with a data warehouse
Hadoop is a highly reliable and scalable data processing system. Its benefits include the ability to load data without a schema and to process and analyze this unstructured (or poly-structured; see note 1) data at scale on inexpensive hardware. Hadoop processes data in batch; it has no optimizer and cannot support random access or interactive queries. Those capabilities are the strengths of database systems such as Netezza.
Edmunds.com, the online car sales company, teams Hadoop, as a data ingestion engine, with its Netezza warehouse; the Netezza Hadoop adapter moves data between the two systems. Hadoop analyzes massive volumes of unstructured data, including text, voice, and tweets, outside the data warehouse, transforms it to relational formats, and passes the structured data to Netezza, where the Edmunds.com analytics team integrates social media and consumer feedback into all aspects of the business. A slide show provides more details.
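The step of transforming unstructured text to a relational format can be sketched in a few lines. The example below is plain Python, not a Hadoop job and not the Netezza Hadoop adapter; the input line format, the word lists, the column names, and the crude word-count sentiment score are all assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical raw feedback lines, standing in for text landed in Hadoop.
RAW_POSTS = [
    "2012-03-01 @carfan Loved the new Civic, great mileage!",
    "2012-03-02 @roadie The Focus dealer was slow and unhelpful",
    "malformed line without the expected prefix",
]

# Assumed layout: ISO date, @user, then free text.
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) @(\w+) (.+)$")
POSITIVE = {"loved", "great", "good"}
NEGATIVE = {"slow", "unhelpful", "bad"}

def to_row(line):
    """Map one raw line to a (date, user, sentiment, body) tuple, or None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None  # skip lines that do not fit the assumed layout
    date, user, text = match.groups()
    words = set(re.findall(r"[a-z]+", text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (date, user, score, text)

def to_csv(lines):
    """Emit the structured rows as CSV, a common warehouse load format."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["post_date", "user_name", "sentiment", "body"])
    for line in lines:
        row = to_row(line)
        if row is not None:
            writer.writerow(row)
    return buf.getvalue()

rows = [r for r in map(to_row, RAW_POSTS) if r is not None]
print(to_csv(RAW_POSTS))
```

Because `to_row` handles one line at a time and keeps no state, the same function could serve as the map step of a batch job that spreads the work across many nodes.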
Supplementing a data warehouse with real-time complex event processing
The Smart Grid Demonstration Project at Pacific Northwest National Laboratory (PNNL) is the largest regional collaboration in the United States, involving 60,000 customers and 11 utilities across five states. The project manages big data by pairing log analysis and real-time complex event processing (using IBM InfoSphere Streams) with a data warehouse appliance (IBM Netezza). InfoSphere Streams analyzes millions of messages cascading from grid control systems, each communicating a status or an event, and detects problems with the potential to disrupt electricity supply. The same data is then passed to Netezza, which maintains a history of events and runs deeper analysis to identify trends and other patterns indiscernible in real time alone. These analyses improve grid reliability and reduce operational costs on a dynamic new data management platform. Netezza feeds these analyses back to InfoSphere Streams, refining its real-time analyses of control system data. Ron Melton, the director of the Smart Grid Demonstration Project, covers the project in a detailed presentation.
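The feedback loop between real-time processing and historical analysis can be sketched as follows. This is a toy model in plain Python, not InfoSphere Streams or Netezza code; the class names, the mean-plus-three-standard-deviations rule, and the sample readings are assumptions chosen only to show how analysis of history can refine a real-time threshold.

```python
from statistics import mean, pstdev

class StreamMonitor:
    """Real-time side: flags readings above the current threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def process(self, reading):
        return reading > self.threshold

class HistoryStore:
    """Warehouse side: keeps every reading and derives a refined threshold."""
    def __init__(self):
        self.readings = []

    def append(self, reading):
        self.readings.append(reading)

    def refined_threshold(self, k=3.0):
        # Assumed rule: mean plus k population standard deviations of history.
        return mean(self.readings) + k * pstdev(self.readings)

def run(readings, monitor, history, refresh_every):
    """Check each reading in real time, store it, and periodically feed the
    threshold derived from history back to the monitor."""
    alerts = []
    for i, reading in enumerate(readings, start=1):
        if monitor.process(reading):
            alerts.append((i, reading))
        history.append(reading)
        if i % refresh_every == 0:
            monitor.threshold = history.refined_threshold()
    return alerts

monitor = StreamMonitor(threshold=100.0)  # deliberately loose starting value
history = HistoryStore()
alerts = run([59.0, 61.0, 59.0, 61.0, 75.0, 60.0], monitor, history, refresh_every=4)
print(alerts, monitor.threshold)
```

With the initial threshold alone, the fifth reading would pass unnoticed; it is flagged only because the threshold was tightened from the first four stored readings.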
As an example of societal computing, PNNL’s Smart Grid Demonstration Project operates beyond the scale of many enterprises, yet its data management platform is instructive to CIOs investigating the value of real-time complex event processing. Combining InfoSphere Streams with Netezza creates a data management platform for new classes of applications—including financial markets trading analysis, fraud detection, network quality-of-service analysis, network threat detection, asset monitoring, and marketing campaign management. These and other applications will become more commonly used as inventory and assets are IP-enabled.
Advancing the enterprise platform
We are witnessing the emergence of an advanced enterprise data platform. By augmenting structured data stores with technologies designed to manage and analyze massive quantities of data in motion and at rest, enterprises equip themselves to exploit any and all data available to them. This new big data management platform is distributed, not monolithic: different platforms, each specialized and optimized for its task, work in tandem and share data and the results of their analyses. Starting the big data journey at the warehouse is a pragmatic approach, because warehouses represent the largest managed data stores and are the central focus of an organization’s skills and experience in data integration, security, and governance.
1 Industry analyst Curt Monash coined the term "poly-structured." For his discussion of poly-structured data, see http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data.