Platforms Well Suited for Complex Analytics

Combine traditional data warehouses and open source components for specific big data analytics needs

Big Data Industry Architect, IBM

Despite the ongoing excitement around big data and the Apache Hadoop platform for managing and analyzing unstructured data, big data, like all data, needs to be captured, stored, secured, analyzed, and eventually archived and deleted. Problems arise when ill-suited technology is pressed into service to manage or analyze very large volumes of data. For example, analytics tools such as the Statistical Analysis System (SAS) and IBM® SPSS® predictive analytics software, which rely on internal storage formats, may struggle with data volumes that commonly run to terabytes. Archiving this much data can also become an issue, especially on platforms such as Hadoop, which are not engineered for structured archiving processes.

Matching technologies with data types

Today, big data is stored in many locations, not just in single data warehouses. Certain storage or archival technologies are more efficient at storing and analyzing some types of data than others: some handle text well, others structured data. Even so, whether the data is text or numeric is not a major factor when choosing a storage platform, because several platforms can manage and analyze all types of data.

But many storage platforms use different access methods and query technologies, which can create barriers for cross-platform analysis. The question, then, is how multiple data extraction languages and tools can be used to perform complex analytics on data held in multiple data storage technologies. To understand the various platforms for storing data, it helps to create broad categories and place specific vendor technologies in them. Organizations can consider the following data management technology groups for big data analytics.

Online transaction processing (OLTP): Sometimes referred to as the Swiss Army knife of databases, OLTP systems are often found in the back offices of many organizations, processing transactional data. These databases can be more challenging than other platforms to scale to many terabytes of data, and they can be a little more complex for line-of-business users to use for analytics. OLTP systems are often not cost-effective, primarily because the skills required to implement and maintain them can be expensive. They are, however, designed to be safe, secure, and enterprise-capable. OLTP databases include IBM DB2® data management, Oracle Database 12c, the open source MySQL database, and Microsoft SQL Server.

Data warehouse appliances: Data warehouse appliances can be highly efficient for many analytics scenarios, and they can handle very complex query workloads. They can scale to huge data volumes; however, these platforms can be costly. They are often deployed like traditional relational database management systems (RDBMSs) but with one key exception: most data warehouse appliances are not appropriate for operational workloads such as enterprise resource planning (ERP) or billing systems. Although data warehouses still need some data modeling, they are less dependent on efficient data models to perform well. These appliances usually offer excellent support for advanced analytics. This category includes Teradata, IBM PureData™ for Analytics, powered by IBM Netezza® technology, and Greenplum data warehousing.

Analytics servers: Specifically designed for analytics, these platforms manage data but generally get it from OLTP systems or data warehouses. More importantly, they integrate data from multiple sources to accommodate advanced analytics. To help ensure performance, analytics servers should be set up to push the analytics down into the databases, because moving terabytes of data from one platform to another for analysis is not practical. Organizations deploying analytics servers need to check the configuration and determine whether the code is executing on the database platform or locally on the analytical platform. Execution should happen in the database as much as possible. Many big data platforms natively support analytics through SQL or the R analytical language. Analytics servers that can push analytics onto the scalable platforms include the SAS/ACCESS family, IBM SPSS Modeler data mining workbench, and IBM SPSS Analytic Server.
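
To make the push-down idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any of the products above work internally: the `sales` table, its columns, and both function names are hypothetical, and the standard-library `sqlite3` module stands in for a large warehouse only so the example runs anywhere.

```python
import sqlite3

# Hypothetical stand-in for a large warehouse table (assumption: sqlite3 is
# used only so the sketch is runnable; it is not a big data platform).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 70.0), ("west", 30.0)],
)


def totals_local(conn):
    """Local execution: every row travels to the client before aggregation."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """Push-down: the database aggregates and returns one row per region."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())


local = totals_local(conn)
pushed = totals_pushed_down(conn)
print(local == pushed)
```

Both functions produce the same totals. The difference is how much data crosses the connection: the first transfers every row, the second transfers one row per region, which is the gap that matters once the table holds terabytes.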

Hadoop: Platforms based on the open source Hadoop implementation are highly scalable and can support some analytics directly in the platform through such features as the IBM InfoSphere® BigInsights™ Big R library and the IBM Big SQL interface. The Hadoop system should be wrapped with enterprise-quality security, archiving, and monitoring. Hadoop platforms are not as fast as appliances when supporting complex analytics, but they can be implemented cost-effectively. Hadoop platforms do not need data models up front, but they are not recommended for some enterprise-scale operations such as billing and ERP.
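
The point about not needing data models up front is often described as schema-on-read: data is stored as it arrives, and structure is applied only when a question is asked. The following Python sketch shows the idea on a few in-memory lines. The event records, field names, and the `read_with_schema` helper are all hypothetical, and nothing here uses Hadoop itself.

```python
import json

# Hypothetical raw event log, kept exactly as it arrived. No table definition
# or data model was created before these lines were written.
raw_lines = [
    '{"user": "a", "action": "view", "ms": 120}',
    '{"user": "b", "action": "buy", "amount": 19.99}',
    '{"user": "a", "action": "buy", "amount": 5.00}',
]


def read_with_schema(lines, fields):
    """Apply a structure at read time: keep only the requested fields,
    filling in None where a record does not carry one."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# The "model" (user, amount) is chosen by the query, not by the storage layer.
purchases = [
    row
    for row in read_with_schema(raw_lines, ["user", "amount"])
    if row["amount"] is not None
]
total_spent = sum(row["amount"] for row in purchases)
```

A different question could read the same lines with a different field list, which is why this style suits organizations that do not yet know how they will model their data.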

Deploying platforms for the right analysis

When designing a platform for complex big data analytics, organizations can apply the following basic rules:

  • Organizations should consider Hadoop when they are uncertain about the types of analytics to be performed—for example, data mining or simple aggregations—or how to model the data. Hadoop is also advised for organizations that have a large volume of data and do not need operational capabilities such as single-row lookups and inserts.
  • Data warehouse appliances may be preferable for organizations that fit the previously mentioned considerations for Hadoop but also need enhanced performance and are capable of building simple data models.
  • Organizations with analysis requirements for comparing large tables or comparing billions of records with millions of records—for example, matching credit card numbers with transactions—can consider deploying a data warehouse appliance.
  • Certain data-mining algorithms and analytics can be made to run in parallel, making them highly efficient for large platforms. Finding positive or negative sentiment in social media, for example, can be a parallel operation that is well suited for Hadoop. However, neural network and classification algorithms may require a single, powerful analytics server to perform analyses efficiently. The performance of these analyses can vary widely depending on the platform type.
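
The sentiment example in the last rule parallelizes well because each post can be scored without looking at any other post, and the per-post results are combined only at the end. The sketch below shows that shape in Python. It rests on several assumptions: the word lists are toy placeholders rather than a real sentiment model, the function names are hypothetical, and a local thread pool merely stands in for the nodes of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical word lists; a real sentiment model would be far richer.
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "slow"}


def score_post(text):
    """Map step: label one post using only that post's own words."""
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def count_sentiment(posts, workers=4):
    """Score posts independently, then reduce the labels to counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        labels = pool.map(score_post, posts)
        return Counter(labels)


posts = [
    "love the new release",
    "checkout is slow and bad",
    "good value but slow shipping",
    "arrived on tuesday",
]
counts = count_sentiment(posts)
```

Because `score_post` shares no state between posts, the work can be split across as many workers as are available. An algorithm whose steps depend on a shared, frequently updated model does not divide this cleanly, which is the contrast the rule draws with neural network and classification workloads.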

For some organizations, a combination of these platforms can be the optimal implementation. Hadoop platforms can handle long-running queries that do not require deep analytics, and data warehouse appliances are designed to excel at analyzing structured data and running complex analytics. These platforms can be integrated dynamically through the analytics platforms to provide a smooth access method to the data.

Please share any thoughts or questions in the comments.