Technology Innovations for Enhanced Data Management
Data management has undergone significant change since the introduction of online transaction processing (OLTP) systems some 50 years ago. The level of change during this period, however, has not been uniform. There have been times when data management technologies and products became commoditized, and periods when the data management industry saw significant enhancements and major differentiators between products. Examples of these latter disruptive periods include the introduction of relational database technology in the 1980s, and the industry move toward the use of data warehousing in the 1990s. We are now entering another period of disruption due to the current industry focus on supporting and exploiting big data technologies.
The concept of big data evolved from solutions developed initially by web companies such as Google and Yahoo. These companies needed to manage and index huge volumes of web data, and existing technologies were unable to support this in a timely or cost-effective manner. To solve this problem, these web companies developed their own solutions. Most of these solutions were oriented toward IT programmers, rather than business users. The term NoSQL is often associated with these solutions because they do not use relational database technology. I prefer to use the term non-relational instead since some of these systems do support a subset of SQL.
Several of these non-relational systems have been donated to the open source community, which has led to the growing use of them in a wide range of industries. These systems enhance and extend the existing data warehousing environment by enabling many new types of data to be managed and analyzed, and offering improved price/performance for certain types of workloads.
The advent of big data and increasing interest in blending new sources of data (web, social computing and sensor data, for example) into the business decision-making process has also led to significant advances in relational database technology. Today’s relational database management systems (RDBMSs) support a broad range of data types and have been enhanced to support growing data volumes. Many of them also provide a rich set of analytic capabilities that enhance business intelligence (BI) processing.
To support growing data volumes and more advanced analytic processing, RDBMS products must be able to handle both the data and workloads required to support business needs. Performance has many dimensions. In terms of data management, the dimensions are the amount of data to be managed (data volume), its rate of generation or change (data velocity), the types of data to be managed (data variety), and the number of data sources, structures and relationships involved (data complexity). In the case of workloads, the dimensions are the types and complexity of the application processing (workload complexity), data currency and response time requirements (workload agility) and the makeup of the overall workload (workload mix).
The product and technology selected to support any given project will be determined by the actual performance requirements in any of these data and workload dimensions. The higher the requirement in any dimension, the more important it will be to choose an RDBMS that can be optimized to provide good performance in that dimension.
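One way to make this selection reasoning concrete is to score a project against each dimension and see which ones stand out. The following is a minimal sketch only: the class name `RequirementProfile`, the 1-to-5 scale and the threshold of 4 are illustrative assumptions, not part of any product or methodology described here.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RequirementProfile:
    """Hypothetical 1 (low) to 5 (high) scores for the seven dimensions."""
    data_volume: int
    data_velocity: int
    data_variety: int
    data_complexity: int
    workload_complexity: int
    workload_agility: int
    workload_mix: int

    def critical_dimensions(self, threshold: int = 4) -> List[str]:
        """Return the dimensions whose score meets or exceeds the threshold."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) >= threshold]


# Example project: very large data volumes and complex analytic processing.
profile = RequirementProfile(
    data_volume=5, data_velocity=2, data_variety=3, data_complexity=2,
    workload_complexity=4, workload_agility=3, workload_mix=1,
)
print(profile.critical_dimensions())
```

The dimensions this returns are the ones where the chosen RDBMS most needs to be optimizable.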
To stretch the boundaries of RDBMS performance across the many dimensions outlined above, vendors are enhancing their products in three key areas:
- New data structures. Examples here include support for columnar as well as row-based data stores within the same RDBMS, data stores structured to suit different varieties of data, and enhanced data compression techniques.
- Hardware exploitation. Hardware is constantly improving in price/performance, and it is important that RDBMS products exploit hardware performance improvements such as large hardware clusters, multicore processors, new processor capabilities such as SIMD (single instruction, multiple data), large memory spaces for in-memory processing, and hybrid data storage from high-performance solid-state drives to high-capacity but slower disk drives.
- DBMS extensions. Technologies here include in-database analytic functions, an intelligent query optimizer that understands and can exploit new data structures and hardware exploitation features provided by the DBMS, and an intelligent workload manager that can efficiently handle mixed and complex workloads.
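The difference between row-based and columnar data stores can be illustrated with a small sketch. The table contents and the helper `to_columns` are invented for illustration, and a real RDBMS stores data in pages on disk or in memory rather than in Python lists. The point is only that an aggregate over one column needs to touch a single contiguous list in the columnar layout, while the row layout forces a visit to every record.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EAST", "amount": 120.0},
    {"order_id": 2, "region": "WEST", "amount": 75.5},
    {"order_id": 3, "region": "EAST", "amount": 210.0},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Columnar layout: each column's values are stored together.
columns = to_columns(rows)

# Row layout: every record is visited to read a single field.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Row layouts tend to suit OLTP-style access, where a whole record is read or written at once, which is one reason for supporting both kinds of store within the same RDBMS.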
Vendor Example: IBM
At the heart of IBM’s data management product set is the IBM DB2 relational DBMS. The latest release of the product is version 10.5, which incorporates many innovative enhancements for both OLTP and BI processing. There is insufficient room here to describe all of the new DB2 features in detail. Instead, I will limit discussion to the capabilities that support the four main business and IT drivers behind recent innovations in data management and business intelligence:
- New sources of data. IBM’s strategy here is to both extend DB2 and provide connectivity to other products such as IBM InfoSphere BigInsights. In the case of DB2, in addition to existing support for XML data, IBM is working on JSON support, which is available as a database technology preview program.
- Business ease of use and self-service. Requirements here are largely independent of the underlying DBMS. It is worth noting, however, that in this release IBM is focusing on the use of DB2 in both mobile and cloud-computing environments, and on packaged appliances that are pre-optimized for specific OLTP and BI workloads.
- Improved performance. The performance features in DB2 10.5 are grouped under the label BLU Acceleration. Capabilities include in-memory columnar processing that speeds up analytical processing, compression techniques that allow data to be processed without the need for decompression, the ability to skip unnecessary processing of irrelevant data, and parallel processing improvements including exploitation of multi-core processors and processor SIMD instructions. These new capabilities not only improve OLTP and BI performance, but also reduce data storage requirements.
- Reduced IT costs and increased IT flexibility. The new release of DB2 contains a variety of improvements here. Key enhancements include high availability and disaster recovery for OLTP environments, enhanced online administration and maintenance, integration and automation of BLU Acceleration capabilities, simplified and automated workload management, and Oracle RDBMS compatibility enhancements. It is also interesting to note that DB2 supports both shared-disk and shared-nothing hardware architectures, which makes it easier to optimize for either an OLTP or a BI environment.
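Two of the performance ideas mentioned in the list above, processing compressed data without decompressing it and skipping irrelevant data, can be sketched in generic terms. This is not IBM's implementation. It uses textbook dictionary encoding and per-block minimum/maximum summaries as stand-ins, and all function names, block sizes and sample values are assumptions made for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code from a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def count_equal(dictionary, codes, target):
    """Count matches by comparing codes; the column is never decoded."""
    if target not in dictionary:
        return 0
    target_code = dictionary.index(target)
    return sum(1 for code in codes if code == target_code)


def build_synopsis(values, block_size):
    """Record (start, minimum, maximum) for each block of values."""
    synopsis = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        synopsis.append((start, min(block), max(block)))
    return synopsis


def sum_greater_than(values, synopsis, block_size, threshold):
    """Sum values above the threshold, skipping blocks that cannot match."""
    total = 0
    blocks_scanned = 0
    for start, _low, high in synopsis:
        if high <= threshold:
            continue  # no value in this block can exceed the threshold
        blocks_scanned += 1
        total += sum(v for v in values[start:start + block_size]
                     if v > threshold)
    return total, blocks_scanned


regions = ["EAST", "WEST", "EAST", "NORTH", "EAST", "WEST"]
dictionary, codes = dictionary_encode(regions)
east_count = count_equal(dictionary, codes, "EAST")

amounts = [10, 20, 15, 500, 650, 700, 30, 25, 12]
synopsis = build_synopsis(amounts, block_size=3)
total, blocks_scanned = sum_greater_than(amounts, synopsis, 3, threshold=100)
print(east_count, total, blocks_scanned)
```

In this sample, only one of the three blocks of `amounts` has to be scanned to answer the query, and the count on `regions` compares integer codes rather than strings.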
Several other IBM products complement the capabilities of IBM DB2. These include:
- IBM InfoSphere BigInsights is a non-relational and Apache Hadoop-based solution for managing and analyzing massive volumes of structured and multi-structured data. The latest release of this product includes an SQL interface and an enhanced file system. BigInsights is also the cornerstone of IBM’s PureData System for Hadoop packaged hardware and software appliance.
- IBM InfoSphere Streams is a stream processing system for filtering and analyzing large volumes of in-motion data.
- IBM Cognos and SPSS software is used to build descriptive, predictive and prescriptive BI applications.
Organizations now have a wide range of OLTP and BI solutions and options available to them. This rich set of choices can bring significant business benefits, but also makes the task of platform and product selection more complex for the IT organization.
As explained above, there are four business and IT drivers that need to be considered when evaluating the benefits of new and innovative technologies: access to new types and sources of data, business ease of use and self-service, improved performance, and reduced IT costs and increased IT flexibility. These technology innovations can be used to enhance existing application systems or to deploy new ones. Enhancing an existing system is typically more difficult than deploying a new one, since existing technologies may limit the improvements that can be made.
Regardless of whether new technologies are used to improve existing systems or to build new ones, it is quite clear that a one-size-fits-all approach to technology selection is not viable given the complexity of today’s business needs and growing transaction and data volumes. Instead, organizations will need to deploy technologies based on a range of different, and often mutually exclusive, business and IT requirements. A flexible and integrated information architecture is therefore required to support these technologies. As discussed above, IBM provides a product suite that supports such an information architecture, with IBM DB2 acting as the cornerstone of this solution.
This article is based on a white paper that Claudia Imhoff of Intelligent Solutions, Inc., and I wrote for IBM, entitled “Technology Innovations for Enhanced Database Management and Advanced BI.”