Avoid the Workload Bottleneck
Isolating workloads helps revolutionize data warehouse design by capitalizing on recent advances in technology
The information explosion has rocked the world of data over the last five years. Mobile technology advancements, the availability of tablets and smartphones, and the rapid growth of social media have all contributed to both production and consumption of data at volumes never seen before. Other contributing factors include recommendation engines; cool new visualization capabilities for business intelligence (BI); advances in software technology to enable machine learning; and smart systems in hospitals, airports, airplanes, automobiles, and more.
Different types of data with varying degrees of complexity are produced at multiple levels of velocity. All this data is now used for learning more about an enterprise, its customers, the brand, its equity and clout, and the competition—all the factors that drive a business, which are quantifiable and available for use in decision making. This explosion of information has caused a flurry of hyperactivity in the enterprise, and business users are wondering about adopting new and additional data for analysis and assessments.
As this story unfolds, IT and data warehouse teams may start dreading the very thought of rising data volumes. Why? All this data means that existing systems now have additional work to do processing this data. This increase in processing is the challenge, and it is where the nuance of workload is introduced.
Working with workloads
A workload can be defined as the execution and completion of a task that utilizes a mix of resources—for example, processing power, storage, disk I/O, and network bandwidth. At any given time, any system that is processing information is executing a workload. Processing of workloads is common to the world of data warehousing and BI, and very applicable to the underlying information architecture.
Think for a minute about the average data warehouse and the information processing associated with it. On average, there are three types of workloads in a standard data warehouse. There is data from multiple sources that are being loaded and integrated, downstream systems and data marts that are accessing data from the same data warehouse, and analytical platforms executing and analyzing large complex queries.
Then, load complexity enters into this equation. There are jobs that need to load large fact tables, load large history tables, and move data from one level to another depending on the age of the data. While executing these specific jobs, a storage area network (SAN) environment may be showing signs of slowing down. The SAN may be shared across online transaction processing (OLTP), data warehouse, and analytical databases. In addition, the storage is sharing disk space between the data warehouse and OLTP systems.
Before proceeding further down this path, consider the architecture that is in place. Typically, it is a three-tier architecture: application, data, and storage. In most cases, two out of the three tiers are shared across the enterprise. When these systems are commissioned and designed, the basic premise is to provide sustained speed of performance per each service-level agreement (SLA).
In the first few months, because of the volume of data and number of line-of-business users, the performance may meet and exceed all SLAs. Within six months to a year, the performance curve can slow down, sometimes drastically. This slowdown is not uncommon, and it is not caused by end-user adoption of the data warehouse. Instead, there are different categories of activities that affect a typical data warehouse (see Figure 1). The following categories affect conventional data warehouse systems:
- Performance: Measured in SLAs, network throughput, SAN throughput, I/O, and memory consumption and affected by increase of data types, query volume, query types, data center growth, and end-user adoption
- Growth: Measured in increase of data sources, data types, query types, end users, and data volume
- Complexity: Measured in types of reports, analytics, and dashboards built by end users
Figure 1: Data warehouse parameters during a single year
Trafficking workload flow
Different types of workloads can impact databases. There are reporting workloads such as static and interactive BI reports, and analytical workloads such as those generated by analytical queries and statistical models, including IBM® Statistical Package for the Social Sciences (IBM SPSS®), SAS, and R software. There are also computing workloads such as those generated by intensive SQL operations within the database.
Workloads that impact networks include functional workloads such as commodity tasks executed from web services as part of service-oriented architectures (SOAs) and traffic-generated data workloads. Workloads that impact SANs include storage workloads for handling data storage and retrieval.
Workloads can impact overall data warehouse performance in much the same way traffic impacts a highway. In one scenario in which traffic flow on a highway is heavy but moving, as long as the vehicles stay in their lanes, traffic continues to progress—albeit slowly. Similarly, in a data warehouse, some queries are like fast cars with small result sets, while other queries are more like trucks and buses, but the workloads are still moving and bringing back information.
However, consider a scenario in which traffic is quite congested and vehicles continuously drive across lanes jockeying for position. This case is similar to the data warehouse when all types of queries across all data types are allowed, without regard for the existing database structure, the way data is stored, and how it will be processed. As a result, there is loss of adoption, lack of trust, and overall failure. These two traffic scenarios demonstrate why understanding workloads is critical for successful data warehouse design.
The evolution of technology in the past decade has provided the flexibility to design and, in many cases, potentially re-architect existing data warehouses with isolated workloads. Isolated workloads enable designers to use the appropriate tools and infrastructure for the purposes they were built to accomplish. For example, unstructured or semi-structured data can be processed in its isolated environment and the result sets can be integrated as needed into the data warehouse.
In addition, complex statistical functions can be processed within the database to generate a result set that the application can use. Plus, in-memory technology can be used to isolate report processing and complex calculations from the reporting application server’s disk drive, and raw data can be loaded at high speeds into memory.
Some popular frameworks are intended to help organizations build their own design for isolating workloads (see Figure 2), such as Bill Inmon’s DW2.0 and Zachman Framework 3.0. In contrast, some options such as an information lifecycle management (ILM) technique to manage workloads or a database workload optimizer are likely well-suited solutions for optimizing only one layer, such as the database.
Figure 2: Workload isolation architecture
The architecture offers scalability and minimized data movement across all layers, sustained performance, and independent scalability within each layer. Critical success factors for this approach include metadata-driven architecture that can deliver a clear understanding of metadata across the different layers and a well-defined data strategy and solution architecture and roadmap. It also includes a clear understanding of system performance and architecture. In addition, this method offers defined goals for performance and service-level optimization.
Organizations that are beginning to think about workload-driven approaches for their data warehouse should ensure that all of their architecture teams are aligned and ready to define the big picture. If you have a thought or a question, please share it in the comments.
This article by Krish Krishnan, president and CEO at Sixth Sense Advisors, Inc., is offered for publication in association with the Big Data and Enterprise Architecture Conference 2013, November 20–22 in Washington, D.C., sponsored by Data Management Forum.