Building Fast Data Warehouse Schemas: Part 1
Optimize your data warehouse design, starting with the fact tables
Why build a dimensional data model or star schema for a data warehouse? Why not use your existing transaction system? In this and the next couple of articles, we will answer those questions by taking a look at how to design a data warehouse schema.
The goal of a data warehouse is to give users fast access to data so they can make better business decisions. The data needs to be correct, and users need to be able to query and analyze it. In fact, the more users are able to analyze data (I call this playing with data), the better they learn and gain new insight.
These goals are not easy to accomplish with a regular transactional operational database, where the priority is to get records entered fast. You don’t want users burning up CPU cycles running queries and analysis on an operational database, and you don’t want those queries locking records that need to be open for data entry. On one of my very first data warehouse jobs, we decided to test our query tool against the operational system. Within minutes we got a call from the DBA asking why we were locking so many records in the database!
Data warehouses give users a separate area where they can do queries and play with data without affecting the operational database. And now that we’ve created a separate copy of data for the sole purpose of analysis, we can look at other ways to optimize it for the intended function, starting with how it’s organized.
Just the facts
Dimensional modeling is the design methodology used to organize the data in the data warehouse. There are three key parts of a dimensional model: the fact table, the dimension tables, and the summary tables.
Facts are things that you measure and record. Typically, a fact is a number. When I look at a new schema, the first thing I do is highlight all the columns that are decimal or number. That is my starting point for determining what the facts are in a schema.Figure 1: Transactional schema
Dimensions are reference information—they provide descriptive information about the facts. Let’s take a very simple example, as seen in Figure 1. This is a billing system with a table called bill, a table called payment, and a table called apply payment. In the bill table, we have information about what was on the bill and the bill amount and the quantity. In the payment table, we have information about the payment that came in and a payment amount. In the apply payment table, we have information about which payment was applied to which bill. There are various combinations of payments for paying bills: one payment could pay multiple bills or pay only a part of a bill, or three payments could pay one bill.
This schema has four fields that are facts: quantity, bill amount, apply amount, and pay amount. All other fields are dimensional information describing the transactions.Figure 2: Facts and dimensions
Figure 2 shows what this example might look like in a data warehouse schema. There is a fact table that contains the facts. The payment dimension table describes things like the check number and the bank into which the check was deposited. A customer dimension table describes the customer, a product dimension table describes the product that was there, the bill dimension table describes the bill (bill date, terms, and so forth), and almost every data warehouse has a time dimension.
Basically what you’re out to do in dimensional modeling is transform the schema from a transactional system (Figure 1), to something like Figure 2, where you have a dimensional model in place.
But the end users who are analyzing data do not want to see a schema. They want to see what I call “the big spreadsheet in the sky.” They don’t care about dimensions or about star schemas. What they want to see is a huge spreadsheet with all the data adding up correctly at the bottom.
This is the main reason we use a dimensional model: it is easier to create this spreadsheet in one query. If you do a SELECT from every table and every dimension in your dimensional model, you would get the big spreadsheet in the sky (see Figure 3).Figure 3: The user view—a big spreadsheet
The best way to describe the result of a dimensional model to people who are new to it is to view it as a big spreadsheet that has every dimension field and every fact field pre-joined together. It is then possible to work backward from that view to figure out which elements are facts and which are dimensions.
Fact table design
A fact table contains the numbers. It is central to the data warehouse. In addition to numbers, a fact table has a key to all the dimensional tables, so you can join from the fact table to the dimension tables.
Facts are number fields that can be added up most of the time. Some facts are different every time you measure them, such as temperature. I worked on an interesting data warehouse for a company that was manufacturing an item very sensitive to room temperature. The company needed to anticipate three days in advance what the temperature would be at the time of production, or the product would be defective. So the fact they were most interested in was a projection of what the temperature would be in three days, based on temperature changes over the last three days.
A key concept in building the fact table is to decide on the “grain,” or granularity, of the fact. Grain is the level of detail that each record in the fact table will have, and there are a couple ways of looking at it. One type is a snapshot grain. A snapshot grain is usually something like an end-of-day snapshot, a monthly snapshot, or a weekly snapshot. The data is accurate to a point in time and static after that point in time.
The second type of grain is a transactional grain, which shows every little transaction that goes on. While a snapshot grain might show the state of a customer at the end of the day, a transactional grain will show every debit and credit transaction that took place with the customer over time.
The goal is to get the lowest possible grain, so always capture records at the lowest transactional level. It is much easier and faster to summarize than it is to reload a lower level of detail. You cannot drill down into detail when you don’t have the data, so it is far better to put the lowest level of detail in the data warehouse from the beginning.
The starting point for a data warehouse design is the fact table, and you should now have an overview of what to look for and include in a fact table. In the next issue, we will look at dimension and summary tables.
|[followbutton username='IBMdatamag' count='false' lang='en' theme='light']|