Building Fast Data Warehouse Schemas: Part 2

Optimize the design of your data warehouse dimension tables

My last column discussed why and how we use dimensional modeling to organize the data in a data warehouse. To briefly reiterate, the goal of a data warehouse is to give users fast access to data so they can make better business decisions, and to give them a separate area to run queries and to play with data that doesn’t affect the operational database. Dimensional modeling helps us optimize our warehouse data for queries and analysis. A dimensional model has three key parts: the fact tables, the dimension tables, and the summary tables. We have already covered fact tables, so in this column we will look at how to design dimension tables.

Dimension tables: An introduction

A fact table contains things that you measure and record—typically numbers. A dimension table contains descriptive information about facts. A dimension table should contain no facts, only metadata. If a billing system fact table contains item quantity, amount paid, and amount due, its related dimension tables might include product information, customer information, and time or date information. Include as much descriptive information as possible! Useful dimensions have data that describes the who, when, where, what, and why of your facts.

Dimension hierarchies

One of the most important design decisions you face is how to capture hierarchy information in a dimension table. Hierarchies are the logical structures that an organization creates to describe relationships among its data. For example, a product may belong to a product group, which then belongs to a brand, which then belongs to a division. A geographical hierarchy for a customer record may start at the street address and go up through the town, state, zip code, and country. Or, the dimension table may need to account for data from a geographic information system (GIS). Business users often analyze data by comparing facts summarized at different levels of a hierarchy. Dimension table design is an essential part of enabling users to move easily through the data, drilling in and out to see what is going on at different levels. Organizations often categorize their data in several ways, so you may need to account for multiple independent hierarchies. For example, if you have two ways of rolling up products, then you need to support both hierarchies in the dimension tables (see Table 1). I once worked with an organization that had independent offices that reported to regional offices. However, not every state had regional offices. Because the organization needed to be able to analyze data by region and state, we had to support two hierarchies—states that had regional offices, and states that didn’t—in our dimension tables. There are no magic shortcuts to designing dimension hierarchies; it requires working with your business users and understanding their data. A whiteboard and a group meeting are the best tools for this process. Not all dimensions will come from your source systems. You may purchase some dimension data, such as customer demographic data, from external suppliers, and some you may need to add manually. For example, most data warehouses have a time dimension, but this almost always must be created independently. For nearly every data warehouse system I build, I start with a spreadsheet and enter days for the next 10–20 years and the past 10–20 years, depending on how far back data will be loaded. I then add time periods such as the day of the week, holidays, fiscal periods, and seasons, building a hierarchy so people can slice and dice the data easily by time periods.

Changing data in dimension tables

As you’re designing your dimension table, you need to plan for changing dimensional data. When a product name changes, do you show the name of the product when it was sold or the new product name? When a customer moves, do you show their address at the time they bought the product or their current address? There are three established methods of dealing with changing dimensions. First, you can simply overwrite the data: your data will always be current, but you lose historical context. The second method is to insert a new record for each change. This gives you a more accurate picture, but you must relate the old and the new records. One compromise method I have used is to insert a new record for a change only once a year or at specific time periods. Add a date field to both your dimension and fact tables when you do this, so you can see what they look like at the end of a chosen time period. The third method is to add a field to the dimension table that captures old values. This method requires a schema alteration every time a record changes, because you’re going to be adding one new field for each change. To capture previous values, some sites add three or four fields when they build the dimension table. This field-adding method works best for data that’s only likely to change once or twice—such as a last name.

Table 1: This dimension table defines two hierarchies: product and market.

Design tips and techniques

Here are a few things to remember when creating dimension tables:

  • A dimension table needs a generalized key as the primary key data; this becomes the foreign key in the fact table. Informix serial numbers work really well as generalized keys, and I am a fan of using a serial number column on all dimension tables.
  • You may have good descriptive information (such as notes or comments) about a transaction that you do not want to lose, but that isn’t useful for analyzing the transaction data. Putting the descriptive information in the fact table will increase the table size and slow down queries. Instead, create a transaction dimension table with one record for each record in the fact table. That way, you can access the data and keep it out of the way.
  • As you are building your data warehouse, think about which dimensions can be reused in other data warehouses. These are called conforming dimensions, and they are an early step to building an enterprise data warehouse.
  • Generally speaking, a dimension table should be denormalized and flat, with as many columns as needed. Don’t worry about disk savings here: performance and ease of use are your main concerns.

Another design approach is a snowflake dimension, in which you normalize the dimensions to create dimension tables that link to other dimension tables. Most of the time, they just cause problems for people doing queries. One advantage of the snowflake dimension is the space savings, because you normalize your dimension tables. The disadvantage is that there are more tables to join in building a query, and users are less likely to use them. The simpler it is for end users to build a query, the more they will use your data warehouse, and the more business benefit it will provide.

Next up: Avoid the data garbage dump

In my next column, we will look at summary tables. Summary tables are required in a data warehouse for fast query processing and to allow users to drill up and down into the facts. They can make the difference between a successful data warehouse and a “data garbage dump,” where gigabytes of data are stored but unusable. [followbutton username='lesterknutsen' count='false' lang='en' theme='light']

[followbutton username='IBMdatamag' count='false' lang='en' theme='light']