Friday, April 2, 2010

BI Dimensional Modeling - Data Marts

Dimensional Modeling is the process of designing and constructing data marts.

When you extract the data from the Operational Source Systems (read What is BI? to understand the basic elements of a BI Application) you need to make the difference between measurements (facts) and measurement characteristics (dimensions).

To decide if a certain field is a fact or a dimension, ask yourself the following questions:
  • Does it change frequently and it participates in calculations? If yes, than you have a fact.
  • Is it rather constant and participates in query constrains ? If yes, than you have a dimension attribute.
For each business process that you decide to model you store all the found facts in a fact table, and assemble dimension attributes into dimension tables. Your first concern should be to decide which are the business processes of the business, than choose one process at a time, starting with the most important one, and find the facts (fact table attributes) and the dimensions and their attributes. 

If the business is "grocery store retail sales" one of the most important business process is "POS retail sales". Two of the facts for this process are: quantity sold and dollar sales amount. The dimensions could be date, product, and store. Some of the product dimension attributes could be: product description, brand description, package size, etc.

At the intersection of the date, product and store dimensions you find a certain product that was sold in a certain day in a certain store. If you look in the fact table you find the quantity sold of that product and the dollar amount obtained from the sale. By dividing the dollar amount with the quantity sold you find out the price of one product which is rather constant so it is a dimension attribute. You observe that we choose to store as a fact (fact table attribute), the dollar amount which could be computed by multiplying sales quantity and product price. This is of example of  trading  space to speed by reducing computation overhead.

Fact and dimension tables in a dimensional model (The Data Warehouse Toolkit by Ralph Kimball and Margy Ross)

As seen in the figure above, the fact table contains foreign keys to all the dimension tables. A good decision is to use surrogate keys for all fact and dimension tables that do not have business meaning. 

So a data mart models one business process by constructing a star schema composed of a fact table and surrounding dimension tables. Sometimes there is a need to construct more star schemas in order to respond to all possible queries about a business process. TODO (discuss about types of grains: transaction, snapshot, accumulating snapshot)

Here are some characteristics of facts and dimensions.

Fact tables
  •  Fact tables stores mainly additive numeric measurements, but it can also store semi-additive and non additive measurements.
  •  The fact table is in a many to many relationship to its dimension tables (Ex: one measurement can refer different dates and one date can be referred by different measurements).
  • In a dimensional database only facts have composite primary keys, usually composed by the primary keys to the dimension tables. A unique combination of dimension values should point to a single measurement. 
Dimension Tables:
  • Dimension table attributes give the query constraints, groupings and report labels.
  • Dimension tables usually have many textual descriptive columns (can have 50-100 columns).Dimension attributes should be verbose business terminology because they implement the interface to the data warehouse
  • Dimensions are the entry points in a fact table, without them a fact table can not be interpreted.
  • Dimension tables have a single primary key which should be a surrogate key.
  • Dimensions should not be normalized.
Why use Dimensional Modeling?
  •  The dimensions being not normalized results in better user understandability of the design. A normalization example would be storing product brands in a different table and not duplicating data in the product dimension, but linking to the new brands table. Taking this approach to the hole would make it hard to understand for the users.
  • Query performance is also achieved by not normalizing the dimension tables. The space loss is not so great because dimensions usually do not have a large number of rows and take only about 10% of the total data warehouse space.
  • If good granularity is considered for the dimensions, the data warehouse becomes resilient to change, in the sense that it is prepared to answer queries that have not yet been asked. 

1 comment:

  1. Thanks, this was a useful briefing on dimension and fact tables.

    ReplyDelete