Sunday, August 1, 2010

WebServices Dictionary

SAAJ = SOAP with Attachments API for Java
SOAP = Simple Object Access Protocol, is a protocol specification for exchanging XML messages in the implementation of Web Services. It usually relies on other Application Layer protocols (most notably Remote Procedure Call (RPC) and HTTP) for message negotiation and transmission

Friday, April 2, 2010

BI Dimensional Modeling - Design Techniques

In this post I will present some basic techniques and rules for successful dimensional modeling. For a basic knowledge on Dimensional Modeling read: BI Dimensional Modeling - Data Marts.

Four-step technique for designing data marts
  1. Select the business process to model. The main idea behind the BI concept is combining data from different departments in the organization into a unique standardized warehouse. When you design the data warehouse you need to be aware of this idea and understand that a business processes should not belong to a certain organizational business department. So if you design a data mart for handling the order business process, it will be used in the sales department as well as in the marketing department.
  2. Declare the grain of the business process.  
    • Answer the question: "How do you describe a single row in the fact table?" Dimensions should not necessarily help define it. It could represent an individual line item on a POS transaction, the sold amount of a product per day, etc.
    • Choose the lowest grain even if such a detail level is not necessary right away. A good grain is how resilience to change is obtained in a data warehouse. Choose a higher grain only when you know for sure that you won't ever need a more granular detail level. 
  1. Choose the dimensions. 
    • Answer the question: "How do business people describe the data that results from the business process?"
    • Dimension attributes should be all possible descriptions that take single values in the context of a measurement.
    • The grain determines all the primary dimensions. Any other additional dimensions should take only one value for a combination of the primary dimensions.
  1. Identify the facts. Now that you've decided which are the grain and the dimensions of the data mart you should identify the measurements that you want to store. 
    • Answer the question: "What are we measuring?". 
    • It's better to store calculated fields (Ex: store directly the total dollar amount do not expect that the user will compute it by multiplying quantity and unit price). In this way we assure that the value is calculated correct and all the users refer to it consistently. 
    • Usually facts should be addable, if not they should be considered candidates for dimension attributes. It does not make sense to add product unit price, so the unit price will be an attribute in the product dimension.
Conformed dimensions technique - Data Bus Matrix 

     

    BI Dimensional Modeling - Data Marts

    Dimensional Modeling is the process of designing and constructing data marts.

    When you extract the data from the Operational Source Systems (read What is BI? to understand the basic elements of a BI Application) you need to make the difference between measurements (facts) and measurement characteristics (dimensions).

    To decide if a certain field is a fact or a dimension, ask yourself the following questions:
    • Does it change frequently and it participates in calculations? If yes, than you have a fact.
    • Is it rather constant and participates in query constrains ? If yes, than you have a dimension attribute.
    For each business process that you decide to model you store all the found facts in a fact table, and assemble dimension attributes into dimension tables. Your first concern should be to decide which are the business processes of the business, than choose one process at a time, starting with the most important one, and find the facts (fact table attributes) and the dimensions and their attributes. 

    If the business is "grocery store retail sales" one of the most important business process is "POS retail sales". Two of the facts for this process are: quantity sold and dollar sales amount. The dimensions could be date, product, and store. Some of the product dimension attributes could be: product description, brand description, package size, etc.

    At the intersection of the date, product and store dimensions you find a certain product that was sold in a certain day in a certain store. If you look in the fact table you find the quantity sold of that product and the dollar amount obtained from the sale. By dividing the dollar amount with the quantity sold you find out the price of one product which is rather constant so it is a dimension attribute. You observe that we choose to store as a fact (fact table attribute), the dollar amount which could be computed by multiplying sales quantity and product price. This is of example of  trading  space to speed by reducing computation overhead.

    Fact and dimension tables in a dimensional model (The Data Warehouse Toolkit by Ralph Kimball and Margy Ross)

    As seen in the figure above, the fact table contains foreign keys to all the dimension tables. A good decision is to use surrogate keys for all fact and dimension tables that do not have business meaning. 

    So a data mart models one business process by constructing a star schema composed of a fact table and surrounding dimension tables. Sometimes there is a need to construct more star schemas in order to respond to all possible queries about a business process. TODO (discuss about types of grains: transaction, snapshot, accumulating snapshot)

    Here are some characteristics of facts and dimensions.

    Fact tables
    •  Fact tables stores mainly additive numeric measurements, but it can also store semi-additive and non additive measurements.
    •  The fact table is in a many to many relationship to its dimension tables (Ex: one measurement can refer different dates and one date can be referred by different measurements).
    • In a dimensional database only facts have composite primary keys, usually composed by the primary keys to the dimension tables. A unique combination of dimension values should point to a single measurement. 
    Dimension Tables:
    • Dimension table attributes give the query constraints, groupings and report labels.
    • Dimension tables usually have many textual descriptive columns (can have 50-100 columns).Dimension attributes should be verbose business terminology because they implement the interface to the data warehouse
    • Dimensions are the entry points in a fact table, without them a fact table can not be interpreted.
    • Dimension tables have a single primary key which should be a surrogate key.
    • Dimensions should not be normalized.
    Why use Dimensional Modeling?
    •  The dimensions being not normalized results in better user understandability of the design. A normalization example would be storing product brands in a different table and not duplicating data in the product dimension, but linking to the new brands table. Taking this approach to the hole would make it hard to understand for the users.
    • Query performance is also achieved by not normalizing the dimension tables. The space loss is not so great because dimensions usually do not have a large number of rows and take only about 10% of the total data warehouse space.
    • If good granularity is considered for the dimensions, the data warehouse becomes resilient to change, in the sense that it is prepared to answer queries that have not yet been asked. 

    Thursday, April 1, 2010

    BI Terminology Dictionary

    Data mart = Dimensional schema for storing data as facts (or measurements) characterized by a set of dimension values that describe the fact. In a relational database, data marts are often called star schemas, because they are represented by a fact table joined to a number of dimension tables around it, which resembles a star.A data mart usually models a single business process.

    OLAP Cube = An OLAP (Online Analytical Processing) object built onto a data mart stored in the database and presented through an OLAP Server like Mondrian, MS Analysis Services, Oracle (nee Hyperion) Essbase.

    ERP (Enterprise Resource Planning) = Application which integrates a community’s data into a single platform and facilitates access to this data. Examples: SAP ERP, Compiere.

    Retail sales = Selling "en detail" in shops or online.

    POS = Point-of-sale

    Data warehouse = A repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis

    What is BI?

    BI stands for Business Intelligence. It describes how a data warehouse is built and how these data is queried and analysed.

    A BI Application allows easy access to an organization's business data assuring data consistency and a powerful way of querying these data. It aims at helping management in decision making by offering a constant view on the status of their organization. More sophisticated BI Applications offer data mining, forecasting and modeling support.

    The reasons for which a Reporting Application would not tackle this job, are:
    • multiple data sources 
    • very large quantity of data 
    • requirement to query also historical data.   
    Multiple data sources made it necessary to design a place where to combine and standardize the data. The issue of querying a very large quantity of data was solved by storing the data in dimensional schemas. A dimensional schema thinks of data as facts or measurements described by a series of dimensions. It often trades space to query speed. Historical data is stored offering the capability of showing the company's evolution.

    A BI Application extracts data from Operational Source Systems (meaning the applications that capture the transactions of a business - an ERP for example), stores the data in a Staging Area where it is transformed (i.e cleaned, combined and standardized) and than loads the data from the Staging Area into the Presentation Area where it is stored in Data Marts or Cubes. This process is named ETL, acronym standing for extract - transformation - load. An Access Tool offers query support on the Data Marts or Cubes.

    Basic elements of a data warehouse (The Data Warehouse Toolkit by Ralph Kimball and Margy Ross)

    For details on Dimensional Modeling explaining how Data Presentation Area is built, read: BI Dimensional Modeling - Data Marts, BI Dimensional Modeling -Design Techniques