Data Warehousing Fundamentals
Prepared by Soheib Iraj
Two pioneers shaped the main approaches to data warehouse design: Bill Inmon and Ralph Kimball. Each approach serves a different purpose, and neither is right or wrong. The table below is a short summary of each approach:
|Inmon Enterprise Data Warehouse|Kimball Dimensional Modelling|
|---|---|
|Top-down approach|Bottom-up approach|
|One large enterprise data warehouse|Multiple smaller subsets that together make up the data warehouse|
|Normalized schema (3NF)|Denormalized schema|
|The initial setup requires more time and effort|The initial setup is easier to start|
|More suitable for businesses where the processes are well defined and stable|Suitable for smaller businesses where the processes are more agile|
In this post, I will focus on Kimball’s approach, as it is more common and fits smaller businesses. The initial setup is also quick, and the logic used in the data modelling is easy to understand.
Dimensional schema designs
A star schema is a type of relational database schema built around a central fact table that is surrounded by dimension tables. A model can contain more than one fact table, each with its associated dimensions.
- Has a single table for each dimension
- Each table supports all attributes for that dimension
- Simple queries, fast aggregations, good performance, and simplified business logic; it also feeds cubes well.
- Fact tables are linked to their associated dimension tables via primary/foreign key relationships, so the diagram looks like a star
- Typically a de-normalized solution used in OLAP systems with cubes.
Disadvantage: Loading the de-normalized dimension tables can increase ETL times.
Advantage: Because it is de-normalized, retrieving information needs fewer joins, so queries are quicker and perform better.
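To make the star shape concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`dim_date`, `dim_product`, `fact_sales`), the columns, and the sample rows are all assumptions made up for illustration; they do not come from any particular warehouse.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One de-normalized table per dimension (hypothetical names and columns).
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

-- The central fact table holds foreign keys to the dimensions plus numeric measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240210, 2024, 2);
INSERT INTO dim_product VALUES (1, 'Espresso', 'Coffee'), (2, 'Green tea', 'Tea');
INSERT INTO fact_sales VALUES
    (20240115, 1, 2, 6.0),
    (20240115, 2, 1, 2.5),
    (20240210, 1, 3, 9.0);
""")

# Every dimension is exactly one join away from the fact table.
sales_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

print(sales_by_category)
```

Note that `category` lives directly on `dim_product`, which is what keeps the query to a single join per dimension.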
The snowflake schema consists of one fact table that is connected to many dimension tables, which can be connected to other dimension tables through a many-to-one relationship.
- Dimension tables are normalized
- Typically contains multiple tables per dimension
- Each table contains dimension key, value, and the foreign key value for that parent
- The normalization of dimension tables results in storage savings
The trade-offs are roughly the opposite of the star schema's.
The ETL process is faster, but retrieval is slower because queries have to join multiple dimension tables.
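The extra join can be seen in a small sketch, again with Python's built-in `sqlite3` module. Here the product dimension is normalized into two tables; the names `dim_category`, `dim_product`, and `fact_sales` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is moved out of the product dimension into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)  -- key of the parent
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);

INSERT INTO dim_category VALUES (10, 'Coffee'), (20, 'Tea');
INSERT INTO dim_product VALUES (1, 'Espresso', 10), (2, 'Latte', 10), (3, 'Green tea', 20);
INSERT INTO fact_sales VALUES (1, 6.0), (2, 4.0), (3, 2.5);
""")

# Reaching the category now takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p  ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(sales_by_category)
```

The category name 'Coffee' is stored once rather than once per product, which is where the storage saving comes from.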
A starflake schema is a combination of a star schema and a snowflake schema. Starflake schemas are snowflake schemas where only some of the dimension tables have been denormalized. The dimension tables are normalized to some degree to save space, but the fact tables are kept denormalized to improve query performance.
A galaxy schema (also called a fact constellation) is where two or more fact tables share one or more dimension tables. Sharing the dimensions saves space.
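When fact tables share a dimension, each one can be summarized on its own and the results lined up through the shared dimension. The sketch below assumes two hypothetical fact tables, `fact_sales` and `fact_shipments`, that both reference one `dim_product` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One dimension table, stored once and referenced by both fact tables.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product(product_key),
    quantity_sold INTEGER
);
CREATE TABLE fact_shipments (
    product_key      INTEGER REFERENCES dim_product(product_key),
    quantity_shipped INTEGER
);

INSERT INTO dim_product VALUES (1, 'Espresso'), (2, 'Green tea');
INSERT INTO fact_sales VALUES (1, 2), (1, 3), (2, 1);
INSERT INTO fact_shipments VALUES (1, 10), (2, 4), (2, 6);
""")

# Aggregate each fact table separately, then combine on the shared dimension key.
# Joining the two fact tables row by row would double count.
sold_vs_shipped = conn.execute("""
    SELECT p.product_name, s.sold, h.shipped
    FROM dim_product p
    JOIN (SELECT product_key, SUM(quantity_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(quantity_shipped) AS shipped
          FROM fact_shipments GROUP BY product_key) h
      ON h.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()

print(sold_vs_shipped)
```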
Dimensional modelling process
- Choose the business process
- Declare the grain
- Identify the dimensions
- Identify the facts
Choose the business process
Dimensional modelling builds on a four-step design method that helps ensure the usability of both the dimensional model and the data warehouse. The design is grounded in the actual business process that the data warehouse should cover. Therefore, the first step is to describe the business process on which the model is built.
Declare the grain
After describing the business process, the next step in the design is to declare the grain of the model. The grain of the model is the exact description of what the dimensional model should be focusing on. This could, for instance, be “An individual line item on a customer slip from a retail store”. To clarify what the grain means, you should pick the central process and describe it with one sentence. Furthermore, the grain (sentence) is what you are going to build your dimensions and fact table from. You might find it necessary to go back to this step to alter the grain due to new information gained on what your model is supposed to be able to deliver.
Identify the dimensions
The third step in the design process is to define the dimensions of the model. The dimensions must be defined within the grain from the second step of the 4-step process. Dimensions are the foundation of the fact table: they describe the context in which the data for the fact table is collected. Dimensions are typically nouns such as date, store, and inventory, and they hold the descriptive data. For example, the date dimension could contain attributes such as year, month, and weekday.
Identify the facts
After defining the dimensions, the final step is to identify the numeric facts that will populate each fact table row, alongside the keys that link the row to its dimensions. This step is closely tied to the business users of the system, since the facts are what they analyse when they access the data warehouse. Therefore, most facts are numeric, additive figures such as quantity or cost.
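The four steps can be traced in a small sketch in plain Python. It takes the retail example from the grain step, "an individual line item on a customer slip", and assumes a made-up source format (`slips`) plus a hypothetical helper, `key_for`, that hands out surrogate keys.

```python
# Step 1: the business process is retail sales.
# Hypothetical source data: two sales slips, each with one or more line items.
slips = [
    {"slip_id": "S1", "date": "2024-01-15", "store": "Downtown",
     "lines": [{"product": "Espresso", "quantity": 2, "unit_price": 3.0},
               {"product": "Green tea", "quantity": 1, "unit_price": 2.5}]},
    {"slip_id": "S2", "date": "2024-01-15", "store": "Airport",
     "lines": [{"product": "Espresso", "quantity": 3, "unit_price": 3.0}]},
]


def key_for(dimension, value):
    """Return the surrogate key for value, assigning the next integer if it is new."""
    return dimension.setdefault(value, len(dimension) + 1)


# Step 3: the dimensions implied by the grain (date, store, product).
dim_date, dim_store, dim_product = {}, {}, {}

fact_sales = []
for slip in slips:
    for line in slip["lines"]:
        # Step 2: the grain is one fact row per line item on a slip.
        fact_sales.append({
            "date_key": key_for(dim_date, slip["date"]),
            "store_key": key_for(dim_store, slip["store"]),
            "product_key": key_for(dim_product, line["product"]),
            # Step 4: the numeric, additive facts.
            "quantity": line["quantity"],
            "amount": line["quantity"] * line["unit_price"],
        })

print(len(fact_sales), "fact rows")
print(dim_store)
```

Three line items produce three fact rows, regardless of how many slips they came from. If the grain were declared as one row per slip instead, the product dimension would no longer fit, which is why the grain is fixed before the dimensions.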
Dimensional modelling concepts
- Fact tables and entities
A fact table or fact entity is a table or entity in a star or snowflake schema that stores measures of the business, such as sales, cost of goods, or profit. These measures are almost always numeric.
- Dimension tables and entities
A dimension table or dimension entity is a table or entity in a star, snowflake, or starflake schema that stores details about the facts. For example, a Time dimension table stores the various aspects of time such as year, quarter, month, and day. Dimension tables store descriptive information about the numerical values in a fact table.
A hierarchy is a many-to-one relationship between members of a table or between tables. A hierarchy basically consists of different levels, each corresponding to a dimension attribute.
An outrigger is a dimension table or entity that is joined to other dimension tables in a star schema. Outriggers are used when a dimension table is snowflaked.
Measures define a measurement attribute and are used in fact tables. You can calculate measures by mapping them directly to a numerical value in a column or attribute. An aggregation function summarizes the value of the measure for dimensional analysis.
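As a rough illustration, a measure can be modelled as a pair: the column it maps to and the aggregation function that summarizes it. The sketch below is plain Python; the `measures` mapping, the `summarize` helper, and the sample rows are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical fact rows, with the category already looked up for brevity.
fact_sales = [
    {"category": "Coffee", "quantity": 2, "amount": 6.0},
    {"category": "Coffee", "quantity": 3, "amount": 9.0},
    {"category": "Tea", "quantity": 1, "amount": 2.5},
]

# Each measure maps directly to a numeric column and names its aggregation function.
measures = {
    "total_quantity": ("quantity", sum),
    "total_amount": ("amount", sum),
    "largest_sale": ("amount", max),
}


def summarize(rows, group_by, measures):
    """Group rows by one dimension attribute and aggregate every measure per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)
    return {
        member: {
            name: aggregate(row[column] for row in member_rows)
            for name, (column, aggregate) in measures.items()
        }
        for member, member_rows in groups.items()
    }


summary = summarize(fact_sales, "category", measures)
print(summary)
```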
Fact table types
There are three types of fact tables and entities:
A transaction fact table or transaction fact entity records one row per transaction.
A periodic fact table or periodic fact entity stores one row for a group of transactions that happen over a period of time.
An accumulating fact table or accumulating fact entity stores one row for the entire lifetime of an event. For example, an accumulating fact table could record the lifetime of a credit card application from the time it is sent to the time it is accepted.
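The difference between the three types shows up in how rows are written. The following is a sketch in plain Python with made-up account and application data: the transaction fact gets one row per event, the periodic fact is derived as one row per account per month, and the accumulating fact is a single row that is updated in place. The helper `record_acceptance` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Transaction fact: one row per transaction.
transaction_fact = [
    {"date": "2024-01-05", "account": "A", "amount": 100.0},
    {"date": "2024-01-20", "account": "A", "amount": -30.0},
    {"date": "2024-02-02", "account": "A", "amount": 50.0},
]

# Periodic fact: one row per account per month, summarizing that period's transactions.
monthly_totals = defaultdict(float)
for row in transaction_fact:
    month = row["date"][:7]  # 'YYYY-MM'
    monthly_totals[(row["account"], month)] += row["amount"]

periodic_fact = [
    {"account": account, "month": month, "net_amount": total}
    for (account, month), total in sorted(monthly_totals.items())
]

# Accumulating fact: one row per credit card application, created when it is sent.
application_fact = {
    "application_id": 1,
    "sent_date": "2024-03-01",
    "accepted_date": None,
    "days_to_accept": None,
}


def record_acceptance(row, accepted_date):
    """Update the existing row when the acceptance milestone is reached."""
    row["accepted_date"] = accepted_date
    elapsed = date.fromisoformat(accepted_date) - date.fromisoformat(row["sent_date"])
    row["days_to_accept"] = elapsed.days


record_acceptance(application_fact, "2024-03-11")

print(periodic_fact)
print(application_fact)
```

Only the accumulating fact is ever updated after it is inserted; the other two are append-only in this sketch.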
Slowly changing dimension types
- Type 0: Retain Original
The dimension attribute value never changes, for example a customer’s original credit score.
- Type 1: Overwrite
The old attribute value in the dimension row is overwritten with the new value; type 1 attributes always reflect the most recent assignment, and therefore this technique destroys history.
- Type 2: Add New Row
When changes happen, a new row in the dimension will be added with the updated attribute values. This is why we need Surrogate keys and not just the Natural key in the Dimension table. We also need to add three more columns: EffectiveDate, ExpirationDate, IsCurrent flag.
- Type 3: Add New Attribute
A change adds a new attribute (column) to the dimension table to preserve the old attribute value, while the main attribute is overwritten as in Type 1. The new column is sometimes described as an alternate reality.
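- Type 6: Add Type 1 Attributes to Type 2 Dimension
This is a hybrid technique that supports both behaviours: rows are versioned as in Type 2 to preserve history, while an additional current-value attribute is overwritten on all of those rows as in Type 1.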
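The Type 1 and Type 2 techniques described above can be sketched in plain Python over an in-memory customer dimension. The functions `apply_type1` and `apply_type2`, the column names other than EffectiveDate, ExpirationDate, and IsCurrent, and the use of '9999-12-31' as the open-ended expiration date are all assumptions for illustration; some warehouses use NULL for the open end instead.

```python
# Hypothetical customer dimension: customer_key is the surrogate key,
# customer_id is the natural key from the source system.
dim_customer = [
    {"customer_key": 1, "customer_id": "C100", "city": "Leeds",
     "email": "old@example.com", "EffectiveDate": "2020-01-01",
     "ExpirationDate": "9999-12-31", "IsCurrent": True},
]


def apply_type1(dim_rows, natural_key, attribute, new_value):
    """Type 1: overwrite the attribute on every row for the natural key (history is lost)."""
    for row in dim_rows:
        if row["customer_id"] == natural_key:
            row[attribute] = new_value


def apply_type2(dim_rows, natural_key, attribute, new_value, change_date):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(
        row for row in dim_rows
        if row["customer_id"] == natural_key and row["IsCurrent"]
    )
    current["ExpirationDate"] = change_date
    current["IsCurrent"] = False

    new_row = dict(current)  # copy the remaining attributes unchanged
    new_row.update({
        "customer_key": max(row["customer_key"] for row in dim_rows) + 1,
        attribute: new_value,
        "EffectiveDate": change_date,
        "ExpirationDate": "9999-12-31",
        "IsCurrent": True,
    })
    dim_rows.append(new_row)


# The email is treated as a Type 1 attribute, the city as a Type 2 attribute.
apply_type1(dim_customer, "C100", "email", "new@example.com")
apply_type2(dim_customer, "C100", "city", "York", "2024-06-01")

for row in dim_customer:
    print(row)
```

After the Type 2 change the natural key C100 appears on two rows, which is why fact rows have to reference the surrogate key: facts recorded before the move keep pointing at the Leeds row, and later facts point at the York row.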
(1) IBM Knowledge Centre
(2) Kimball Dimensional Modeling Techniques
(1) Wide World Importers sample database v1.0
(2) AdventureWorks sample database