Fundamentals of Data Warehousing


Prepared by Soheib Iraj

There are two pioneering approaches to data warehouse design, introduced by Bill Inmon and Ralph Kimball. Each approach serves a different purpose, and there is no right or wrong choice. The table below is a short summary of each approach:

Inmon (Enterprise Data Warehouse) | Kimball (Dimensional Modelling)
Top-down approach | Bottom-up approach
One big, central enterprise data warehouse | Multiple smaller subsets that together make up the data warehouse
Normalized schema (3NF) | Denormalized schema
The initial setup requires more time and effort | The initial setup is easier to start
More suitable for businesses where processes are well defined and stable | Suitable for smaller businesses where processes are more agile

In this post, I will focus on Kimball’s approach, as it is more common and fits smaller businesses. The initial setup is also quick, and the logic used in the data modelling is easy to understand.

 

Dimensional schema designs

[Star Schema]

A star schema is a type of relational database schema composed of a single, central fact table surrounded by dimension tables. It can, however, have more than one fact table, each with its associated dimensions.

  • Has a single table for each dimension
  • Each table supports all attributes for that dimension
  • Simple queries, fast aggregations, good performance, and simplified business logic; feeds cubes
  • Fact tables are linked to their associated dimension tables via primary/foreign key relationships, so the diagram looks like a star
  • Typically a denormalized design used in OLAP systems with cubes

Disadvantage: can increase ETL times

Advantage: because it is denormalized, retrieving information is quicker and performance is better
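As an illustrative sketch (the table and column names here are made up for this example), a minimal star schema for retail sales might look like this in T-SQL, with each dimension in a single denormalized table and the fact table linking to them by foreign keys:

```sql
-- Dimension table: one row per product, denormalized (category stored inline)
CREATE TABLE DimProduct (
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    ProductName  NVARCHAR(100),
    Category     NVARCHAR(50)                    -- kept in the same table (denormalized)
);

-- Dimension table: one row per calendar date
CREATE TABLE DimDate (
    DateKey      INT PRIMARY KEY,                -- e.g. 20240131
    [Year]       INT,
    [Month]      INT,
    [Weekday]    NVARCHAR(10)
);

-- Fact table: one row per sales line, linked to each dimension via a foreign key
CREATE TABLE FactSales (
    DateKey      INT REFERENCES DimDate (DateKey),
    ProductKey   INT REFERENCES DimProduct (ProductKey),
    Quantity     INT,                             -- additive measure
    SalesAmount  DECIMAL(18, 2)                   -- additive measure
);
```

A query such as total sales by category then needs only one join from the fact table to DimProduct, which is where the star schema gets its retrieval speed.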

[Snowflake Schema]

The snowflake schema consists of one fact table that is connected to many dimension tables, which can be connected to other dimension tables through a many-to-one relationship.

  • Dimension tables are normalized
  • Typically contains multiple tables per dimension
  • Each table contains a dimension key, its values, and a foreign key to its parent table
  • Normalizing the dimension tables saves storage

In many respects, the opposite of the star schema.

Faster ETL process, but slower retrieval performance because of the joins across multiple dimension tables.
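To make the normalization concrete, here is a hypothetical snowflaked product dimension (names are illustrative): the category is split out into its own parent table, and the child dimension holds a foreign key to it:

```sql
-- Parent dimension table: categories normalized out into their own table
CREATE TABLE DimCategory (
    CategoryKey  INT IDENTITY(1,1) PRIMARY KEY,
    CategoryName NVARCHAR(50)
);

-- Child dimension table: its own key, its values, and a foreign key to the parent
CREATE TABLE DimProduct (
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName  NVARCHAR(100),
    CategoryKey  INT REFERENCES DimCategory (CategoryKey)
);
```

Querying sales by category now requires an extra join (fact table to DimProduct to DimCategory), which is exactly the retrieval cost the snowflake schema trades for storage savings.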

[StarFlake Schema]

A starflake schema is a combination of a star schema and a snowflake schema: a snowflake schema in which only some of the dimension tables have been denormalized. The dimension tables are normalized to some degree to save space, while the fact tables are kept denormalized to improve query performance.

[Galaxy Schema]

A galaxy schema is one where two or more fact tables share one or more dimension tables. This saves space.

 

Dimensional modelling process

  1. Choose the business process
  2. Declare the grain
  3. Identify the dimensions
  4. Identify the facts

 

Choose the business process

The process of dimensional modelling builds on a four-step design method that helps ensure the usability of the dimensional model and of the data warehouse. The design is based on the actual business process that the data warehouse should cover. Therefore, the first step in the model is to describe that business process.

Declare the grain

After describing the business process, the next step in the design is to declare the grain of the model. The grain is the exact description of what the dimensional model focuses on. This could, for instance, be “An individual line item on a customer slip from a retail store”. To clarify the grain, pick the central process and describe it in one sentence. The grain (that sentence) is what you will build your dimensions and fact table from. You may find it necessary to come back to this step and alter the grain as you learn more about what your model is supposed to deliver.

Identify the dimensions

The third step in the design process is to define the dimensions of the model. The dimensions must be defined within the grain from the second step. Dimensions are the foundation of the fact table and are where the data for the fact table is collected. Typically, dimensions are nouns like date, store, inventory, etc. These dimensions are where all the descriptive data is stored. For example, the date dimension could contain data such as year, month, and weekday.

Identify the facts

After defining the dimensions, the next step in the process is to identify the numeric facts that will populate each fact table row. This step is closely tied to the business users of the system, since this is where they get access to the data stored in the data warehouse. Most fact table columns are numeric, additive figures, such as quantity or cost per unit.

 

Dimensional modelling concepts

  • Fact tables and entities

A fact table or fact entity is a table or entity in a star or snowflake schema that stores measures of the business, such as sales, cost of goods, or profit. Measures are almost always numeric.

  • Dimension tables and entities

A dimension table or dimension entity is a table or entity in a star, snowflake, or starflake schema that stores details about the facts. For example, a Time dimension table stores the various aspects of time, such as year, quarter, month, and day. Dimension tables store descriptive information about the numeric values in a fact table.

  • Hierarchies

A hierarchy is a many-to-one relationship between members of a table or between tables. A hierarchy basically consists of different levels, each corresponding to a dimension attribute.

  • Outriggers

An outrigger is a dimension table or entity that is joined to other dimension tables in a star schema. Outriggers are used when a dimension table is snowflaked.

  • Measures

Measures define a measurement attribute and are used in fact tables. You can calculate measures by mapping them directly to a numerical value in a column or attribute. An aggregation function summarizes the value of the measure for dimensional analysis.

 

Fact table types

There are three types of fact tables and entities:

  • Transaction

A transaction fact table or transaction fact entity records one row per transaction.

  • Periodic

A periodic fact table or periodic fact entity stores one row for a group of transactions that happen over a period of time.

  • Accumulating

An accumulating fact table or accumulating fact entity stores one row for the entire lifetime of an event. For example, an accumulating fact table might record the lifetime of a credit card application, from the time it is submitted to the time it is accepted.
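As a sketch of the credit card example above (table and column names are illustrative, not from any particular system), an accumulating fact table typically carries one milestone date column per step in the process, and the single row is updated in place as the application progresses:

```sql
CREATE TABLE FactCardApplication (
    ApplicationKey   INT PRIMARY KEY,
    SubmittedDateKey INT,         -- set when the row is first inserted
    ReviewedDateKey  INT NULL,    -- filled in by an UPDATE when that milestone happens
    AcceptedDateKey  INT NULL,    -- filled in when the application is accepted
    DaysToAccept     INT NULL     -- lag measure, recomputed as milestones arrive
);
```

This contrasts with transaction and periodic fact tables, whose rows are normally inserted once and never revisited.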

 

Slowly Changing Dimension tables types

  • Type 0: Retain Original

The dimension attribute value never changes; for example, a customer’s original credit score.

  • Type 1: Overwrite

The old attribute value in the dimension row is overwritten with the new value; type 1 attributes always reflect the most recent assignment, and therefore this technique destroys history.

  • Type 2: Add New Row

When a change happens, a new row with the updated attribute values is added to the dimension. This is why we need surrogate keys, not just the natural key, in the dimension table. We also need to add three more columns: EffectiveDate, ExpirationDate, and an IsCurrent flag.

  • Type 3: Add New Attribute

A change adds a new attribute (column) to the dimension table to preserve the old attribute value. The new column is sometimes described as an “alternate reality”.

  • Type 6: Hybrid (Type 1 + Type 2 + Type 3)

This is a hybrid technique that combines the Type 1, Type 2, and Type 3 approaches in one dimension.
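The Type 2 pattern described above can be sketched in T-SQL as follows (a minimal example with made-up names: a customer moving city). The surrogate key changes with each version of the row, while the natural key stays the same:

```sql
CREATE TABLE DimCustomer (
    CustomerKey    INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key (one per version)
    CustomerID     INT,                           -- natural key (same across versions)
    City           NVARCHAR(50),
    EffectiveDate  DATE,
    ExpirationDate DATE,
    IsCurrent      BIT
);

-- The customer moves city: expire the current row...
UPDATE DimCustomer
SET    ExpirationDate = '2024-06-30',
       IsCurrent      = 0
WHERE  CustomerID = 42
  AND  IsCurrent = 1;

-- ...then add a new row carrying the updated attribute value
INSERT INTO DimCustomer (CustomerID, City, EffectiveDate, ExpirationDate, IsCurrent)
VALUES (42, 'Manchester', '2024-07-01', '9999-12-31', 1);
```

Facts recorded before the move keep pointing at the old surrogate key, which is how history is preserved.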

 

Sources:

(1)        IBM Knowledge Centre

https://www.ibm.com/support/knowledgecenter/SS9UM9_8.5.0/com.ibm.datatools.dimensional.ui.doc/topics/c_dm_concepts_overview.html

(2)        Kimball Dimensional Modeling Techniques


Sample Database(s):

(1)      Wide World Importers sample database v1.0
https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0

(2)        AdventureWorks Database

https://msftdbprodsamples.codeplex.com/

https://technet.microsoft.com/en-us/library/ms124623(v=sql.105).aspx#DW

 

SQL Job Scheduler History

Here is an easy way to get a list of jobs that ran in the past, along with their status. Do you know a better or simpler way to get this output? If so, please use the comments area to share your code.

 

```sql
CREATE PROCEDURE [dbo].[pr_check_JobHistory]
    @dateparam DATETIME = NULL
AS
BEGIN
    IF @dateparam IS NULL
        SET @dateparam = GETDATE();

    SELECT sysjobhistory.server,
           sysjobs.name AS job_name,
           CASE sysjobhistory.run_status
               WHEN 0 THEN 'Failed'
               WHEN 1 THEN 'Succeeded'
               ELSE '???'
           END AS run_status,
           -- run_date is stored as an integer in YYYYMMDD format
           ISNULL(SUBSTRING(CONVERT(VARCHAR(8), run_date), 1, 4) + '-'
                + SUBSTRING(CONVERT(VARCHAR(8), run_date), 5, 2) + '-'
                + SUBSTRING(CONVERT(VARCHAR(8), run_date), 7, 2), '') AS [Run DATE],
           -- run_time is stored as an integer in HHMMSS format;
           -- adding 1000000 pads it to a fixed width so the substrings line up
           ISNULL(SUBSTRING(CONVERT(VARCHAR(7), run_time + 1000000), 2, 2) + ':'
                + SUBSTRING(CONVERT(VARCHAR(7), run_time + 1000000), 4, 2) + ':'
                + SUBSTRING(CONVERT(VARCHAR(7), run_time + 1000000), 6, 2), '') AS [Run TIME],
           -- run_duration uses the same HHMMSS integer format
           ISNULL(SUBSTRING(CONVERT(VARCHAR(7), run_duration + 1000000), 2, 2) + ':'
                + SUBSTRING(CONVERT(VARCHAR(7), run_duration + 1000000), 4, 2) + ':'
                + SUBSTRING(CONVERT(VARCHAR(7), run_duration + 1000000), 6, 2), '') AS [Duration],
           sysjobhistory.step_id,
           sysjobhistory.step_name,
           sysjobhistory.message
    FROM   msdb.dbo.sysjobhistory
           INNER JOIN msdb.dbo.sysjobs
               ON sysjobhistory.job_id = sysjobs.job_id
    WHERE  sysjobhistory.run_date = DATEPART(yyyy, @dateparam) * 10000
                                  + DATEPART(mm, @dateparam) * 100
                                  + DATEPART(dd, @dateparam)
           --AND sysjobs.name = @JobName -- uncomment (and add a @JobName parameter) to show a single job for the specified day
    ORDER BY instance_id DESC
END
```

Error Report Manager: To use Report Builder, you must install .Net Framework 3.5 on this computer

After a successful installation of SQL Server Reporting Services, when you try to access Report Manager via the link http://localhost/Reports, an error message is prompted when you click on Report Builder:

image

The error message is:

To use Report Builder, you must install .Net Framework 3.5 on this computer. To visit the Microsoft Download Centre and install .Net Framework 3.5, click install.

image

 

If you already have .Net Framework 3.5 installed, then you do not need to install it again. Even if you do install it again, you will still be prompted with the same message.

Solution:

This is a problem with Internet Explorer 9. To bypass it, simply click on the Compatibility View icon, then try to access Report Builder again. This time it will work.

image

Now you will see the Report Builder has launched:

image

 

I hope this is helpful. Send me your questions if you have problems, or write in the comments.

Change column type in SQL Server

There are two ways to change a column’s type in SQL Server:

1- By writing a quick query script.

2- By using SSMS (SQL Server Management Studio).

 

Method 1

Run the following code in SQL server:

ALTER TABLE <table name> ALTER COLUMN <column name> <new data type>

Example Code:
ALTER TABLE tblExampleTable ALTER COLUMN ID INT

See Microsoft webpage for full Alter Table syntax details:
http://msdn.microsoft.com/en-us/library/ms190273.aspx

Method 2

To do the changes using the second method follow these steps:

a- Open SSMS

b- Navigate to the table whose column you want to modify

c- Find the column, right-click it, and choose Modify

image

d- A window will open in Edit mode so that you can change the type of each column:

image

e- Click on Save once you are done.

Note that when you click Save, you might see the below warning message, “Saving changes is not permitted”:

image

In this case, you can change some settings to allow this save and avoid the warning. To do this simply follow these steps:

1- In SSMS, click on Tools

2- Choose Options

3- Click on the Designers section from the left side menu

4- Un-check the box for: Prevent saving changes that require table re-creation

image

Now try to save again and this time it will work.

 

Move database files .mdf & .ldf to different physical location

Here is a brief and simple way to move your Data files from one location to another.

This method involves using SSMS (SQL Server Management Studio)

1- Open SSMS and locate the database whose data file directory you want to change

2- Note the current data file directory path. You can do this by right-clicking on the database and choosing Properties; then, from the left-side menu, click on Files to see the path of your current data file directory. Note it down, or press Ctrl + C to copy the path.

image

3- Detach Database by Right clicking on the database and choosing >> Tasks >> Detach

4- A new window will appear. Click on Drop Connections check box and then click OK.

image

5- Using the windows explorer, go to your Data file directory which you copied or wrote down in step 2:

image

6- Copy/Move the data file (.MDF) and the transaction log file (.LDF) to the new physical location. Note that I always recommend copying rather than moving: if a file gets corrupted during the move, you will still have the original to go back to.

image

7- Once you have copied the two files, right-click on the Databases folder and choose Attach. Then, in the newly opened window, click on Add:

image

8- Navigate to the new path where the files were copied, choose the .mdf file, and click on OK:

image

9- Now you can see that both the data file (.MDF) and the transaction log file (.LDF) are shown in the new location. Click OK and you are done (refresh the Databases node for the newly attached database to appear).

image
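If you prefer scripting instead of the SSMS dialogs, the same detach/attach steps can also be done in T-SQL. This is a sketch; the database name and file paths below are examples, so adjust them to your environment:

```sql
USE master;

-- Detach the database (dropping connections first, like the checkbox in step 4)
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
EXEC sp_detach_db @dbname = N'MyDatabase';

-- ...copy the .mdf and .ldf files to the new location (step 6), then re-attach:
CREATE DATABASE MyDatabase
ON (FILENAME = N'D:\NewLocation\MyDatabase.mdf'),
   (FILENAME = N'D:\NewLocation\MyDatabase_log.ldf')
FOR ATTACH;
```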

 

To improve the quality of this user guide, please help me with some feedback or by simply commenting below.

Report Manager Export in Pipe Delimited Format New Option Adding

In one of the projects I worked on, I had to export data from Report Manager in pipe-delimited format. None of the standard export options include pipe-delimited, so here is a solution I found that might be useful to you as well. I would appreciate it if you leave some feedback in the comments area, or if you have any other suggestions.

Solution:

1) The first thing we need to do is edit the configuration file for Reporting Services. This file is called RSReportServer.config

2) By default, this configuration file is located in \Program Files\Microsoft SQL Server\MSSQL.n\Reporting Services\ReportServer

3) Save a copy of the file in case you need to roll back your changes

4) Open the original file in Notepad or a code editor. Do not use Textpad

5) Add the following code to your <Render> node:

<Extension Name="PIPE" Type="Microsoft.ReportingServices.Rendering.DataRenderer.CsvReport,Microsoft.ReportingServices.DataRendering">
    <OverrideNames>
        <Name Language="en-US">TXT (Pipe delimited)</Name>
    </OverrideNames>
    <Configuration>
        <DeviceInfo>
            <FieldDelimiter>|</FieldDelimiter>
            <FileExtension>txt</FileExtension>
        </DeviceInfo>
    </Configuration>
</Extension>

 

6) Save the file

7) Go to the Report Manager http://localhost/reports and open any report

8) Use the export option and you should see TXT (Pipe delimited) listed

image

 

I hope this is helpful to you, and don’t forget to leave some feedback.