Slowly changing
dimensions are the dimensions in which the data changes slowly, rather than
changing regularly on a time basis.
For example, you may have a customer dimension in a retail domain. Let say the customer is in India and every month he does some shopping. Now creating the sales report for the customers is easy. Now assume that the customer is transferred to United States and he does shopping there.
There are many approaches how to deal
with SCD. The most popular are:
Type 0 - The passive method
Type 1 - Overwriting the old
value
Type 2 - Creating a new
additional record
Type 3 - Adding a new column
Type 4 - Using historical table
Type 6 - Combine approaches of
types 1,2,3 (1+2+3=6)
Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as it was first time inserted, others may be overwritten.
Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as it was first time inserted, others may be overwritten.
How to record such a change in your customer dimension?
You could sum or average the sales done by the customers. In this case you won't get the exact comparison of the sales done by the customers. As the customer salary is increased after the transfer, he/she might do more shopping in United States compared to in India. If you sum the total sales, then the sales done by the customer might look stronger even if it is good. You can create a second customer record and treat the transferred customer as the new customer. However this will create problems too.
You could sum or average the sales done by the customers. In this case you won't get the exact comparison of the sales done by the customers. As the customer salary is increased after the transfer, he/she might do more shopping in United States compared to in India. If you sum the total sales, then the sales done by the customer might look stronger even if it is good. You can create a second customer record and treat the transferred customer as the new customer. However this will create problems too.
Handling these issues involves 3 main SCD management methodologies.They are:
Slowly changing dimension
Type 1: SCD type 1 methodology is used when there is
no need to store historical data in the dimension table. This method overwrites
the old data in the dimension table with the new data. It is used to correct
data errors in the dimension.
As an example, i have the customer table with the below data.
As an example, i have the customer table with the below data.
surrogate_key customer_id
customer_name Location
------------------------------------------------------------------------------
1
1 Marspton
Illions
Here the customer name is misspell. It should be Marston instead of Marspton. If you use type1 method, it just simply overwrites the data. The data in the updated table will be.
surrogate_key customer_id
customer_name Location
------------------------------------------------------------------------------
1
1 Marston
Illions
The advantage of type1 is ease of maintenance and less space occupied. The disadvantage is that there is no historical data kept in the data warehouse.
Slowly changing dimension
Type 2: SCD type 2 stores the entire history the
data in the dimension table. With type 2 we can store unlimited history in the
dimension table. In type 2, you can store the data in three different ways.
They are
- Versioning
- Flagging
- Effective
Date
SCD Type 2
Versioning:
In versioning method, a sequence number is used to represent the change. The
latest sequence number always represents the current row and the previous
sequence numbers represents the past data.
As an example, let’s use the same example of customer who changes the location. Initially the customer is in Illions location and the data in dimension table will look as.
surrogate_key customer_id
customer_name Location Version
-------------------------------------------------------------------------------------------
1
1
Marston
Illions 1
The customer moves from Illions to Seattle and the version number will be incremented. The dimension table will look as
surrogate_key customer_id
customer_name Location Version
-------------------------------------------------------------------------------------------
1
1
Marston
Illions 1
2
1
Marston
Seattle 2
Now again if the customer is moved to another location, a new record will be inserted into the dimension table with the next version number.
SCD Type 2 Flagging: In flagging method, a flag column is created in the dimension table. The current record will have the flag value as 1 and the previous records will have the flag as 0.
Now for the first time, the customer dimension will look as.
surrogate_key customer_id
customer_name Location Flag
---------------------------------------------------------------------------------------
1
1
Marston
Illions 1
Now when the customer moves to a new location, the old records will be updated with flag value as 0 and the latest record will have the flag value as 1.
surrogate_key customer_id customer_name
Location Version
-------------------------------------------------------------------------------------------
1
1
Marston
Illions 0
2 1 Marston Seattle 1
2 1 Marston Seattle 1
SCD Type 2 Effective Date: In Effective Date method, the period of the change is tracked using the start_date and end_date columns in the dimension table.
surrogate_key customer_id
customer_name Location Start_date
End_date
----------------------------------------------------------------------------------------------------------------
1
1
Marston
Illions 01/03/2010 20/02/2011
2
1
Marston
Seattle 21/02/2011 NULL
The NULL in the End_Date indicates the current version of the data and the remaining records indicate the past data.
Slowly changing dimension
Type 3: In type 3 method, only the current
status and previous status of the row is maintained in the table. To track
these changes two separate columns are created in the table. The customer dimension
table in the type 3 method will look as
surrogate_key customer_id
customer_name Current_Location previous_location
---------------------------------------------------------------------------------------------------------------------
1
1
Marston Illions
NULL
Let say, the customer moves from Illions to Seattle and the updated table will look as
surrogate_key customer_id
customer_name Current_Location previous_location
---------------------------------------------------------------------------------------------------------------------
1
1
Marston Seattle
Illions
Now again if the customer moves from seattle to NewYork, then the updated table will be
surrogate_key customer_id
customer_name Current_Location previous_location
---------------------------------------------------------------------------------------------------------------------
1
1
Marston NewYork
Seattle
The type 3 method will have limited history and it depends on the number of columns you create.
Type 4 - Using historical table. In
this method a separate historical table is used to track all dimension's
attribute historical changes for each of the dimension. The 'main' dimension
table keeps only the current data e.g. customer and customer_history
tables.
Current table:
Current table:
Customer_ID
|
Customer_Name
|
Customer_Type
|
1
|
Cust_1
|
Corporate
|
Historical table:
Customer_ID
|
Customer_Name
|
Customer_Type
|
Start_Date
|
End_Date
|
1
|
Cust_1
|
Retail
|
01-01-2010
|
21-07-2010
|
1
|
Cust_1
|
Oher
|
22-07-2010
|
17-05-2012
|
1
|
Cust_1
|
Corporate
|
18-05-2012
|
31-12-9999
|
Type 6 - Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:
current_type - for keeping current value of
the attribute. All history records for given item of attribute have the same
current value.
historical_type - for keeping historical
value of the attribute. All history records for given item of attribute could
have different values.
start_date - for keeping start date of
'effective date' of attribute's history.
end_date - for keeping end date of 'effective
date' of attribute's history.
current_flag - for keeping information about the most recent record.
In this method to capture attribute change we add a new record as in type 2. The current_type information is overwritten with the new one as in type 1. We store the history in a historical_column as in type 3.
In this method to capture attribute change we add a new record as in type 2. The current_type information is overwritten with the new one as in type 1. We store the history in a historical_column as in type 3.
Customer_ID
|
Customer_Name
|
Current_Type
|
Historical_Type
|
Start_Date
|
End_Date
|
Current_Flag
|
1
|
Cust_1
|
Corporate
|
Retail
|
01-01-2010
|
21-07-2010
|
N
|
2
|
Cust_1
|
Corporate
|
Other
|
22-07-2010
|
17-05-2012
|
N
|
3
|
Cust_1
|
Corporate
|
Corporate
|
18-05-2012
|
31-12-9999
|
Y
|
No comments:
Post a Comment