Tuesday, 1 November 2016

Data Warehouse Concept

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.


How many stages in Data warehousing?

Data warehouse generally includes two stages
1. ETL
2. Report Generation

ETL
Short for extract, transform, load, three database functions that are combined into one tool
·         Extract — the process of reading data from a source database.
·         Transform — the process of converting the extracted data from its previous form into required form.
·         Load — the process of writing the data into the target database.

ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format to another format
It is used to retrieve the data from various operational databases and is transformed into useful information and finally loaded into Data warehousing system
1.     Informatica
2.     Abinito
3.     Datastage
4.     Bodi
5.     Oracle Warehouse Builders

Report generation
In report generation, OLAP is used (i.e.) online analytical processing. It is a set of specification which allows the client applications in retrieving the data for analytical processing
It is a specialized tool that sits between a database and user in order to provide various analyses of the data stored in the database.
OLAP Tool is a reporting tool which generates the reports that are useful for Decision support for top level management
1.     Business Objects
2.     Cognos
3.     Micro strategy
4.     Hyperion
5.     Oracle Express
6.     Microsoft Analysis Services

 

What are the types of datawarehousing?

EDW (Enterprise data warehousing)
·         It provides a central database for decision support throughout the enterprise
·         It is a collection of DATAMARTS

DATAMART
·         It is a subset of Data warehousing.
·         It is a subject oriented database which supports the needs of individuals depts. in an organizations.
·         It is called high performance query structure.
·         It supports particular line of business like sales, marketing etc…

ODS (Operational data store)
·         It is defined as an integrated view of operational database designed to support operational monitoring
·         It is a collection of operational data sources designed to support Transaction processing
·         Data is refreshed near real-time and used for business activity
·         It is an intermediate between the OLTP and OLAP which helps to create an instance reports.

 

What are the types of Approach in DWH?

Top down approach: first we need to develop EDW then form that EDW we develop data mart
OLTP  —>  ETL  —>  DWH  —>  Data mart  —>  OLAP
Advantages 
·         Cost of initial planning & design is high
·         Takes longer duration of more than an year

Bottom up approach: first we need to develop data mart then we integrate these data mart into EDW
OLTP  —>  ETL  —>  Data mart  —>  DWH  —>  OLAP
Advantages 
·         Planning & Designing the Data Marts without waiting for the Global warehouse design
·         Immediate results from the data marts
·         Tends to take less time to implement
·         Errors in critical modules are detected earlier.
·         Benefits are realized in the early phases.
·         It is a Best Approach.

Why need staging area for DWH?

1. Staging area needs to clean operational data before loading into data warehouse.
2. Cleaning in the sense your merging data which comes from different source.
3. it’s the area where most of the ETL is done

Data Cleansing
·         It is used  to remove duplication’s
·         It is used to correct wrong email addresses
·         It is used to identify missing data
·         It used to convert the data types
·         It is used to capitalize name & addresses. 

Types of systems

Data Mart: - A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.
  • It is a high performance query structure(HPQS)
  • Fast retrieval of data is possible through Datamart


Differences between a Data Warehouse and a Data Mart
Category
Data Warehouse
Data Mart
Scope
Corporate
Line of Business (LOB)
Subject
Multiple
Single subject
Data Sources
Many
Few
Size (typical)
100 GB-TB+
< 100 GB
Implementation Time
Months to years
Months

Metadata is data that describes other data. Meta is a prefix that in most information technology usages means "an underlying definition or description." Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.

Data integration means combining Data coming from different sources and providing users with a unified view of these data.

Dimensional vs. normalized approach for storage of data
There are three or more leading approaches to storing data in a data warehouse — the most important approaches are the dimensional approach and the normalized approach.

The dimensional approach, whose supporters are referred to as “Kimballites”, believe in Ralph Kimball’s approach in which it is stated that the data warehouse should be modeled using a Dimensional Model/star schema. The normalized approach, also called the 3NF model (Third Normal Form), whose supporters are referred to as “Inmonites”, believe in Bill Inmon's approach in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.

In a dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
Features of Data Warehouse:
  • A DWH is designed to support decision making process. Hence it is known as Decision Support System(DSS)
  • It analyzes the business transactions in order to support decision making
  • DWH is a container data
  • DWH is the process of developing a data warehouse
  • DWH is a read only database because it is designed to read the data for analysis but not for transactional processing
  • DWH is a historical database because it can store historical business information
  • Father of Data warehousing   W.H.Inmon. In 1987, he designed a data warehouse.

No comments:

Post a Comment

Big Data FQA

1) What is a Big Data Architecture? Generally speaking, a Big Data Architecture is one which involves the storing and processing of data...