Sunday, 11 September 2016

ETL Basic interview questions
Q.What is a staging area? Do we need it?
A.A staging area is a place where temporary tables are held on the data warehouse server. Staging tables feed the work area or fact tables. We need a staging area to hold the data, and to perform data cleansing and merging, before loading the data into the warehouse. The staging area exists as a separate "back room" or "engine room" of the warehouse where the data can be transformed, corrected and prepared for the warehouse. It should be accessible ONLY to the ETL processes working on the data, or to administrators monitoring or managing the ETL process. A typical warehouse generally has three distinct areas:
1.Several source systems which provide data. These can include databases (Oracle, SQL Server, Sybase etc.), files or spreadsheets.
2.A single "staging area", which may use one or more database schemas or file stores (depending upon warehouse load volumes).
3.One or more "visible" data marts, or a single "warehouse presentation area", where data is made visible to end-user queries. This is what many people think of as the warehouse, although the entire system is the warehouse; it depends upon your perspective. The "staging area" is the middle part.
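The three-area flow above can be sketched in plain Python. This is a minimal illustration, not any particular ETL tool's API; all function and field names are hypothetical.

```python
# Sketch of source -> staging -> presentation. The staging step is the
# only place where cleansing happens; end users see only the
# presentation area.

def extract(source_rows):
    """Pull raw rows from a source system as-is."""
    return list(source_rows)

def stage_and_clean(raw_rows):
    """Staging area: cleanse and merge before publishing."""
    cleaned = []
    for row in raw_rows:
        name = (row.get("name") or "").strip().upper()
        if name:  # drop rows failing a basic quality test
            cleaned.append({"id": row["id"], "name": name})
    return cleaned

def publish(staged_rows, presentation):
    """Load cleansed rows into the visible presentation area."""
    presentation.extend(staged_rows)

presentation_area = []
raw = extract([{"id": 1, "name": " alice "}, {"id": 2, "name": None}])
publish(stage_and_clean(raw), presentation_area)
```

Here the row with a missing name never reaches the presentation area; it would typically be routed to a reject table in staging instead.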

Q.What is the purpose of a staging area?
A.
1.One or more database schema(s) or file stores used to “stage” data extracted from the source OLTP systems prior to being published to the “warehouse” where it is visible to end users.
2.Data in the staging area is NOT visible to end users for queries, reports or analysis of any kind. It does not hold completed data ready for querying.
3.It may hold intermediate results (if data is pipelined through a process).
4.Equally, it may hold "state" data: the keys of the data held in the warehouse, used to detect whether incoming data includes new or updated rows (or deleted rows, for that matter).
5.It is likely to be equal in size to (or even larger than) the "presentation area" itself.
6.Although the "state" data (e.g. the last sequence loaded) may be backed up, much of the staging area data is automatically replaced during the ETL load processes and can, with care, avoid adding to the backup effort. The presentation area, however, may need backup in many cases.
7.It may include some metadata, which may be used by analysts or operators monitoring the state of previous loads (e.g. audit information, summary totals of rows loaded etc.).
8.It is likely to hold details of "rejected" entries: data which has failed quality tests and may need correction and re-submission to the ETL process.
9.It is likely to have few indexes (compared to the "presentation area") and to hold data in a fairly normalised form. The presentation area (the part the end users see) is, by comparison, likely to be more highly indexed (mainly bitmap indexes), with highly denormalised tables (the Dimension tables, anyway).
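Point 4 above, using staged "state" data to detect new and updated rows, can be sketched as follows. This is a hypothetical illustration in plain Python; real ETL tools keep this state in staging tables rather than in memory.

```python
# Classify incoming rows as New, Updated or Unchanged by comparing
# them against the keys (and fingerprints) already loaded.

def row_fingerprint(row):
    """Stable fingerprint of a row's contents."""
    return hash(tuple(sorted(row.items())))

def classify_rows(incoming, loaded_state):
    """loaded_state maps key -> fingerprint of the last-loaded row."""
    new, updated, unchanged = [], [], []
    for row in incoming:
        key = row["id"]
        if key not in loaded_state:
            new.append(row)
        elif loaded_state[key] != row_fingerprint(row):
            updated.append(row)
        else:
            unchanged.append(row)
    return new, updated, unchanged

# "State" held in staging: keys already present in the warehouse.
state = {1: row_fingerprint({"id": 1, "name": "alice"})}
new, updated, unchanged = classify_rows(
    [{"id": 1, "name": "alicia"}, {"id": 2, "name": "bob"}], state)
```

Deletions would be detected the other way round: any key in `loaded_state` that never appears in the incoming feed.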
Q.What are active transformations / passive transformations?
A.An active transformation can change the number of rows that pass through it; for example, a Filter transformation removes rows that do not meet the filter condition. A passive transformation does not change the number of rows that pass through it, outputting the same number of rows it was given as input; for example, an Expression transformation performs a calculation on the data and passes all rows through.
1.Active transformations: 
a.Advanced External Procedure
b.Aggregator
c.Application Source Qualifier
d.Filter
e.Joiner
f.Normalizer
g.Rank
h.Router
i.Update Strategy
2.Passive transformation: 
a.Expression
b.External Procedure
c.Mapplet - Input
d.Lookup
e.Sequence generator
f.XML Source Qualifier
g.Mapplet - Output
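The active/passive distinction can be shown with a plain-Python sketch (not Informatica itself): the Filter may shrink the row set, while the Expression always emits exactly as many rows as it receives.

```python
# Active: may change the number of rows that pass through it.
def filter_transform(rows, predicate):
    return [r for r in rows if predicate(r)]

# Passive: one output row per input row, values may change.
def expression_transform(rows, fn):
    return [fn(r) for r in rows]

rows = [{"qty": 5}, {"qty": 0}, {"qty": 3}]

active_out = filter_transform(rows, lambda r: r["qty"] > 0)
passive_out = expression_transform(
    rows, lambda r: {**r, "double_qty": r["qty"] * 2})
```

`active_out` drops the zero-quantity row, so its row count differs from the input; `passive_out` keeps all three rows and merely adds a computed column.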
Q.What is the difference between Power Center & Power Mart?
A.Power Mart is designed for:
1.Low-end warehouses
2.Only local repositories, mainly a desktop environment.
Power Center is designed for:
1.High-end warehouses
2.Global as well as local repositories
3.ERP support.
Q.What is the difference between ETL tools and OLAP tools?
A.An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some data-cleansing processing along the way (e.g. Informatica, DataStage). An OLAP tool is meant for reporting; in OLAP, data is available in a multidimensional model, so you can write simple queries to extract data from the database (e.g. Business Objects, Cognos). In short, ETL tools are used for extraction, transformation and loading of data into a data warehouse / data mart, while OLAP tools are used to create cubes and reports for business analysis from the data warehouse / data mart.
Q.What are the various ETL tools? - Name a few
A.
1.Informatica
2.Abinitio
3.DataStage
4.Cognos Decision Stream 
5.Oracle Warehouse Builder 
6.Business Objects XI (Extreme Insight) 
7.SAP Business Warehouse 
8.SAS Enterprise ETL Server
Q.What are the various transformations available?
A.Transformations play an important role in a data warehouse. Transformations are applied when data is moved from source to destination, depending upon the criteria. Some of the transformations in Informatica are:
1.Aggregator Transformation 
2.Expression Transformation 
3.Filter Transformation 
4.Joiner Transformation 
5.Lookup Transformation 
6.Normalizer Transformation 
7.Rank Transformation 
8.Router Transformation 
9.Sequence Generator Transformation 
10.Stored Procedure Transformation 
11.Sorter Transformation 
12.Update Strategy Transformation 
13.XML Source Qualifier Transformation 
14.Advanced External Procedure Transformation 
15.External Procedure Transformation
Q.What are the different Lookup methods used in Informatica?
A.A connected lookup receives input from the pipeline, sends output back to the pipeline, and can return any number of columns. An unconnected lookup is called from within another transformation and can return only one column.
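The contrast can be sketched in plain Python (hypothetical table and field names, not Informatica's API): the connected style merges whole lookup rows into the pipeline, while the unconnected style is invoked like a function call and yields a single value.

```python
# A tiny dimension table acting as the lookup source.
CUSTOMER_DIM = {101: {"name": "Acme", "region": "EU"}}

def connected_lookup(row):
    """Connected: merge all lookup columns into the pipeline row."""
    match = CUSTOMER_DIM.get(row["cust_id"], {})
    return {**row, **match}

def unconnected_lookup(cust_id, column):
    """Unconnected: called on demand, returns one column's value."""
    return CUSTOMER_DIM.get(cust_id, {}).get(column)
```

Usage: `connected_lookup({"order": 1, "cust_id": 101})` carries both `name` and `region` downstream, whereas `unconnected_lookup(101, "region")` hands back just the one value to its caller.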
Q.Suppose we have some 10,000-odd records in the source system. When we load them into the target, how do we ensure that none of the 10,000 records loaded contains garbage values? How do we test it? We can't check every record, as the number of records is huge.
A.Data Quality checks come in a number of forms:-
1.For FACT table rows, is there a valid lookup against each of the Dimensions?
2.For FACT or DIMENSION rows, for each value:
a.Is it Null when it shouldn't be?
b.Is the data type correct (e.g. Number, Date)?
c.Is the range of values or format correct?
d.Is the row valid with respect to all the other source-system business rules?
There is no magic way of checking the integrity of data. You could simply count the number of rows in and out and assume it's all OK, but for a fact table you will (at the very minimum) need to cope with failed Dimension lookups (typically from late-arriving Dimension rows). The classic solution is to include Dimension keys Zero and Minus One (Null and Invalid) in your Dimension table: null columns are set to the Zero key, and a lookup failure to the Minus One key. You may need to store and re-cycle rows with failed lookups and treat them as updates, so that if the missing Dimension row appears later, the data is corrected.
Otherwise, you have no option: if the incoming data is from an unreliable source, you must check its validity or accept that the warehouse includes wrong results. To do this, profile the data at the source to learn the domain of all the values, the actual number of rows, and the data types. After the data is loaded into the target, the same checks can be repeated, i.e. validating the data values with respect to range, type, etc. and comparing the actual number of rows inserted. If the results before and after match, then we are OK. This process is typically automated in ETL tools.
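The Zero / Minus-One key convention described above can be sketched as a small surrogate-key resolver. This is an illustrative plain-Python version with made-up natural keys, not any specific tool's implementation.

```python
# Natural key -> surrogate key mapping (hypothetical dimension data).
DIM_KEYS = {"A100": 7, "B200": 8}

def resolve_dimension_key(natural_key):
    """Map a fact row's natural key to a dimension surrogate key.

    None      -> 0  (the reserved "Null" dimension row)
    not found -> -1 (the reserved "Invalid" row, e.g. late-arriving)
    otherwise -> the real surrogate key
    """
    if natural_key is None:
        return 0
    return DIM_KEYS.get(natural_key, -1)
```

Fact rows resolved to -1 would be the candidates to store and re-cycle once the missing dimension row arrives.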
Q.What is an ODS (operational data store)?
A.The ODS sits between the staging area and the data warehouse; the data in the ODS is at a low level of granularity. Once data has been populated in the ODS, aggregated data is loaded into the EDW through the ODS. The Operational Data Store holds transactional data and serves as a source for the warehouse: data from the ODS is staged, transformed and then moved to the data warehouse.
