Wednesday, 25 October 2017

Big Data FAQ

1) What is a Big Data Architecture?

Generally speaking, a Big Data Architecture is one that stores and processes data in a file system rather than a relational database. In most cases, this provides a mechanism to summarize and/or clean data before loading it into a database for further analysis.

2) Why would you want to store and process data in a file system?

Historically, companies have adopted relational databases to store and process their data. However, there are two issues which are creeping up on a number of organizations:

a) Data volumes are growing exponentially; it is becoming ever more costly to store all your data in a relational database. Storing and processing data in a file system is cheap by comparison and highly scalable.

b) To gain a competitive edge, organizations need to bring a greater variety of data sources together to perform meaningful analysis. Relational databases have always been good at analyzing "structured" data, but they often have limitations when dealing with "un-structured" data sources.

3) What is "un-structured" data?

Relational databases store data in a very structured format: they contain tables with records that have a fixed number of columns with specific data types, i.e. the database conforms to a data-model.
"Un-structured" data sources are ones where the data-model is not so well defined. For example, consider the following:
  • An employee's CV
  • Social media data e.g. a Twitter feed
  • A customer's review
  • A server log file
It is probably more accurate to say "semi-structured" data, since all data surely has some structure even if the structure is quite vague or complex! But either way, one purpose of Big Data is to provide a mechanism for making use of your un-structured data sources. This often means summarizing it, making it more structured and then combining it with other structured data sources for further analysis.

4) How do you report on un-structured data?

In a relational database world, you can't (or at least it is quite difficult). You have to convert your un-structured sources to a more structured format before you can report on them. For example:
  • Key word extraction: Picking out the common terms or words mentioned in Twitter feeds or CVs
  • Sentiment Analysis: Determining whether the sentiment in a phrase or paragraph is "positive" or "negative"
  • Log Parsing: Parsing log files to extract error messages and other useful messages
  • Entity Extraction: Identifying nouns, phone numbers and addresses in textual data
These processes would be useful for the following types of Business Intelligence query:
  • How many employees do we have who can speak German?
  • How many customers in each country have given us negative feedback in the last week?
    ....and so on
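
As a rough illustration of the key word extraction and log parsing steps listed above, here is a minimal Python sketch. The stop-word list, the log line layout ("date time LEVEL message") and the function names are all assumptions made for the example; real feeds and logs will need their own rules.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "and", "to", "in", "of", "i"}

def extract_keywords(texts, top_n=3):
    """Count the most common non-stop-words across a list of free-text strings."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Assumed log layout: "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\S+) (\S+) (INFO|WARN|ERROR) (.*)$")

def parse_log(lines):
    """Turn raw log lines into structured records, skipping lines that do not match."""
    records = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            date, time, level, message = match.groups()
            records.append(
                {"date": date, "time": time, "level": level, "message": message}
            )
    return records
```

Once the text has been reduced to counts or records like these, it has a fixed shape and can be loaded into a table and reported on like any other structured source.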

5) What are Hadoop and NoSQL?

Apache Hadoop is widely regarded as the main foundation of a Big Data Architecture. Hadoop is open-source; it provides a file system that allows you to store huge volumes of data, and it supports distributed processing across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is also designed to be "highly-available": data is replicated across nodes, so losing any single Hadoop node should not result in data loss, and processing can continue.
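
To give a feel for what distributed processing means here, the following is a toy Python sketch of the map-and-reduce pattern that Hadoop popularized. It is not the Hadoop API: the function names are invented for the example, and threads on one machine stand in for the nodes of a cluster. Each partition is counted independently (the "map" step) and the partial results are then merged (the "reduce" step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(lines):
    """Map step: count words within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine the partial counts from two partitions."""
    return left + right

def word_count(partitions, workers=4):
    """Run the map step on each partition concurrently, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())

# Each inner list plays the role of the data held on one node.
partitions = [["big data", "big cluster"], ["data node"]]
totals = word_count(partitions)
```

Because each partition is processed without reference to the others, adding more machines lets you process more partitions at once, which is the property that makes this approach scale.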

NoSQL (Not Only SQL) is a type of database that can store data in formats other than the standard "structured" schemas used with relational databases, such as "key-value" (KV) pair format. As a basic example, here is how relational and KV formats can differ when storing a person's Id, Name, Date of Birth and Company:

Relational:
(1234, John Smith, 1976-05-14, Oracle Corporation)
Key-Value Pair:
(1234, Name, John Smith)
(1234, DoB, 1976-05-14)
(1234, Company, Oracle Corporation)

Key-value pair format is very useful when the number of columns of information is extremely large or not known. As an example, with Twitter feeds the number of pieces of information supplied with each Tweet can vary (some users will allow their Lat/Long geo-locations to be made public whilst others do not).
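
The Tweet case can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular NoSQL product stores data; the function names and the sample Tweets are made up for the example.

```python
def row_to_kv(record_id, row):
    """Split one record (a dict of attribute -> value) into (id, key, value) triples.

    Attributes with no value are simply omitted, so sparse records cost nothing.
    """
    return [(record_id, key, value) for key, value in row.items() if value is not None]

def kv_to_rows(triples):
    """Group (id, key, value) triples back into one dict per id."""
    rows = {}
    for record_id, key, value in triples:
        rows.setdefault(record_id, {})[key] = value
    return rows

# Hypothetical Tweets: the first user shares a geo-location, the second does not.
tweet_1 = {"User": "alice", "Text": "Hello from London", "Lat": 51.5, "Long": -0.1}
tweet_2 = {"User": "bob", "Text": "Hello", "Lat": None, "Long": None}

triples = row_to_kv(1, tweet_1) + row_to_kv(2, tweet_2)
```

The first Tweet produces four triples and the second only two; no empty Lat/Long columns need to be reserved for users who do not share their location.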

6) Can Oracle BI (OBIEE) report directly against Hadoop and NoSQL?

Yes. It is possible for Oracle BI to query Hadoop structures using a Hadoop utility such as Hive. Hive makes it possible to present structures as relational objects, so you can report against them using standard SQL commands via a JDBC connection.
No. It is not possible for Oracle BI to report against Oracle NoSQL. You will need to write scripts to extract data from NoSQL so that it can then be consumed by a relational database.
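
As a rough idea of what such an extract script might do, the Python sketch below pivots key-value triples into fixed-column rows and loads them into a table. Everything here is a stand-in: the triples are an in-memory list rather than a real NoSQL read, SQLite takes the place of the relational database, and the table and function names are hypothetical.

```python
import sqlite3

def load_kv_into_table(conn, triples, columns):
    """Pivot (id, key, value) triples into fixed-column rows and insert them.

    Keys missing for an id become NULL; keys not listed in `columns` are dropped.
    `columns` must be trusted identifiers, since they are placed into the SQL text.
    """
    rows = {}
    for record_id, key, value in triples:
        if key in columns:
            rows.setdefault(record_id, {})[key] = value

    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in range(len(columns) + 1))
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS person (id INTEGER PRIMARY KEY, {column_list})"
    )
    conn.executemany(
        f"INSERT INTO person (id, {column_list}) VALUES ({placeholders})",
        [
            (record_id, *(attrs.get(c) for c in columns))
            for record_id, attrs in rows.items()
        ],
    )
    conn.commit()

# In-memory stand-in for data extracted from a NoSQL store.
triples = [
    (1234, "Name", "John Smith"),
    (1234, "Company", "Oracle Corporation"),
    (5678, "Name", "Jane Doe"),
]
conn = sqlite3.connect(":memory:")
load_kv_into_table(conn, triples, ["Name", "Company"])
```

Once the rows are in a table, any reporting tool that speaks SQL can query them in the usual way.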

7) This all sounds great! Shall we get rid of our Data Warehouse then and just use Hadoop?

No, you don't want to go down this route. When you run queries against Hadoop, they are essentially batch processes that can run massively in parallel. Whilst this is extremely useful, you won't get the response times and levels of concurrency that are delivered by a relational database. Perhaps you can think of it this way:
  • Hadoop is designed for huge batch queries, but only a small number of them taking place at any one time
  • A relational database is designed for mixed workloads with many small/medium/large processes all happening at the same time

8) How do I get data out of Hadoop and into an Oracle Database?

Oracle provides "Big Data Connectors" that enable Oracle Data Integrator (ODI) to extract/load data between Hadoop/NoSQL and an Oracle Database. These connectors require additional licenses but are relatively low cost (and anyone can use them).
Oracle also provides "Big Data SQL", which enables you to create "external tables" in the Oracle Database that present Hadoop structures as standard Oracle Database tables. So you can run any type of database SQL query against a table and the processing will all be done on the Hadoop file system. This facility, however, is only available for customers who have purchased an Oracle Big Data Appliance (BDA).

9) What is Oracle Big Data Discovery?

Historically, one of the issues with a Big Data Architecture is that you don't know what your data will look like until you've extracted it, loaded it into a relational database and then built some reports.
Oracle Big Data Discovery overcomes this issue by building graphs and other visualizations directly against the structures in Hadoop. The benefit is that it complements your existing Business Intelligence tools by enabling you to explore your data (summarize, join, transform etc.) at source to see whether it contains any value and to assist with defining further reporting and processing requirements.


