1) What is a Big Data Architecture?
Generally speaking, a Big Data Architecture is one which involves storing and processing data in a file system rather than a relational database. In most cases, this provides a mechanism to summarize and/or clean data prior to loading it into a database for further analysis.
2) Why would you want to store and process data in a file system?
Historically, companies have adopted relational databases to store and process their data. However, there are two issues creeping up on a number of organizations:
a) Data volumes are growing exponentially; it is becoming ever more costly to store all your data in a relational database. Storing and processing data in a file system is relatively cheap in comparison and is highly scalable.
b) To gain a competitive edge, organizations need to bring a greater variety of data sources together to perform meaningful analysis. Relational databases have always been good at analyzing "structured" data, but they often have limitations when dealing with "un-structured" data sources.
3) What is "un-structured" data?
Relational databases store data in a very structured format - they contain tables with records that have a fixed number of columns with specific data types, i.e. the database conforms to a defined data model.
"Un-structured" data sources are ones where the data model is not so well defined. For example, consider the following:
- An employee's CV
- Social media data, e.g. a Twitter feed
- A customer's review
- A server log file
It is probably more accurate to say "semi-structured" data, since all data surely has some structure, even if that structure is quite vague or complex! Either way, one purpose of Big Data is to provide a mechanism for making use of your un-structured data sources. This often means summarizing them, making them more structured and then combining them with other structured data sources for further analysis.
4) How do you report on un-structured data?
In a relational database world, you can't (or at least it is quite difficult). You have to convert your un-structured sources to a more structured format before you can report on them. For example:
- Key word extraction: Picking out the common terms or words mentioned in Twitter feeds or CVs
- Sentiment Analysis: Determining whether the sentiment in a phrase or paragraph is "positive" or "negative"
- Log Parsing: Parsing log files to extract error messages and other useful messages
- Entity Extraction: Identifying nouns, phone numbers and addresses in textual data
These processes would be useful for the following types of Business Intelligence query:
- How many employees do we have who can speak German?
- How many customers in each country have given us negative feedback in the last week?
....and so on
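As a rough illustration of the first two techniques, here is a minimal Python sketch (the word lists and stopwords are toy choices of my own, not a production NLP pipeline):

```python
import re
from collections import Counter

def top_keywords(text, n=3, stopwords={"the", "a", "is", "and", "in"}):
    """Pick out the most common words in a text, ignoring stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(n)]

def naive_sentiment(text):
    """Classify a phrase as positive/negative by counting cue words."""
    positive = {"great", "good", "excellent", "love"}
    negative = {"bad", "poor", "terrible", "hate"}
    words = set(re.findall(r"[a-z']+", text.lower()))
    score = len(words & positive) - len(words & negative)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(top_keywords("German is great and German is useful"))
# ['german', 'great', 'useful']
print(naive_sentiment("The delivery was terrible and the packaging was poor"))
# negative
```

Real systems use far richer dictionaries and language models, but the output is the same idea: structured values (keywords, a sentiment label) that a database can then aggregate.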
5) What is Hadoop and NoSQL?
Apache Hadoop is widely regarded as the main foundation of a Big Data Architecture. Hadoop is open-source; it provides a file system that allows you to store huge volumes of data, and it supports distributed processing across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is "highly available", so losing any single Hadoop node will result in zero data loss, and processing can continue unaffected.
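Hadoop's classic processing model (MapReduce) can be sketched in plain Python. This single-process sketch only illustrates the map/shuffle/reduce phases that Hadoop runs across many nodes; it is not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key (Hadoop does this across the cluster)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big cluster", "data node data"]
pairs = chain.from_iterable(map_phase(l) for l in lines)
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

In a real cluster, each node runs the map phase over its local blocks of the file, which is why the approach scales to huge volumes.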
NoSQL (Not Only SQL) is a type of database that can store data in formats different to the standard "structured" schemas used with relational databases, such as the "key-value" (KV) pair format. As a basic example, here is how the relational and KV formats can differ when storing a person's Id, Name, Date of Birth and Company:
Relational:
(1234, John Smith, 1976-05-14, Oracle Corporation)
Key-Value Pair:
(1234, Name, John Smith)
(1234, DoB, 1976-05-14)
(1234, Company, Oracle Corporation)
Key-value pair format is very useful when the number of columns of information is extremely large or not known in advance. As an example, with Twitter feeds the number of pieces of information supplied with each Tweet can vary (some users will allow their Lat/Long geo-location to be made public whilst others will not).
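The pivot between the two formats can be sketched in a few lines of Python (using the same example record as above; the helper names are my own):

```python
def to_key_value(row_id, row):
    # Pivot one relational row into (id, attribute, value) triples
    return [(row_id, attr, val) for attr, val in row.items()]

def to_relational(triples):
    # Rebuild a row dict from KV triples; missing attributes simply stay absent
    row = {}
    for _, attr, val in triples:
        row[attr] = val
    return row

record = {"Name": "John Smith", "DoB": "1976-05-14", "Company": "Oracle Corporation"}
kv = to_key_value(1234, record)
print(kv)
# [(1234, 'Name', 'John Smith'), (1234, 'DoB', '1976-05-14'),
#  (1234, 'Company', 'Oracle Corporation')]
assert to_relational(kv) == record
```

Notice that a record with extra or missing attributes (a Tweet with or without geo-location, say) just emits more or fewer triples - no schema change is needed.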
6) Can Oracle BI (OBIEE) report directly against Hadoop and NoSQL?
Yes. It is possible for Oracle BI to query Hadoop structures using a Hadoop utility such as Hive. Hive makes it possible to present those structures as relational objects, and therefore you can report against them using standard SQL commands via a JDBC connection.
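To illustrate that reporting pattern without a Hadoop cluster, the sketch below uses Python's built-in sqlite3 as a stand-in for the Hive connection - the point being that once structures are presented relationally, standard SQL works unchanged. A real deployment would connect through Hive's JDBC/ODBC driver or a Hive client library instead:

```python
import sqlite3

# Stand-in for a Hive connection; Hive presents files as tables the same way
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user_id INTEGER, sentiment TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [(1, "negative"), (2, "positive"), (3, "negative")])

# Standard SQL aggregation, exactly as a BI tool would issue it
rows = conn.execute(
    "SELECT sentiment, COUNT(*) FROM tweets GROUP BY sentiment ORDER BY sentiment"
).fetchall()
print(rows)
# [('negative', 2), ('positive', 1)]
```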
No. It is not possible for Oracle BI to report directly against Oracle NoSQL. You will need to write scripts to extract data from NoSQL so that it can then be consumed by a relational database.
7) This all sounds great! Shall we get rid of our Data Warehouse then and just use Hadoop?
No, you don't want to go down this route. When you run queries against Hadoop, they are essentially batch processes that can run massively in parallel. Whilst this is extremely useful, you won't get the response times and levels of concurrency that are delivered by a relational database. Perhaps you can think of it this way:
- Hadoop is designed for huge batch queries, but only a small number of them taking place at any one time
- A relational database is designed for mixed workloads, with many small/medium/large processes all happening at the same time
8) How do I get data out of Hadoop and into an Oracle Database?
Oracle provides "Big Data Connectors" that enable Oracle Data Integrator (ODI) to extract/load data between Hadoop/NoSQL and an Oracle Database. These connectors require additional licenses but are relatively low cost (and anyone can use them).
Oracle also provides "Big Data SQL", which enables you to create "external tables" in the Oracle Database that present Hadoop structures as standard Oracle Database tables. You can then run any type of database SQL query against such a table, and the processing will all be done on the Hadoop file system. This facility, however, is only available to customers who have purchased an Oracle Big Data Appliance (BDA).
9) What is Oracle Big Data Discovery?
Historically, one of the issues with a Big Data Architecture is that you don't know what your data will look like until you've extracted it, loaded it into a relational database and then built some reports.
Oracle Big Data Discovery overcomes this issue by building graphs and other visualizations directly against the structures in Hadoop. The benefit is that it complements your existing Business Intelligence tools by enabling you to explore your data (summarize, join, transform, etc.) at source, to see whether it contains any value and to assist with defining further reporting and processing requirements.