Wednesday, 25 October 2017

Big Data FAQ

1) What is a Big Data Architecture?

Generally speaking, a Big Data Architecture is one which involves storing and processing data in a file system rather than a relational database. In most cases, this provides a mechanism to summarize and/or clean data prior to loading it into a database for further analysis.

2) Why would you want to store and process data in a file system?

Historically, companies have adopted relational databases to store and process their data. However, there are two issues which are creeping up on a number of organizations:

a) Data volumes are growing exponentially; it is becoming ever more costly to store all your data in a relational database. Storing and processing data in a file system is relatively cheap in comparison and is highly scalable.

b) To gain a competitive edge, organizations need to bring a greater variety of data sources together to perform meaningful analysis. Relational databases have always been good at analyzing "structured" data, but they often have limitations when dealing with "un-structured" data sources.

3) What is "un-structured" data?

Relational databases store data in a very structured format - they contain tables with records that have a fixed number of columns with specific data types, i.e. the database is defined by a data model.
"Un-structured" data sources are ones where the data-model is not so well defined. For example, consider the following:
  • An employee's CV
  • Social media data e.g. a Twitter feed
  • A customer's review
  • A server log file
It is probably more accurate to say "semi-structured" data, since all data surely has some structure, even if that structure is vague or complex! Either way, one purpose of Big Data is to provide a mechanism for making use of your un-structured data sources. This often means summarizing the data, making it more structured, and then combining it with other structured data sources for further analysis.

4) How do you report on un-structured data?

In a relational database world, you can't (or at least it is quite difficult). You have to convert your un-structured sources to a more structured format before you can report on them. For example:
  • Keyword Extraction: Picking out the common terms or words mentioned in Twitter feeds or CVs
  • Sentiment Analysis: Determining whether the sentiment in a phrase or paragraph is "positive" or "negative"
  • Log Parsing: Parsing log files to extract error messages and other useful information (a small sketch follows below)
  • Entity Extraction: Identifying nouns, phone numbers and addresses in textual data
These processes would be useful for the following types of Business Intelligence query:
  • How many employees do we have who can speak German?
  • How many customers in each country have given us negative feedback in the last week?
    ....and so on
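As a flavour of the Log Parsing example above, here is a minimal shell sketch (the file name server.log, and the assumption that each line begins with a date, are hypothetical):

$ grep "ERROR" server.log > errors.txt # keep only the lines containing ERROR
$ cut -d " " -f1 errors.txt | sort | uniq -c # count the error lines per day, assuming the date is the first space-delimited field

The structured output (a date and a count) could then be loaded into a relational database for reporting.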

5) What are Hadoop and NoSQL?

Apache Hadoop is widely regarded as the main foundation of a Big Data Architecture. Hadoop is open source: it provides a file system (HDFS) that allows you to store huge volumes of data, and it supports distributed processing across clusters of computers. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is also "highly available": because data is replicated across nodes, losing any single Hadoop node results in zero data loss and processing can continue unaffected.

NoSQL ("Not Only SQL") is a type of database that can store data in formats different from the standard "structured" schemas used by relational databases, such as the "key-value" (KV) pair format. As a basic example, here is how the relational and KV formats can differ when storing a person's Id, Name, Date of Birth and Company:

Relational:
(1234, John Smith, 1976-05-14, Oracle Corporation)
Key-Value Pair:
(1234, Name, John Smith)
(1234, DoB, 1976-05-14)
(1234, Company, Oracle Corporation)

The key-value pair format is very useful when the number of columns of information is extremely large or not known in advance. For example, with Twitter feeds the number of pieces of information supplied with each Tweet can vary (some users allow their Lat/Long geo-locations to be made public whilst others do not).

6) Can Oracle BI (OBIEE) report directly against Hadoop and NoSQL?

Yes. It is possible for Oracle BI to query Hadoop structures using a Hadoop utility such as Hive. Hive makes it possible to present structures as relational objects, so you can report against them using standard SQL commands via a JDBC connection.
No. It is not possible for Oracle BI to report directly against Oracle NoSQL. You will need to write scripts to extract data from NoSQL into a relational database, where it can then be consumed.
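As a rough sketch of the Hive route (the table name web_logs and the HDFS path are hypothetical, and this assumes the Hive command-line client is available):

$ hive -e "CREATE EXTERNAL TABLE web_logs (log_line STRING) LOCATION '/data/weblogs';" # expose a directory of files as a table
$ hive -e "SELECT COUNT(*) FROM web_logs WHERE log_line LIKE '%Error%';" # query it with ordinary SQL

Oracle BI would reach the same web_logs table through Hive's JDBC driver rather than the command line.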

7) This all sounds great! Shall we get rid of our Data Warehouse then and just use Hadoop?

No, you don't want to go down this route. When you run queries against Hadoop, they are essentially batch processes that can run massively in parallel. Whilst this is extremely useful, you won't get the response times and levels of concurrency that are delivered by a relational database. Perhaps you can think of it this way:
  • Hadoop is designed for huge batch queries, but only a small number of them taking place at any one time
  • A relational database is designed for mixed workloads with many small/medium/large processes all happening at the same time

8) How do I get data out of Hadoop and into an Oracle Database?

Oracle provides "Big Data Connectors" that enable Oracle Data Integrator (ODI) to extract/load data between Hadoop/NoSQL and an Oracle Database. These connectors require additional licenses but are relatively low cost (and anyone can use them).
Oracle also provides "Big Data SQL", which enables you to create "external tables" in the Oracle Database that present Hadoop structures as standard Oracle Database tables. You can then run any type of database SQL query against such a table, and the processing will all be done on the Hadoop file system. This facility, however, is only available to customers who have purchased an Oracle Big Data Appliance (BDA).
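As a hedged sketch of what such an external table might look like (the credentials and table name are hypothetical; this assumes a Hive table called web_logs already exists, that the DEFAULT_DIR directory object is defined, and that the ORACLE_HIVE access driver's defaults map the Oracle table to the Hive table of the same name):

$ sqlplus scott/tiger <<'EOF'
CREATE TABLE web_logs (log_line VARCHAR2(4000))
ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR)
REJECT LIMIT UNLIMITED;
SELECT COUNT(*) FROM web_logs WHERE log_line LIKE '%Error%';
EOF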

9) What is Oracle Big Data Discovery?

Historically, one of the issues with a Big Data Architecture is that you don't know what your data will look like until you've extracted it, loaded it into a relational database and built some reports.
Oracle Big Data Discovery overcomes this issue by building graphs and other visualizations directly against the structures in Hadoop. The benefit is that it complements your existing Business Intelligence tools by enabling you to explore your data (summarize, join, transform, etc.) at source, to see whether it contains any value and to assist with defining further reporting and processing requirements.


Tuesday, 2 May 2017

Grep Command in UNIX and Linux

Grep is one of the most frequently used commands in UNIX (and Linux). Most of us use grep just for finding words in a file, but the real power of grep comes from using its options and regular expressions. You can analyze large sets of log files with the help of the grep command.

Grep stands for Globally search for a Regular Expression and Print (from the ed editor command g/re/p).

The basic syntax of the grep command is:

grep [options] pattern [list of files]

Let's see some practical examples of the grep command.

1. Running the last executed grep command

This saves a lot of time if you are executing the same command again and again.

!grep
This repeats the last executed grep command, displaying the command itself and printing its result set on the terminal.

2. Search for a string in a file

This is the basic usage of the grep command. It searches for the given string in the specified file.

grep "Error" logfile.txt
This searches for the string "Error" in the log file and prints all the lines that contain the word "Error".

3. Searching for a string in multiple files.

grep "string" file1 file2
grep "string" file_pattern
This is also basic usage of the grep command. You can manually specify the list of files you want to search, or you can specify a file name pattern using shell wildcards (e.g. *.log).

4. Case insensitive search

The -i option enables you to search for a string case-insensitively in the given file. It matches words like "UNIX", "Unix" and "unix".

grep -i "UNix" file.txt

5. Specifying the search string as a regular expression pattern.


grep "^[0-9].*" file.txt
This will search for lines which start with a number. Regular expressions are a huge topic and I am not covering them here; this example just shows how to use regular expressions with grep.
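A couple more hedged illustrations of regular expression patterns with grep (the file name is a placeholder):

$ grep "[0-9]\{3\}" file.txt # lines containing three digits in a row
$ grep -E "cat|dog" file.txt # extended regular expressions (-E): lines containing either "cat" or "dog"
$ grep "^[A-Z].*\.$" file.txt # lines that start with a capital letter and end with a full stop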

6. Checking for whole words in a file.

By default, grep matches the given string/pattern even if it is found as a substring of a longer word. The -w option makes grep match only whole words.

grep -w "world" file.txt

7. Displaying the lines before the match.

Sometimes, when you are searching for an error in a log file, it is good to see the lines around the error line to understand the cause of the error.

grep -B 2 "Error" file.txt
This prints the matched lines along with the two lines before each match.

8. Displaying the lines after the match.

grep -A 3 "Error" file.txt
This displays the matched lines along with the three lines after each match.

9. Displaying the lines around the match

grep -C 5 "Error" file.txt
This displays the matched lines along with the five lines before and after each match.

10. Searching for a string in all files recursively

You can search for a string in all the files under the current directory and its sub-directories with the help of the -r option.

grep -r "string" *

11. Inverting the pattern match

You can display the lines that do not match the specified search string pattern using the -v option.

grep -v "string" file.txt

12. Displaying the non-empty lines

You can remove the blank lines using the grep command.

grep -v "^$" file.txt

13. Displaying the count of matches.

We can find the number of lines that match the given string/pattern using the -c option.

grep -c "string" file.txt

14. Displaying the file names that match the pattern.

We can display just the names of the files that contain the given string/pattern.

grep -l "string" *

15. Displaying the file names that do not contain the pattern.

We can display the names of the files that do not contain the given string/pattern.

grep -L "string" *

16. Displaying only the matched pattern.

By default, grep displays the entire line containing the matched string. We can make grep display only the matched string by using the -o option.

grep -o "string" file.txt

17. Displaying the line numbers.

We can make the grep command display the line number of each line that contains the matched string using the -n option.

grep -n "string" file.txt

18. Displaying the position of the matched string in the file

The -b option makes the grep command display the byte offset of the matched string within the file.

grep -o -b "string" file.txt

19. Matching the lines that start with a string

The ^ regular expression pattern specifies the start of a line. This can be used in grep to match the lines which start with the given string or pattern.

grep "^start" file.txt

20. Matching the lines that end with a string

The $ regular expression pattern specifies the end of a line. This can be used in grep to match the lines which end with the given string or pattern.

grep "end$" file.txt


What is UNIX

UNIX is a multi-user, multitasking operating system that can run on various hardware platforms.

1. What command can you use to display the first 3 lines of text from a file and how does it work?
There are two commands that can be used to complete this task:
        head -3 test.txt – this uses the "head" command, along with the "-3" parameter, which indicates the number of lines to be displayed;
        sed '4,$ d' test.txt – this command uses the sed stream editor to perform the task. If the command were simply "sed test.txt", the whole file would be displayed; however, in our example the delete command (d) is used to make sed delete everything between the 4th and the last line (denoted by the $ address), leaving only the first 3 lines of the file. It is important to mention that sed does not actually delete the lines from the file itself, but only from the output.
2. How can you remove the 7th line from a file?
The easiest way is by using the following command: sed -i '7 d' test.txt
Unlike the previous sed command, this one also has the "-i" parameter, which tells sed to make the change in the actual file.
3. How do you reverse a string?
You can reverse a string by piping together two simple commands: echo "Mary" | rev
The first command generates the output "Mary", which becomes the input to the rev command, which returns the reverse: "yraM".
AIX (Advanced Interactive eXecutive)
AIX is an open operating system from IBM that is based on a version of UNIX.
AIX/ESA was designed for IBM's System/390 or large server hardware platform.
BASIC FILE HANDLING
ls
- list files in directory; use with options
-l (long format)
-a (list . files too)
-r (reverse order)
-t (newest appears first)
-d (list the directory entry itself, not its contents)
-i (show inodes)
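These options can be combined. For example:

$ ls -ltr /tmp # long listing of /tmp, oldest first, so the newest files appear at the bottom
$ ls -la $HOME # long listing of your home directory, including the hidden . files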

more
- used to control output by pages - like the DOS /p argument with dir. e.g.
$ ls -l /etc | more # page through a long directory listing one screen at a time

pwd
- show present working directory. e.g.
$ pwd
/usr/live/data/epx/vss2

cd
- change the current working directory (without arguments, this is the same as $ cd $HOME or $ cd ~)

mv <Source> <Destination>
- move a file from one location to another. e.g.
$ mv /tmp/jon/handycommands.txt . # move handycommands.txt from /tmp/jon to the current directory
$ mv -f vihelp vihelp.txt # move (rename) file vihelp to vihelp.txt (forced)

Options
·         -f (to force the move to occur)
·         -r (to recursively move a directory)
·         -p (to attempt to preserve permissions when moving)

rm <FileName>
- removes a file. e.g.
$ rm /tmp/jon/*.unl # remove all *.unl files in /tmp/jon
$ rm -r /tmp/jon/usr # remove /tmp/jon/usr and all files under it recursively
 Options
·         -f (to force the removal of the file)
·         -r (to recursively remove a directory)

du <Directory>
- recursively lists directories and their sizes. e.g.
$ du /etc # list recursively all directories under /etc
712 /etc/objrepos
64 /etc/security/audit
536 /etc/security
104 /etc/uucp
8 /etc/vg
232 /etc/lpp/diagnostics/data
240 /etc/lpp/diagnostics
248 /etc/lpp
16 /etc/aliasesDB
16 /etc/acct
8 /etc/ncs
8 /etc/sm
8 /etc/sm.bak
4384 /etc
The sizes displayed are in 512-byte blocks. To view them in 1024-byte (1 KB) blocks, use the -k option.

lp -d<PrinterName> <FileName>
- send a file to a printer. e.g.
$ lp -dhplas14 /etc/motd # send file /etc/motd to printer hplas14
$ lp /etc/motd # send file /etc/motd to the default printer

chmod <Octal Permissions> <file(s)>
- change file permissions. e.g.
$ chmod 666 handycommands
changes the permissions (as seen by ls -l) of the file handycommands to -rw-rw-rw-.
r = 4, w = 2, x = 1. In the above example, if we wanted read and write permission on a file then we would use r + w = 6. If we then wanted the file to have read-write permission for User, Group and All, the permissions would be 666 - hence the command above.
$ chmod 711 a.out
changes the permissions to -rwx--x--x.
Additional explanation of file permissions and the User/Group/All meanings is given in the description of ls -l.
You may specify chmod differently - by expressing the permission changes in terms of + and - modifiers. For example
$ chmod u+s /usr/bin/su
will set the "setuid bit" on su, which allows whoever runs it to gain the same access to files as the owner of it. What it means is "add the s permission for the user (owner)". So a file that started off with permissions of "-rwxr-xr-x" will change to "-rwsr-xr-x" when the above command is executed. You may use "u" for owner permissions, "g" for group permissions and "a" for all.
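For example (the file name and listing output below are illustrative):

$ ls -l myscript
-rwxr-xr-x 1 jon staff 120 Oct 25 10:00 myscript
$ chmod u+s myscript # add the setuid bit for the owner
$ ls -l myscript
-rwsr-xr-x 1 jon staff 120 Oct 25 10:00 myscript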

chown <Login Name> <file(s)>
- change ownership of a file. Must be done as root. e.g.
$ chown informix *.dat # change all files ending .dat to be owned by informix

chgrp <Group Name> <file(s)>
- change group ownership of a file. Must be done as root. e.g.
$ chgrp sys /.netrc # change file /.netrc to be owned by the group sys

mvdir <Source Directory> <Destination Directory>
- move a directory - can only be done within a volume group. To move a directory between volume groups you need to use mv -r, or:
find <dirname> -print | cpio -pdumv <dirname2>
rm -r <dirname>

cpdir <Source Directory> <Destination Directory>
- copy a directory. See mvdir

rmdir <Directory>
- this is crap (it only removes empty directories) - use rm -r instead

mkdir <Directory>
- Creates a directory. e.g.
$ mkdir /tmp/jon/ # create directory called /tmp/jon/ 

head -<Number> <FileName>
- prints out the first few lines of a file to the screen. Specify a number to indicate how many lines (the default is 10). e.g. if you sent something to a labels printer and it wasn't lined up, you could print the first few labels again using:
$ head -45 label1.out | lp -dlocal1

tail -<Number> <FileName>
- prints out the end of a file. Very similar to head, but with a very useful option -f which allows you to follow the end of a file as it is being written. e.g.
$ tail -f vlink.log # follow the end of vlink.log as it grows

wc -<options> <FileName>
- Word Count (wc) program. Counts the number of characters, words and lines in a file or in a pipe. Options:
·         -l (lines)
·         -c (chars)
·         -w (words)
To find out how many files there are in a directory, do ls | wc -l

split -<Lines> <FileName>
- splits a file into several smaller files. e.g.
$ split -5000 CALLS1 # will split file CALLS1 into smaller files of 5000 lines each, called xaa, xab, xac, etc.

cut -d<Delimiter> -f<Fields> <FileName>
- cuts the file or pipe into various fields. e.g.
$ cut -d "|" -f1,2,3 active.unl # will take the file active.unl, which is delimited by pipe symbols, and print the first 3 fields
Options:
·         -d <delimiter>
·         -f <fields>
Not too useful when you want the delimiter to be merely white space (the delimiter defaults to tab). Alternatively, you can 'cut' up files by character position (useful with a fixed-width file). e.g.
$ cut -c8-28 barcode.txt # would cut columns 8 to 28 out of the barcode.txt file
paste <File1> <File2>
- paste will join two files together horizontally, rather than just tacking one on to the end of the other. e.g. if you had one file with two lines:
Name:
Employee Number:
and another file with the lines:
Fred Bloggs
E666
then by doing:
$ paste file1 file2 > file3 # this would then produce (in file3).
Name: Fred Bloggs
Employee Number: E666

who
- list users who are currently logged on (useful with the option 'am i' - i.e. 'who am i' or the separate 'whoami' command)

exit
- end current shell process. If you log in, then type this command, it will return you to login. ^D (control-D) and logout (in some shells) does the same.

rlogin <HostName>
- log in to a remote machine, e.g.
$ rlogin hollandrs # log in to machine called hollandrs
Useful with the -l option to specify a username - e.g.
$ rlogin cityrs -l ismsdev # log in to machine cityrs as user ismsdev
For further info about the trust network, see the .rhosts file and /etc/hosts.equiv.

telnet
- very similar to rlogin, except that it is more flexible (just type telnet with no arguments and then '?' to see the options). Useful because you can telnet to a specific port.

ftp
- File Transfer Protocol - a quick and easy method for transferring files between machines. The .netrc file in your $HOME directory holds initial commands. (Type ftp without arguments and then '?' to see the options.)

rcp
- Remote copy. Copies a file from one UNIX box to another, as long as they trust each other (see the .rhosts file or /etc/hosts.equiv). Options
·         -f (to force the copy to occur)
·         -r (to recursively copy a directory)
·         -p (to attempt to preserve permissions when copying)
logout
syntax: logout
This command should be issued to terminate your session and allow the next user to access the computer. The logout command will not execute if there are any stopped jobs; to log out, you must first kill or activate those jobs. The logout command will not terminate jobs running in the background, so it is imperative that you remember to kill all background jobs before logging out.
kill
syntax: kill [-signal] processID

This command is used to terminate a process. For the processID, you can specify a percent sign followed by the job ID returned by the jobs command, or a process ID returned by the ps command. Some processes may ignore the SIGTERM signal sent by default by the kill command; to terminate these processes, you must send the KILL signal (e.g. kill -9).
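A short hedged example (the process ID 12345 is illustrative):

$ sleep 300 & # start a background job
[1] 12345
$ jobs # list current jobs with their job IDs
[1] + Running    sleep 300 &
$ kill %1 # terminate by job ID (sends SIGTERM)
$ kill -9 12345 # or by process ID with the KILL signal, for processes that ignore SIGTERM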
