An Introduction to Big Data and HADOOP
In today's data-driven world, data is everywhere, and the rate at which it is generated
is skyrocketing. The most significant challenge is how to store and process this
ever-proliferating data. Big data demands more storage capacity as well as higher
processing power. Sources of big data include sensors, CCTV cameras, satellites,
social networks such as Facebook, online shopping, airlines, hospitality systems,
and so on; all of these sources generate huge volumes of data. Roughly 90% of the
world's data has been generated after the year 2013.
Data centers contain servers that store enormous amounts of data; these are
sometimes called sandboxes. Here, processing means modifying the data: data has to
be fetched from a data center, modified or altered on the local machine (PC), and
then updated back.
Traits of Big Data
Some of the distinct qualities of Big data are,
Larger amounts of information
Variety of Data
Data generated by several sources
Data retained for longer periods
Data utilized by more types of applications
A survey conducted in 2015 found that roughly 500 million tweets, 1.1 million credit card
transactions, and 4.5 billion Facebook likes take place every day. The scale of data can be understood from the following table:
Database size | Common characteristics
1 gigabyte | Information generated by traditional enterprise applications; typically consists of transactional data stored in a relational database; uses Structured Query Language (SQL) as the access method
1 terabyte | Standard size for data warehouses; often aggregated from multiple databases in the 1-100 gigabyte range; drives enterprise analytics and business intelligence
1 petabyte | Frequently populated by mass data collection, often automated; contains unstructured information; serves as a catalyst for new big-data-related technologies
Data has been growing rapidly, from gigabytes to terabytes, petabytes, and even zettabytes.
Of this total volume, roughly 70 to 80% is unstructured or
semi-structured data. The videos, images, text messages, and audio shared on Facebook are
examples of unstructured data.
Log files are an example of semi-structured data: whenever we
log in to Google, a log entry is generated and stored on a Google server, and a
single user can have many Gmail accounts.
When we have huge amounts of data, processing speed drops, and new data keeps
arriving concurrently. To keep up, we need far greater processing power to handle such
enormous data. That is why Hadoop was introduced as a solution for
big data.
Hadoop uses the parallel processing concept, in which the huge
amount of data is processed by several servers simultaneously.
This also lets Hadoop store huge amounts of data in very
little time.
History of Hadoop
2003 – GFS (Google File System), for storing data
2004 – MapReduce, for processing data
2006 – HDFS (at Yahoo)
2007 – Hadoop MapReduce
HDFS and MapReduce are the two core concepts of Hadoop.
Who invented Hadoop? Doug Cutting. He also introduced the Hadoop logo, an elephant.
HDFS is the abbreviation for Hadoop Distributed File System.
Hadoop is a platform that implements the distributed data processing technique.
MapReduce is the technique for processing the data stored in HDFS.
Hadoop is an open source framework developed and overseen
by the Apache Software Foundation. Hadoop stores and processes huge amounts of
data on a cluster of commodity hardware.
HDFS is designed to run on low-cost commodity
hardware and is highly fault tolerant. The main architectural goals of HDFS
are detection of faults and quick, automatic recovery.
HDFS is tuned to support large files, in the range
of terabytes. A file once created, written, and stored need not be changed.
HDFS has a master/slave architecture. An HDFS cluster
consists of a single NameNode (a master server) that manages the file system
namespace and regulates access to files by clients. In addition, there are a number of
DataNodes, usually one per node in the cluster, which manage storage attached
to the nodes that they run on. HDFS exposes a file system namespace and allows
user data to be stored in files. Internally, a file is split into one or more
blocks, and these blocks are stored in a set of DataNodes.
The
NameNode maintains the file system namespace. Any change to the file system
namespace or its properties is recorded by the NameNode. An application can
specify the number of replicas of a file that should be maintained by HDFS. The
number of copies of a file is called the replication factor of that file. This
information is stored by the NameNode.
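To make this concrete, here is a minimal sketch (not from the original post) of a Java client writing a file into HDFS and reading back the replication factor and block size that the NameNode tracks for it. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are placeholder values for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; a real cluster's fs.defaultFS goes here.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");   // placeholder path

        // The client writes through the FileSystem API; HDFS splits the data
        // into blocks and the NameNode records where each block lives.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS!");
        }

        // The NameNode also tracks per-file metadata such as the replication
        // factor and block size, which we can read back here.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        fs.close();
    }
}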
Data Replication
HDFS
is designed to reliably store very large files across machines in a large
cluster. It stores each file as a sequence of blocks; all blocks in a file
except the last block are the same size. The blocks of a file are replicated
for fault tolerance. The block size and replication factor are configurable per
file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed
later. Files in HDFS are write-once and have strictly one writer at any time.
The
NameNode makes all decisions regarding replication of blocks. It periodically
receives a Heartbeat and a Blockreport from each of the DataNodes in the
cluster. Receipt of a Heartbeat implies that the DataNode is functioning
properly. A Blockreport contains a list of all blocks on a DataNode.
Clusters are the large-scale Hadoop
environments commonly deployed on a collection of inexpensive commodity
servers. Clusters scale simply by adding extra
servers as needed, and frequently employ replication to increase resistance to
failure.
Real-time data processing is machine-driven interaction with data, often continuous.
The results of this type of processing commonly serve as input to subsequent real-time
operations.
A data node is responsible for storing
data in the Hadoop file system. Data is typically replicated across multiple data
nodes to provide redundancy.
Semi-structured information is often written in Extensible Markup Language (XML);
XML files are a great example of semi-structured data. Examples of unstructured data
are images, audio, movie clips, and so on.
HDFS (Hadoop Distributed File System) is designed for portability, scalability,
and large-scale distribution. Written in Java, HDFS employs replication to help
increase the reliability of its storage.
HIVE is a data warehousing
infrastructure constructed on top of Hadoop. It offers query, analysis, and data
summarization capabilities.
MapReduce – a distributed, parallel processing technique for quickly deriving
insight from massive amounts of information.
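To illustrate the map and reduce steps, here is a sketch of the classic WordCount job written against the Hadoop Java MapReduce API: the mapper emits (word, 1) pairs and the reducer sums them. The input and output paths are assumed to come from the command line; this is a minimal example, not the only way to write such a job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into words and emit (word, 1).
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for the same word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}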
Mirroring is a technique for
safeguarding information by copying it across multiple disks. The disk drive,
operating system or specialized software can provide mirroring.
Name Node:
Maintains directory details of all files in
the Hadoop file system. Clients interact with the NameNode whenever they seek to
locate or interact with a given file. The NameNode responds to these inquiries
by returning a list of the DataNode servers where the file in question
resides, as the sketch below shows.
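The same lookup can be done programmatically. Below is a small sketch, assuming the placeholder cluster address hdfs://namenode:9000 and the file /user/demo/hello.txt, that asks for the block locations of a file and prints the DataNode hosts holding each block's replicas.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.util.Arrays;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
                                       new Configuration());
        Path file = new Path("/user/demo/hello.txt");   // placeholder path

        FileStatus status = fs.getFileStatus(file);
        // Ask the NameNode for the locations of every block in the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each block is replicated, so several DataNode hosts come back.
            System.out.println("Block " + i + " stored on: "
                    + Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}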
There are two kinds of data primarily handled in big data: structured and
unstructured. Structured data comprises table formats and flat files, while
unstructured data comprises video files, meteorological reports, and satellite images.
Why is big data technology preferred over other
technologies?
Because it processes data quickly and efficiently. Traditionally, we dump all the data on a single
machine, which must have a very high configuration. If that server goes
down, everything is lost. So, to protect the data and ensure its safety, it is replicated and stored on multiple servers. That is called a "distributed file system".
Three fundamental characteristics of Big Data (the three Vs) are:
Velocity – speed of data generation
Variety – structured and unstructured data
Volume – amount (size) of data (terabytes, exabytes)
Here, the NameNode is the master server and the DataNodes are the worker machines.
File data is stored on the DataNodes, while information (metadata) about those files
is stored on the NameNode.
The default replication factor (number of copies) is 3, but it is adjustable, as shown below. Replication is what enables fault tolerance.
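As a rough illustration, the sketch below (with an assumed cluster URI and file path) shows two ways to adjust replication from Java: setting the dfs.replication property for files created by this client, and changing the replication factor of an existing file with setReplication().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default number of copies for files created by this client (HDFS default is 3).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/user/demo/hello.txt");   // placeholder path

        // Change the replication factor of an existing file to 2.
        boolean changed = fs.setReplication(file, (short) 2);
        System.out.println("Replication changed: " + changed);

        fs.close();
    }
}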
HIVE uses an SQL-like syntax called Hive Query Language (HiveQL).
HIVE sits on top of HADOOP.
Big Data is a collection of large datasets.
Data is stored in a distributed environment.
HADOOP
HIVE is used for performing MapReduce operations.
HIVE is designed for OLAP.
HIVE stores its schema in a database and the processed data in HDFS.
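To give a feel for HiveQL, here is a hedged Java sketch that connects to HiveServer2 over JDBC and runs a simple aggregate query. The connection string jdbc:hive2://localhost:10000/default and the web_logs table are assumptions for illustration, not details from this post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (ships with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; Hive translates it into jobs that run over HDFS data.
            ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}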