Monday 1 August 2016

BIGDATA, HADOOP Basics

An introduction to Big Data and HADOOP
In today's data-driven world, we see data everywhere, and the rate at which it is generated is skyrocketing. The real challenge is how to store and process this ever-proliferating data. Big data demands more storage capacity as well as high processing power. Sources of big data include sensors, CCTV cameras, satellites, social networks such as Facebook, online shopping, airlines, hospitality data and so on; all of these sources generate huge volumes of data. Roughly 90% of all existing data is said to have been generated after the year 2013.
      Data centers have servers that store enormous amounts of data; such environments are sometimes called sandboxes. Here, processing means modifying the data: the data needs to be fetched from the data center to the local machine (PC), modified or altered there, and then updated back.

Traits of Big Data
Some of the distinct qualities of Big data are,
Larger amounts of information
Variety of Data
Data generated by several sources
Data retained for longer periods
Data utilized by more types of applications

               A survey conducted in 2015 estimated that roughly 500 million tweets, 1.1 million credit card transactions and 4.5 billion Facebook likes take place every day. The scale of data can be recognized through the following table,

Database size – Common characteristics

1 gigabyte
Information generated by traditional enterprise applications
Typically consists of transactional data, stored in a relational database
Uses Structured Query Language (SQL) as the access method

1 terabyte
Standard size for data warehouses
Often aggregated from multiple databases in the 1-100 gigabyte range
Drives enterprise analytics and business intelligence

1 petabyte
Frequently populated by mass data collection, often automated
Contains unstructured information
Serves as a catalyst for new big data related technologies

    Data volumes have been growing rapidly, from gigabytes to terabytes, petabytes and now zettabytes.

            Of the whole volume of data, 70 to 80% is unstructured or semi-structured. The videos, images, text messages and audio shared on Facebook are examples of unstructured data.
               Log files are an example of semi-structured data. When we log in to Google, a log entry is generated and stored on a Google server; a single user can have many Gmail accounts, so such logs accumulate quickly.
When we have a huge amount of data, processing speed drops. Since huge data keeps arriving concurrently, we need correspondingly greater processing power to keep up with it. That is why Hadoop was introduced as a solution for big data.
          For that, Hadoop uses the parallel processing concept, where the huge amount of data is processed by several servers simultaneously.
               Hadoop helps to store and process huge amounts of data in very little time.
History of Hadoop
2003 – GFS (Google File System) paper, for storage
2004 – MapReduce paper, for processing
2006 – HDFS (by Yahoo)
2007 – MapReduce (implemented in Hadoop)

HDFS and MapReduce are the two core concepts of Hadoop.

Who is the inventor of Hadoop?
Doug Cutting. He introduced the Hadoop logo – an elephant.
HDFS stands for Hadoop Distributed File System.
Hadoop is a platform for distributed data processing.
MapReduce is the technique for processing the data stored in HDFS.
Hadoop is an open-source framework developed and overseen by the Apache Software Foundation. It stores and processes huge amounts of data on a cluster of commodity hardware.
HDFS is designed to run on low-cost commodity hardware and is highly fault tolerant. The main architectural goals of HDFS are detection of faults and quick, automatic recovery.

HDFS is tuned to support large files, often terabytes in size. A file, once created, written and closed, need not be changed (a write-once, read-many model).




          HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode (a master server) that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.
         The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file, and this information is stored by the NameNode.
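As an illustration of how a client consults the NameNode for this metadata, here is a minimal Java sketch using the standard Hadoop FileSystem API; the NameNode address and the path /user/demo/sample.txt are assumptions, not part of any particular cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical file

        // These metadata queries are answered by the NameNode; no file data is read here.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // For each block, the NameNode returns the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}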

Data Replication
           HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
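As a sketch of these per-file settings (the file path and the specific values below are assumptions, not recommendations), a file can be created with an explicit replication factor and block size; the replication factor can be changed afterwards, while the block size cannot.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log"); // hypothetical file

        // Create with a replication factor of 2 and a 128 MB block size for this file only.
        FSDataOutputStream out = fs.create(
                file,
                true,                    // overwrite if it already exists
                4096,                    // buffer size in bytes
                (short) 2,               // per-file replication factor
                128L * 1024 * 1024);     // per-file block size
        out.writeUTF("written once, then closed (write-once model)");
        out.close();

        // The replication factor can still be raised or lowered later.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}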

      The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

           Clusters are the large-scale Hadoop environments commonly deployed on a collection of inexpensive, commodity servers. Clusters achieve high degrees of scalability simply by adding extra servers when needed, and they frequently employ replication to increase resistance to failure.
       Real-time data processing is machine-driven interaction with data, often continuous. The results of this type of processing commonly serve as input to subsequent real-time operations.
          A DataNode is responsible for storing data in the Hadoop file system. Data is typically replicated across multiple DataNodes to provide redundancy.
Semi-structured information is often written in Extensible Markup Language (XML); XML files are a good example of semi-structured data. Examples of unstructured data are images, audio, movie clips, free-form text, and so on.
HDFS (Hadoop Distributed File System) is designed for reliability, scalability and large-scale distribution. Written in Java, HDFS employs replication to help increase the reliability of its storage.
         HIVE is a data warehousing infrastructure constructed on top of Hadoop. It offers query, analysis and data summarization capabilities.
     MapReduce is a distributed, parallel processing technique for quickly deriving results from massive amounts of information (see the word-count sketch after these definitions).
        Mirroring is a technique for safeguarding information by copying it across multiple disks. Mirroring can be provided by the disk drive, the operating system or specialized software.
Name Node: 
         Maintains directory details of all files in the Hadoop file system. Clients interact with the NameNode whenever they seek to locate or interact with a given file. The NameNode responds to these inquiries by returning a list of the DataNode servers where the file in question resides.
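To make the MapReduce definition above concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce API; the input and output HDFS paths are assumptions supplied on the command line. Mappers emit (word, 1) pairs in parallel across the cluster, and reducers sum the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each input split, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sums on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job would typically be launched with the hadoop jar command against directories in HDFS.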

      There are primarily two kinds of data handled in big data: structured and unstructured. Structured data comprises table formats and flat files; unstructured data comprises video files, meteorological reports, satellite images, and the like.
Why is big data technology preferred over other technologies? 
      Because it processes data quickly and efficiently. If we dump all the data onto a single machine, that machine has to be very highly configured, and if that server goes down, everything is lost. So, in order to protect the data and keep it available, it is replicated and stored across multiple servers. This is called a "distributed file system". 
Three fundamental characteristics of big data (the three Vs) are, 
Velocity – the speed at which data arrives
Variety – structured and unstructured data
Volume – the amount of data (terabytes, exabytes) (size)

Here, the NameNode is the master server and the DataNodes are the worker machines that hold the data. Files are stored as blocks on the DataNodes, while information (metadata) about those files is stored on the NameNode.
The default replication (number of copies) is 3, but it is adjustable. Replication is what enables fault tolerance.
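As a small illustration of that adjustability (the override value of 2 below is an arbitrary assumption for a test setup), the default is exposed through the dfs.replication property and can be read or overridden by a client:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Override the default replication factor (normally 3) for files created
        // by this client; the cluster-wide default lives in hdfs-site.xml.
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default replication for new files: "
                + fs.getDefaultReplication(new Path("/")));
        fs.close();
    }
}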

HIVE uses an SQL-like syntax called Hive Query Language (HiveQL).
HIVE sits on top of HADOOP.
Big Data is a collection of large datasets, and the data is stored in a distributed environment.
HIVE queries are executed as MapReduce operations.
HIVE is designed for OLAP (online analytical processing).
HIVE stores its schema in a database (the metastore) and the processed data in HDFS.
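To keep all the examples in one language, here is a minimal Java sketch that submits a HiveQL query through the Hive JDBC driver; the HiveServer2 address, the credentials and the sales table are assumptions. Behind the scenes, Hive compiles such a query into MapReduce jobs over data stored in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver for HiveServer2.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 address and test credentials.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; 'sales' is a hypothetical table in the metastore.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getString(2));
            }
        }
    }
}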




