Friday 22 July 2016

Learn Hadoop In 10 Minutes


Apache HADOOP is a framework used to develop data processing applications that are executed in a distributed computing environment.

Just as data resides in the local file system of a personal computer, in Hadoop data resides in a distributed file system, called the Hadoop Distributed File System (HDFS).

The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) that contain the data.

This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop's HDFS.
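To make this concrete, here is a minimal sketch of such a Java program reading a file from HDFS through Hadoop's FileSystem API. The class name HdfsRead and the path /data/input.txt are hypothetical examples, and the sketch assumes cluster settings (such as fs.defaultFS) are available from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings (e.g. fs.defaultFS) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // /data/input.txt is a hypothetical example path in HDFS
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // process each line of the HDFS file
            }
        }
    }
}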

HADOOP is an open source software framework. Applications built using HADOOP are run on large data sets distributed across clusters of commodity computers.

Commodity computers are cheap and widely available. They make it possible to achieve greater computational power at low cost.

Did you know? A computer cluster consists of multiple processing units (storage disk + processor) that are connected to each other and act as a single system.

Components of Hadoop

[Diagram: components of the Hadoop ecosystem]

Apache Hadoop consists of two sub-projects –

  1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. MapReduce programs can process enormous amounts of data in parallel on large clusters of computation nodes (a minimal example follows this list).
  2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage side of Hadoop applications, and MapReduce applications consume data from it. HDFS creates multiple replicas of each data block and distributes them across the compute nodes in the cluster. This distribution enables reliable and extremely rapid computation.
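To make the MapReduce model concrete, below is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API: the Mapper emits (word, 1) pairs and the Reducer sums them. Driver and job-submission code are omitted for brevity, so treat this as an illustrative fragment rather than a complete application.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: for every word in an input line, emit the pair (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each distinct word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}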

Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume and ZooKeeper.


Features Of 'Hadoop'

• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are well suited for its analysis. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This is the data locality concept, which helps increase the efficiency of Hadoop-based applications.

• Scalability

HADOOP clusters can easily be scaled by adding cluster nodes, allowing them to keep pace with growing Big Data volumes. Scaling does not require modifications to application logic.

• Fault Tolerance

The HADOOP ecosystem has a provision to replicate input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the copy stored on another node.
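From the client side, replication is visible through the same FileSystem API shown earlier. In the sketch below, the class name ReplicationDemo and the path /data/input.txt are hypothetical examples; the default replication factor in stock HDFS configurations is 3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input.txt"); // hypothetical example path

        // Each HDFS block of this file is stored this many times (default: 3)
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication factor: " + current);

        // Ask HDFS to keep five copies of every block of this file instead
        fs.setReplication(file, (short) 5);
    }
}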

Network Topology In Hadoop

The topology (arrangement) of the network affects the performance of a Hadoop cluster as it grows. In addition to performance, one also needs to care about high availability and failure handling. To achieve this, the formation of a Hadoop cluster makes use of network topology.

Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth is difficult in practice, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is taken as the important factor in forming the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.

A Hadoop cluster consists of data centers, racks and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies with their location. That is, as the sketch after this list illustrates, the available bandwidth becomes lower as we move from-

  • Processes on the same node
  • Different nodes on the same rack
  • Nodes on different racks of the same data center
  • Nodes in different data centers
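Hadoop conventionally writes a node's location as a /data-center/rack/node path. The toy sketch below (my own illustration, not Hadoop's actual implementation) applies the closest-common-ancestor rule described above and reproduces the usual distances of 0, 2, 4 and 6 for the four cases in the list.

public class TopologyDistance {

    // Distance = hops from each location up to their closest common ancestor
    static int distance(String a, String b) {
        String[] pa = a.split("/");
        String[] pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++; // walk down the tree while path components agree
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6: different data centers
    }
}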


Hadoop Big Data online training by the best institute at the best price - real-time working faculty



FOR FREE DEMO contact us at:

Email : training@apex-online-it-training.com

Phone/WhatsApp : +91-(868) 622-2553


COURSE HIGHLIGHTS

  1. Instructor-led online training by real-time working faculty

  2. Software or lab access provided for unlimited hours

  3. Good-quality material provided

  4. Real-time assignments to gain hands-on experience

  5. Interview- and job-oriented training by exceptionally good trainers

Course Content 

HADOOP Using Cloudera

Development & Admin Course

Introduction to Big Data and Hadoop:-

•  Big Data Introduction
•  Hadoop Introduction
•  What is Hadoop? Why Hadoop?
•  Hadoop History
•  Different Types of Components in Hadoop: HDFS, MapReduce, PIG, Hive, SQOOP, HBASE, OOZIE, Flume, Zookeeper and so on…
•  What is the Scope of Hadoop?

Deep Dive into HDFS (for Storing the Data):-

•  Introduction to HDFS
•  HDFS Design
•  HDFS's Role in Hadoop
•  Features of HDFS
•  Daemons of Hadoop and Their Functionality
o   Name Node
o   Secondary Name Node
o   Job Tracker
o   Data Node
o   Task Tracker
•  Anatomy of a File Write
•  Anatomy of a File Read
•  Network Topology
o   Nodes
o   Racks
o   Data Centers
•  Parallel Copying using DistCp
•  Basic Configuration for HDFS
•  Data Organization
o   Blocks
o   Replication
•  Rack Awareness
•  Heartbeat Signal
•  How to Store Data in HDFS
•  How to Read Data from HDFS
•  Accessing HDFS (Introduction to Basic UNIX Commands)
•  CLI Commands

MapReduce using Java (Processing the Data):-

•  Introduction to MapReduce
•  MapReduce Architecture
•  Data Flow in MapReduce
o   Splits
o   Mapper
o   Partitioning
o   Sort and Shuffle
o   Combiner
o   Reducer
•  Understanding the Difference Between a Block and an InputSplit
•  Role of the RecordReader
•  Basic Configuration of MapReduce
•  MapReduce Life Cycle
o   Driver Code
o   Mapper
o   Reducer
•  How MapReduce Works
•  Writing and Executing a Basic MapReduce Program using Java
•  Submission & Initialization of a MapReduce Job
•  File Input/Output Formats in MapReduce Jobs
o   Text Input Format
o   Key Value Input Format
o   Sequence File Input Format
o   NLine Input Format
•  Joins
o   Map-side Joins
o   Reduce-side Joins
•  Word Count Example
•  Partitioner MapReduce Program
•  Side Data Distribution
o   Distributed Cache (with Program)
•  Counters (with Program)
o   Types of Counters
o   Task Counters
o   Job Counters
o   User-Defined Counters
o   Propagation of Counters
•  Job Scheduling


PIG:-

•  Introduction to Apache PIG
•  Introduction to the PIG Data Flow Engine
•  MapReduce vs PIG in Detail
•  When Should PIG Be Used?
•  Data Types in PIG
•  Basic PIG Programming
•  Modes of Execution in PIG
o   Local Mode
o   MapReduce Mode
•  Execution Mechanisms
o   Grunt Shell
o   Script
o   Embedded
•  Operators/Transformations in PIG
•  PIG UDFs with Program
•  Word Count Example in PIG
•  The Difference Between MapReduce and PIG


SQOOP:-

•  Introduction to SQOOP
•  Use of SQOOP
•  Connecting to a MySQL Database
•  SQOOP Commands
o   Import
o   Export
o   Eval
o   Codegen, etc…
•  Joins in SQOOP
•  Export to MySQL

HIVE:-

•  Introduction to HIVE
•  HIVE Metastore
•  HIVE Architecture
•  Tables in HIVE
o   Managed Tables
o   External Tables
•  Hive Data Types
o   Primitive Types
o   Complex Types
•  Partitions
•  Joins in HIVE
•  HIVE UDFs and UDAFs with Programs
•  Word Count Example


HBASE:-

•  Introduction to HBASE
•  Basic Configurations of HBASE
•  Fundamentals of HBase
•  What is NoSQL?
•  HBase Data Model
o   Table and Row
o   Column Family and Column Qualifier
o   Cell and its Versioning
•  Categories of NoSQL Databases
o   Key-Value Database
o   Document Database
o   Column Family Database
•  SQL vs NoSQL
•  How HBASE Differs from an RDBMS
•  HDFS vs HBase
•  Client-side Buffering and Bulk Uploads
•  Designing HBase Tables
•  HBase Operations
o   Get
o   Scan
o   Put
o   Delete

MongoDB:-

•  What is MongoDB?
•  Where to Use It?
•  Configuration on Windows
•  Inserting Data into MongoDB
•  Reading Data from MongoDB

Cluster Setup:-

•  Downloading and Installing Ubuntu 12.x
•  Installing Java
•  Installing Hadoop
•  Creating a Cluster
•  Increasing/Decreasing the Cluster Size
•  Monitoring the Cluster Health
•  Starting and Stopping the Nodes

OOZIE:-

•  Introduction to OOZIE
•  Use of OOZIE
•  Where to Use It?


Hadoop Ecosystem Overview:-

•  Oozie
•  HBase
•  Pig
•  Sqoop
•  Cassandra
•  Chukwa
•  Mahout
•  ZooKeeper
•  Flume


•  Case Study Discussions
•  Certification Guidance
•  Real-time Certification and Interview Questions and Answers
•  Resume Preparation
•  Providing All Materials and Links
•  Real-time Project Explanation and Practice


FOR FREE DEMO contact us at:

Email : training@apex-online-it-training.com

Phone/WhatsApp : +91-(868) 622-2553