Introduction to Hadoop Architecture

Andanayya August 18, 2019 Posts Comments Off on Introduction to Hadoop Architecture 803 Views

Hadoop is an open source Distributed processing framework that manages data processing and storage for big data applications running in clustered environments.

Hadoop Service Architecture

HDFS(Hadoop Distributed File System) Overview

Hadoop is normally deployed on a group of machines (Cluster)

Each machine in cluster is node
One of the node acts as the master node, This node manages the overall file system

The namenode stores

The directory structure
Metadata for all the files

Other nodes are called datanodes

The data is physically stored on these nodes

Let’s see how this files is stored in HDFS

First the file is broken into blocks of size 128 MB

This size is chosen to minimize the time to seek to the block on the disk
The blocks are the stored across the data nodes

Over all storage picture

Block locations for each files are stored in namenode
A file is read using

The metadata in namenode.
The blocks in the datanode.
The default replication factor is 3.

Features of HDFS

High Availability
Fault tolerance
Data Reliability
Replication
Distributed storage
Scalability

YARN Overview

YARN(Yet Another Resource Negotiator) is used for the management of resources on the Hadoop cluster.

YARN co-ordinates all the different MapReduce task running on the cluster.
YARN also monitors for failures and assigns new nodes when other fail

Sample MapReduce workflow

User defines map and reduce tasks using the MapReduce API
A job will be triggered on the cluster
YARN figures out where and how to rub the job, and stores the result in HDFS

YARN does this using 2 services

Resourcemanager and Nodemanager

There is 1 ResourcManager for a Hadoop cluster and the ResourceManager service runs on a single node -usually the same node as HDFS namenode

The Resource manager launches tasks that are submitted to YARN.
It optimizes for cluster utilization based on constraints such as capacity guarantees, fairness.

A NodeManager service runs on each node in the cluster i.e. all the data nodes

The NodeManager launches and monitors all tasks running on that node.
It coordinates with the ResourceManager in order to perform its tasks.
It monitors resources, logs,tracks the health of the node etc. everything related to the one node that is in its charge.

IT2EDU Empowering Education Through Technology

Introduction to Hadoop Architecture

Related Articles

About Andanayya

Check Also

How to Use Google Key Word Planner

Introduction to Hadoop Architecture

Related Articles

Exploring Top 5 Google AdSense Alternatives for Monetizing Your WordPress Website

Best Online Work From Home Jobs – Virtual Employment

Step-by-Step Guide: Connect Oracle Database in SSIS Package (With Data Flow Task) – Boost Your Data Integration [2023]

About Andanayya

Check Also

How to Use Google Key Word Planner