Introduction to Hadoop Architecture

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered environments.

Hadoop Service Architecture

HDFS (Hadoop Distributed File System) Overview

Hadoop is normally deployed on a group of machines (a cluster).

  • Each machine in the cluster is a node.
  • One of the nodes acts as the master node, called the namenode. This node manages the overall file system.

The namenode stores

  1. The directory structure
  2. Metadata for all the files

Other nodes are called datanodes

  1. The data is physically stored on these nodes

Let’s see how a file is stored in HDFS.

First, the file is broken into blocks of 128 MB each (the default block size).

  • This large block size is chosen to keep the time spent seeking to a block on disk small relative to the time spent reading it.
  • The blocks are then stored across the datanodes.
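The block arithmetic above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the function names are hypothetical, and the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers, not measurements.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def seek_overhead(block_size, seek_seconds=0.010,
                  transfer_bytes_per_second=100 * 1024 * 1024):
    """Fraction of the time to read one block that is spent seeking.

    The default seek time and transfer rate are assumptions for illustration.
    """
    transfer_seconds = block_size / transfer_bytes_per_second
    return seek_seconds / (seek_seconds + transfer_seconds)


if __name__ == "__main__":
    mb = 1024 * 1024
    # A 300 MB file becomes two full blocks plus one smaller block.
    print([size // mb for size in split_into_blocks(300 * mb)])
    # Compare the seek share for a 128 MB block and a 4 KB block.
    print(seek_overhead(BLOCK_SIZE), seek_overhead(4 * 1024))
```

Under these assumed numbers, seeking accounts for under 1% of the time to read a 128 MB block but for almost all of the time to read a 4 KB block, which is the reasoning behind the large block size.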

Overall storage picture

Block locations for each file are stored in the namenode.
A file is read using

  1. The metadata in the namenode.
  2. The blocks in the datanodes.

Each block is replicated across several datanodes. The default replication factor is 3.
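To make the split between metadata and data concrete, here is a toy in-memory model. It is a sketch under loud assumptions: `ToyNameNode`, `write_file`, and `read_file` are hypothetical names, datanodes are plain dictionaries, and replicas are placed round-robin, which is a stand-in for the real HDFS placement policy.

```python
REPLICATION = 3  # default replication factor mentioned above


class ToyNameNode:
    """Keeps metadata only: the blocks of each file and where the replicas live."""

    def __init__(self, datanode_names, replication=REPLICATION):
        self.datanode_names = list(datanode_names)
        if replication > len(self.datanode_names):
            raise ValueError("need at least as many datanodes as replicas")
        self.replication = replication
        self.block_map = {}  # path -> [(block_id, [datanode name, ...]), ...]
        self._cursor = 0

    def allocate(self, path, num_blocks):
        """Pick datanodes for every block of a new file (assumed round-robin)."""
        count = len(self.datanode_names)
        entries = []
        for index in range(num_blocks):
            targets = [self.datanode_names[(self._cursor + r) % count]
                       for r in range(self.replication)]
            self._cursor = (self._cursor + 1) % count
            entries.append((f"{path}#blk{index}", targets))
        self.block_map[path] = entries
        return entries

    def locate(self, path):
        """Return the block list and replica locations for a file."""
        return self.block_map[path]


def write_file(namenode, datanodes, path, data, block_size):
    """Split data into blocks and store each replica on its datanode."""
    num_blocks = -(-len(data) // block_size)  # ceiling division
    for index, (block_id, targets) in enumerate(namenode.allocate(path, num_blocks)):
        chunk = data[index * block_size:(index + 1) * block_size]
        for name in targets:
            datanodes[name][block_id] = chunk


def read_file(namenode, datanodes, path, dead=()):
    """Ask the namenode where the blocks are, then fetch them from live datanodes."""
    chunks = []
    for block_id, holders in namenode.locate(path):
        live = [name for name in holders if name not in dead]
        if not live:
            raise IOError(f"all replicas of {block_id} are unavailable")
        chunks.append(datanodes[live[0]][block_id])
    return b"".join(chunks)


if __name__ == "__main__":
    datanodes = {f"dn{i}": {} for i in range(4)}  # name -> {block_id: bytes}
    namenode = ToyNameNode(datanodes)
    write_file(namenode, datanodes, "/logs/a.txt", b"abcdefghij", block_size=4)
    # The file is still readable with two of the four datanodes down.
    print(read_file(namenode, datanodes, "/logs/a.txt", dead={"dn0", "dn1"}))
```

Note that the namenode in this sketch never touches file contents. It only answers "which blocks, on which nodes", and the reader fetches the bytes from the datanodes directly.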

Features of HDFS

  • High Availability
  • Fault tolerance
  • Data Reliability
  • Replication
  • Distributed storage
  • Scalability

YARN Overview

YARN (Yet Another Resource Negotiator) manages resources on the Hadoop cluster.

  • YARN coordinates all the different MapReduce tasks running on the cluster.
  • YARN also monitors for failures and assigns new nodes when others fail.

Sample MapReduce workflow

  1. User defines map and reduce tasks using the MapReduce API
  2. A job will be triggered on the cluster
  3. YARN figures out where and how to run the job, and the result is stored in HDFS
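To give a feel for what "map and reduce tasks" means in step 1, here is a word count written in plain Python. It does not use the Hadoop MapReduce API (which is Java) and runs in a single process, so the function names and the in-memory shuffle are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "Big cluster data"]))
```

On a real cluster the map calls would run in parallel on the nodes that hold the input blocks, and the shuffle would move data between nodes. Here everything happens in one loop, but the shape of the job is the same.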

YARN does this using 2 services:

ResourceManager and NodeManager

There is one ResourceManager per Hadoop cluster. The ResourceManager service runs on a single node, usually the same node as the HDFS namenode.

  • The ResourceManager launches tasks that are submitted to YARN.
  • It optimizes for cluster utilization, subject to constraints such as capacity guarantees and fairness.

A NodeManager service runs on each node in the cluster, i.e. on all the datanodes.

  • The NodeManager launches and monitors all tasks running on that node.
  • It coordinates with the ResourceManager in order to perform its tasks.
  • It monitors resources, manages logs, and tracks the health of the node: everything related to the one node in its charge.
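The division of labour between the two services can be sketched with a small simulation. This is not how YARN is implemented: `ToyResourceManager` and `ToyNodeManager` are hypothetical classes, fixed "slots" stand in for YARN's resource accounting, and the "most free slots first" rule is an assumed, simplified scheduling policy.

```python
class ToyNodeManager:
    """Runs on one node: launches tasks there and reports its own capacity."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots    # assumed fixed number of task slots per node
        self.running = []
        self.healthy = True

    def free_slots(self):
        return self.slots - len(self.running) if self.healthy else 0

    def launch(self, task):
        if self.free_slots() <= 0:
            raise RuntimeError(f"{self.name} has no free slot for {task}")
        self.running.append(task)


class ToyResourceManager:
    """One per cluster: decides which node runs each submitted task."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.pending = []

    def submit(self, task):
        self.pending.append(task)
        self.schedule()

    def schedule(self):
        """Place each pending task on the node with the most free slots."""
        still_pending = []
        for task in self.pending:
            candidates = [nm for nm in self.node_managers if nm.free_slots() > 0]
            if candidates:
                max(candidates, key=lambda nm: nm.free_slots()).launch(task)
            else:
                still_pending.append(task)  # wait until capacity frees up
        self.pending = still_pending

    def handle_failure(self, failed):
        """Mark a node as failed and reschedule its tasks on other nodes."""
        failed.healthy = False
        self.pending = failed.running + self.pending
        failed.running = []
        self.schedule()


if __name__ == "__main__":
    node1, node2 = ToyNodeManager("node1", 2), ToyNodeManager("node2", 2)
    rm = ToyResourceManager([node1, node2])
    for task in ["t1", "t2", "t3"]:
        rm.submit(task)
    rm.handle_failure(node1)
    print(node1.running, node2.running, rm.pending)
```

The point of the sketch is the separation of concerns: the ResourceManager sees the whole cluster and makes placement decisions, while each NodeManager only knows about, and acts on, its own node.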



About Andanayya

Experienced Hadoop Developer with a demonstrated history of working in the computer software industry. Skilled in Big Data, Spark, Java, Sqoop, Hive, Spring Boot and SQL.
