Home / Posts / Introduction to Map Reduce

Introduction to Map Reduce

MapReduce is a Distributed computing programming model suitable for processing of huge data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python.

MapReduce programs are parallel in nature, thus are very useful for performing large-scale data analysis using multiple machines in the cluster.

MapReduce is defined of 2 functions

map() and  reduce() and The rest is taken care by Hadoop

Let’s take an example :

Objective : Create Frequency Distribution of words in the  file

we have a large text file

The text file has been divided into blocks and stored in HDFS

Each block here would represent a part of the text file

The map step will generate a list of key  value pairs on each node

From now on all the inputs and outputs are formatted as <key,value> pairs

These are all copied over to one single node, On that node an operation called Sort/Merge occurs

  • Map function will run once fore each  line of the  text file.
  • Both the  input and output need to be formatted as <key,value> pairs.
  • This operation can run in parallel on each data node there is no interdependency in the inputs and outputs.
  • At the end of the  map phase, we have a set of key-value pairs from each data node
  • All of these results are first copied over to a single node.
  • The data is  sorted based on key,key-value pairs with the same key are merged.
  • reduce() will run on each  pair generated by the sort/merge step

Map Reduce Entire Flow Diagram.




About Andanayya

Experienced Hadoop Developer with a demonstrated history of working in the computer software industry. Skilled in Big Data, Spark,Java,Sqoop, Hive,Spring Boot and SQL

Check Also

Anti Join vs Join in SQL: Differences

In SQL, an anti join is a type of JOIN that returns only the rows …