Why Hadoop?
With new technologies and devices, communication keeps increasing and data is growing rapidly.
A few decades ago the world's data was small and limited, but since then it has grown day by day. The largest volumes of data are generated by stock markets, social networking sites and general internet usage.
The amount of data produced by us up to the beginning of 2003 was about 5 billion gigabytes. Today, 2.5 quintillion bytes of data are created every day, and 90 percent of the world's data was created in the past two years alone.
Wal-Mart processes one million customer transactions per hour, which are stored in databases estimated to contain more than 2.5 petabytes of data. Data at this scale has to be processed without any loss or delay.
Big data simply means very large data: collections of data sets so big that they are difficult to handle with traditional tools.
This data comes from various sources:
1. Social media data
2. Stock exchange data
3. Power grid data
4. Import export data
5. Search engine data
6. Weather forecasting data
7. Satellite data
These are some of the sources of large data sets.
Traditionally, data is classified into three types:
1. Structured data
2. Semi structured data
3. Unstructured data
Big data technologies are important because they enable more accurate analysis, which in turn leads to better decision making.
Installation Of Hadoop
Hadoop runs on the Linux kernel. If you want to install Hadoop on Windows, you first need to install Cygwin on your machine.
Cygwin creates a Linux-like environment on Windows. Here is the link to get Cygwin: https://cygwin.com/install.html
Hadoop can be installed as a multi-node cluster or as a single-node cluster; you can choose either one.
In this blog post I cover the installation of a single-node cluster on a Linux machine.
Step 1:
Java is mandatory for running Hadoop, so first check whether Java is installed on your machine.
To check for Java:
$java -version
If Java is not installed, follow the step below to install it on Linux.
Step 2:
$sudo apt-get install oracle-java8-installer
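Note: oracle-java8-installer is not in the default Ubuntu repositories, so depending on your release you may need to add a third-party PPA first. A simpler alternative (my suggestion, assuming an apt-based system) is to install OpenJDK 8:
$sudo apt-get install openjdk-8-jdk
If you go with OpenJDK, the JAVA_HOME used in Step 6 will typically be something like /usr/lib/jvm/java-8-openjdk-amd64 instead of the path shown there.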
Step 3:
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine where Hadoop runs. For a single-node cluster you need to configure SSH access to localhost for your user.
Generate an SSH key pair with an empty passphrase:
$ssh-keygen -t rsa -P ""
Then enable key-based access to your local machine:
$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
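To verify the setup, connect to your own machine over SSH; it should log you in without asking for a password (the very first connection may ask you to confirm the host key):
$ssh localhost
$exit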
Step 4:
Hadoop is open source software from the Apache Software Foundation. Go to the Apache Hadoop site and download the latest version that suits your machine.
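For example, version 1.2.1 (the version used in the rest of this post) can be downloaded from the Apache archive; the exact URL is my assumption and mirror paths may vary:
$wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz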
Extract the downloaded tar file.
$tar xvfz hadoop-1.2.1.tar.gz
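The remaining commands in this post assume you are inside the extracted directory, so change into it:
$cd hadoop-1.2.1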
Next, make the following changes to the configuration files under conf/.
Change 1:
core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOPDATASTORE</value>
    <description>A base for other temporary directories</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
Change 2:
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
Change 3:
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Step 5:
In conf/masters and conf/slaves, change the contents of each file to localhost.
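After this change, each of the two files contains just the single line localhost, which you can confirm with cat:
$cat conf/masters
localhost
$cat conf/slaves
localhost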
Step 6:
It is essential to set up the environment variables for Hadoop and Java.
For a temporary setup, run the commands below:
$export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
$export HADOOP_COMMON_HOME=/home/hadoop/hadoop-install/hadoop-1.2.1
For a permanent setup:
Open .bashrc and append the lines below at the end of the file.
To open .bashrc:
$gedit ~/.bashrc
And add the following two lines:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
export HADOOP_COMMON_HOME=/home/hadoop/hadoop-install/hadoop-1.2.1
Once that is done, run the command below to reload the file:
$source ~/.bashrc
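Optionally, you can also append the Hadoop bin directory to PATH in .bashrc so that the hadoop command works from any directory. This line is an extra suggestion on top of the steps above and assumes the HADOOP_COMMON_HOME value we just set:
export PATH=$PATH:$HADOOP_COMMON_HOME/bin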
Step 7:
Format the Hadoop file system from inside the Hadoop directory. This only needs to be done once, when the cluster is first set up.
$./bin/hadoop namenode -format
Step 8:
Start the cluster:
$./bin/start-all.sh
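To check that the cluster actually started, list the running Java processes with jps. On a single-node Hadoop 1.x setup you should see the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker daemons (plus Jps itself):
$jps
You can also open the NameNode web UI at http://localhost:50070 and the JobTracker web UI at http://localhost:50030 in a browser to check the cluster status.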
Step 9:
To stop the cluster.
$./bin/stop-all.sh