Tuesday, 11 September 2018

Installation Manual for Hadoop-2.6.0 on Ubuntu-14.04 64 bit


Single Node Cluster



Configure Proxy Setting if required
$ sudo gedit /etc/apt/apt.conf
Add the following lines
Acquire::http::Proxy "http://user-id:password@proxy-address:port";
Acquire::https::Proxy "https://user-id:password@proxy-address:port";
Acquire::ftp::Proxy "ftp://user-id:password@proxy-address:port";
Update the Ubuntu package lists
$ sudo apt-get update
Install JDK
$ sudo apt-get install default-jdk
Or, alternatively
$ sudo apt-get install openjdk-7-jdk
Check the version if Java is already installed
$ java -version
Select a particular Java as the default on your machine
$ sudo update-alternatives --config java
$ sudo update-java-alternatives -s java-6-sun
[java-6-sun is only an example; use a name listed by update-java-alternatives -l]
$ update-alternatives --config java
[note the JVM path, e.g. /usr/lib/jvm/java-7-openjdk-amd64; it is used as JAVA_HOME below]
Adding a dedicated Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Configure the sudo permissions for 'hduser'
$ sudo visudo
[visudo opens the file in nano, the default Ubuntu terminal editor]
[Add the permissions to sudoers, i.e. add this line]
hduser ALL=(ALL) ALL
[Press CTRL + X to exit, enter Y to save the file, then press Enter to confirm the file name]
Install ssh
$ sudo apt-get install ssh
$ sudo apt-get install rsync
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Generate an SSH key for the hduser user
$ su - hduser
$ ssh-keygen -t rsa -P ""
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
Next, enable SSH access to your local machine with this newly created key
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
Download and Install Hadoop-2.6.0
$ wget -c [copied link of hadoop-2.6.0.tar.gz from mirror]
$ sudo mv hadoop-2.6.0.tar.gz Desktop/
$ cd Desktop/
$ sudo tar -zxvf hadoop-2.6.0.tar.gz
$ sudo mv hadoop-2.6.0 /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Edit bashrc file
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Source the variables using the source command
$ source ~/.bashrc
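After sourcing the file, it can be worth confirming that the variables are really set in the current shell. The following is only a sketch: check_hadoop_env is a hypothetical helper name, not a Hadoop command, and the list of variables it checks is an assumption taken from the lines added above.

```shell
# check_hadoop_env is a hypothetical helper, not part of Hadoop.
# It prints the name of every expected variable that is unset or empty
# and returns non-zero if at least one is missing.
check_hadoop_env() {
    missing=0
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR YARN_HOME; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Report the result for the current shell
if check_hadoop_env; then
    echo "Hadoop environment looks complete"
fi
```

If a variable is reported as not set, check the spelling in ~/.bashrc and source the file again.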
Edit hadoop-env.sh file
$ cd /usr/local/hadoop/etc/hadoop/
$ sudo gedit hadoop-env.sh
Add path for JAVA_HOME
# the java implementation to use.
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
Optionally, add the following lines at the end of the file
# To set the Hadoop installation directory
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
# To set Hadoop native library directory
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
# To disable IPv6 only for Hadoop
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# To set the library directory
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
Edit yarn-env.sh file
$ cd /usr/local/hadoop/etc/hadoop/
$ sudo gedit yarn-env.sh
Add path for JAVA_HOME   
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
Create Temporary Directory
Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory for both the local file system and HDFS. To use a different directory, create it and set the required ownership and permissions. If you do this, you must also set hadoop.tmp.dir in core-site.xml.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown -R hduser:hadoop /app/hadoop/tmp
$ sudo chmod -R 755 /app/hadoop/tmp
If you want to tighten up security, use 750 instead of 755
#$ sudo chmod 0750 /app/hadoop/tmp
Edit core-site.xml, yarn-site.xml, mapred-site.xml, hdfs-site.xml
$ cd /usr/local/hadoop/etc/hadoop/
core-site.xml
$ sudo gedit core-site.xml
Add the following configuration
<configuration>
            <property>
                        <name>fs.defaultFS</name>
                        <value>hdfs://localhost:9000</value>
            </property>
</configuration>
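Add the following property inside <configuration> if the default base temporary directory is changed
            <property>
                        <name>hadoop.tmp.dir</name>
                        <value>/app/hadoop/tmp</value>
            </property>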
yarn-site.xml
$ sudo gedit yarn-site.xml
Add the following configurations
<configuration>
            <property>
                        <name>yarn.nodemanager.aux-services</name>
                        <value>mapreduce_shuffle</value>
            </property>
            <property>
                        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
            </property>
</configuration>
mapred-site.xml
$ sudo cp mapred-site.xml.template mapred-site.xml
$ sudo gedit mapred-site.xml
Add the following configuration
<configuration>
            <property>
                        <name>mapreduce.framework.name</name>
                        <value>yarn</value>
            </property>
</configuration>
hdfs-site.xml
$ sudo gedit hdfs-site.xml
Add the following configurations. dfs.blocksize is given in bytes: 134217728 is the default of 128 MB; for 64 MB change it to 67108864
<configuration>
            <property>
                        <name>dfs.replication</name>
                        <value>1</value>
            </property>
            <property>
                        <name>dfs.namenode.name.dir</name>
                        <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
            </property>
            <property>
                        <name>dfs.datanode.data.dir</name>
                        <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
            </property>
            <property>
                        <name>dfs.blocksize</name>
                        <value>134217728</value>
            </property>
</configuration>
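The dfs.blocksize value is simply the size in megabytes multiplied by 1024 twice. As a small sketch of that arithmetic (mb_to_bytes is a hypothetical helper name used only for illustration):

```shell
# mb_to_bytes is a hypothetical helper: converts a size in MB (MiB) to bytes
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 128   # the block size used in the configuration above
mb_to_bytes 64    # the smaller alternative
```

The two calls print 134217728 and 67108864, the two values mentioned for dfs.blocksize.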
Create the directories for the NameNode and DataNode and change the ownership of the hadoop directory
$ cd
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
$ sudo chown hduser:hadoop -R /usr/local/hadoop
Format the namenode
$ hdfs namenode -format
[if this gives an error, check the ~/.bashrc file and run: source ~/.bashrc]
Start the daemon process and check them
$ start-all.sh
$ jps
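On a single node cluster, jps is expected to list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. Comparing by eye is enough, but the check can also be scripted. A minimal sketch, assuming the usual "pid name" output format of jps; missing_daemons is a hypothetical helper, not a Hadoop tool:

```shell
# missing_daemons is a hypothetical helper, not part of Hadoop.
# It reads jps output on standard input and prints each expected
# daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <name>"; compare the second field as a whole word
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
        fi
    done
}

# Typical use on a running node:
#   jps | missing_daemons
```

No output means all five daemons were found. If one is reported missing, its log file under /usr/local/hadoop/logs is the first place to look.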
Access the Web UIs
Verify the Hadoop installation with the web interfaces. Open the following URLs in the browser
HDFS (NameNode) UI: http://localhost:50070
ResourceManager UI: http://localhost:8088
Job History UI: http://localhost:19888 [available once the Job History Server is started]
Multi Node Cluster
Clone the single-node Hadoop installation as master, slave1, slave2 and so on, and put the clones in the vm directory. To do this, first clone the single node cluster as the master node and configure it, then clone this master node as the slave nodes and configure each of them.
Create Master Node
Create the master node by cloning the single node cluster, power it on and open a terminal.
@Master Node
Find inet address
$ ifconfig
Let:
Namenode  >  hadoopmaster  >  192.168.23.130
Datanodes >  hadoopslave1  >  192.168.23.131
             hadoopslave2  >  192.168.23.132
Edit hosts file
$ sudo gedit /etc/hosts
Add the name and address of nodes in the file
192.168.23.130           hadoopmaster
192.168.23.131           hadoopslave1
192.168.23.132           hadoopslave2
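Since every node needs the same entries, a quick check that none was forgotten can save debugging time later. The sketch below only reads a hosts-format file; hosts_missing is a hypothetical helper name chosen for this example.

```shell
# hosts_missing is a hypothetical helper. Given a hosts-format file and a
# list of hostnames, it prints every hostname that has no entry in the file.
hosts_missing() {
    hosts_file=$1
    shift
    for host in "$@"; do
        # Skip comment lines, then look for the name in any column after the IP
        if ! awk -v h="$host" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "no entry for: $host"
        fi
    done
}

# Typical use on any node:
#   hosts_missing /etc/hosts hadoopmaster hadoopslave1 hadoopslave2
```

No output means every listed hostname has an entry.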
Edit hostname file
$ sudo gedit /etc/hostname                             [reboot after saving]
[replace the contents with the master's hostname]
hadoopmaster
Edit core-site.xml, yarn-site.xml, mapred-site.xml, hdfs-site.xml
$ cd /usr/local/hadoop/etc/hadoop
core-site.xml
$ sudo gedit core-site.xml
Replace localhost with hadoopmaster in the configuration
hdfs-site.xml
$ sudo gedit hdfs-site.xml
Change the dfs.replication value from 1 to 3. This is the number of copies kept of each block, so it should not exceed the number of DataNodes
yarn-site.xml
$ sudo gedit yarn-site.xml
Add the following properties inside the existing <configuration> element, without modifying the existing properties
<configuration>
            <property>
                        <name>yarn.resourcemanager.resource-tracker.address</name>
                        <value>hadoopmaster:8025</value>
            </property>
            <property>
                        <name>yarn.resourcemanager.scheduler.address</name>
                        <value>hadoopmaster:8030</value>
            </property>
            <property>
                        <name>yarn.resourcemanager.address</name>
                        <value>hadoopmaster:8050</value>
            </property>
            <property>
                        <name>yarn.resourcemanager.hostname</name>
                        <value>hadoopmaster</value>
                        <description>The hostname of the RM.</description>
            </property>
            <property>
                        <name>mapreduce.framework.name</name>
                        <value>yarn</value>
            </property>
</configuration>
mapred-site.xml
$ sudo gedit mapred-site.xml
Add the following properties inside the existing <configuration> element
<property>
            <name>mapreduce.jobhistory.address</name>
            <value>hadoopmaster:10020</value>
</property>

<property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>hadoopmaster:19888</value>
</property>
Manage directory for NameNode and DataNode
To make the Master Node act as both NameNode and DataNode, create dedicated directories for both inside hadoop. To dedicate the Master Node to the NameNode only, create just the NameNode directory on it, and create just the DataNode directory on every Slave Node.
We keep the Master Node as a NameNode only, so remove the existing data directories; the NameNode directory is recreated later.
$ sudo rm -rf /usr/local/hadoop/hadoop_data
Create Slave Nodes
Shut down the Master Node and create the Slave Nodes slave1, slave2, etc. by cloning it. Clone hadoopmaster as hadoopslave1, hadoopslave2, hadoopslave3, ... for as many Slave Nodes as required. Power on the Master Node and all the Slave Nodes.
Configure Slave Nodes
@ Slave Node [repeat on each slave node]
Change host name
$ sudo gedit /etc/hostname
hadoopslave<nodenumberhere>          e.g. hadoopslave1, hadoopslave2
Create Directory for DataNode and Change the Ownership
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Edit hdfs-site.xml
$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Remove the dfs.namenode.name.dir property section.
Reboot all Slave Nodes
$ sudo reboot
Configure Master Node
@ Master Node
Edit masters file
$ sudo gedit /usr/local/hadoop/etc/hadoop/masters                [have not used in my l]
Remove any existing contents and write
hadoopmaster
Edit slaves file
$ sudo gedit /usr/local/hadoop/etc/hadoop/slaves
Remove localhost and add the following
hadoopmaster  [only if you want the master to run a DataNode too]
hadoopslave1
hadoopslave2
hadoopslave3
Edit hdfs-site.xml
$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Remove the dfs.datanode.data.dir property section from the configuration, since the Master Node is dedicated to the NameNode only.
Create Directory for NameNode and Change the Ownership
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Copy the ssh rsa/dsa public keys to the Master Node and Slave Nodes
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopmaster
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave1
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave2
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave3
[for the rsa key]
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopslave3
Or
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave   [replace slave with the hostname of the node]
Login to Each Node for Verification
$ ssh hadoopmaster
$ exit
$ ssh hadoopslave1
$ exit
$ ssh hadoopslave2
$ exit
$ ssh hadoopslave3
$ exit
Format the NameNode
$ hdfs namenode -format
[run on the master only; hadoop namenode -format is the older, deprecated form. DataNodes are not formatted: they initialise their data directories on first start]
Start the Daemon Processes and Check them
$ start-all.sh
Or
$ start-dfs.sh && start-yarn.sh
$ jps [log in to each node and run jps to check]
Starting Job History Server
$ mr-jobhistory-daemon.sh start historyserver
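Open the Web UIs
http://hadoopmaster:8088/    [ResourceManager]
http://hadoopmaster:50070/   [NameNode]
http://hadoopmaster:50090/   [SecondaryNameNode]
http://hadoopmaster:50075/   [DataNode, only if the master also runs a DataNode]
http://hadoopmaster:19888/   [Job History Server]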

Eclipse Installation
$ sudo tar xzf eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz
$ sudo mv eclipse /opt/
$ sudo gedit /usr/share/applications/eclipse.desktop
            add these lines
            [Desktop Entry]
            Name=Eclipse
            Type=Application
            Exec=/opt/eclipse/eclipse
            Terminal=false
            Icon=/opt/eclipse/icon.xpm
            Comment=Integrated Development Environment
            NoDisplay=false
            Categories=Development;IDE;
            Name[en]=eclipse.desktop

create a symlink
$ cd /usr/local/bin
$ sudo ln -s /opt/eclipse/eclipse
launch the eclipse
$ /opt/eclipse/eclipse -clean &
Optional
If Eclipse does not start, check the permission of the executable file and the system architecture
$ sudo chmod +x /opt/eclipse/eclipse
$ uname -i
Hadoop-2.6.0 Eclipse Plugin Build
Install ant tool to build this plugin
$ sudo apt-get install ant
Download and Extract
Download eclipse plugin for hadoop 2.x.x from the link
https://github.com/winghc/hadoop2x-eclipse-plugin
Extract to a local directory
$ sudo unzip hadoop2x-eclipse-plugin-master.zip
Build using ant
$ cd hadoop2x-eclipse-plugin-master/src/contrib/eclipse-plugin/
$ sudo ant jar -Dversion=2.6.0 -Declipse.home=/opt/eclipse -Dhadoop.home=/usr/local/hadoop
On a successful build, the jar file is in the build directory
$ cd ../../../build/contrib/eclipse-plugin/
There is a jar file named "hadoop-eclipse-plugin-2.6.0.jar"
Copy jar file to /opt/eclipse/plugins
$ sudo cp hadoop-eclipse-plugin-2.6.0.jar /opt/eclipse/plugins/
Configure in Eclipse IDE
Go to Window --> Open Perspective --> Other and select 'Map/Reduce' perspective
Add the Server. In the Map/Reduce Locations panel, click on the elephant logo in the upper-right corner to add a new server to Eclipse.
Define Hadoop Location:
Location Name: localhost [in a multi-node cluster use hadoopmaster instead of localhost]
Map/Reduce(V2) Master:   Host: localhost; Port: 9001 [default is 50020]
DFS Master:              Host: localhost; Port: 9000 [default is 50040; must match the port in fs.defaultFS]

Running MapReduce job on Hadoop Cluster
$ start-all.sh
$ jps
$ cd
$ cd Desktop/
$ mkdir www
$ cd www
$ jps >> example.txt
$ cd /usr/local/hadoop/
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser
$ bin/hdfs dfs -put /home/hduser/Desktop/www input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
$ bin/hdfs dfs -cat output/*
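To sanity-check the result on a small input, the same counts can be reproduced locally with standard tools. This is only an illustration of what the wordcount example computes (one whitespace-separated word per line, a tab, and its count), not how Hadoop implements it; local_wordcount is a hypothetical helper name.

```shell
# local_wordcount is a hypothetical helper that mimics the output of the
# wordcount example: each distinct word, a tab, and its number of occurrences,
# sorted by word. It reads text from standard input.
local_wordcount() {
    # Put every whitespace-separated word on its own line
    tr -s '[:space:]' '\n' |
        # Count the occurrences of each non-empty line
        awk 'NF { count[$0]++ } END { for (w in count) printf "%s\t%d\n", w, count[w] }' |
        # Sort in byte order, as the job output is sorted by key
        LC_ALL=C sort
}

# Example on a one-line input
printf 'to be or not to be\n' | local_wordcount
```

For the small example.txt created above, `local_wordcount < /home/hduser/Desktop/www/example.txt` should give the same lines as the job output.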

Some Commands to Play with Hadoop
To check the architecture of the system
[x86_64 means the system is 64-bit; i386 or i686 means 32-bit]
$ uname -m
Checking version of Java and Hadoop
$ java -version
$ which java
$ hadoop version
$ hdfs version
Create the user directory in HDFS
$ cd /usr/local/hadoop/
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser
List the files in hdfs
$ bin/hdfs dfs -ls /
$ bin/hdfs dfs -ls /user
$ bin/hdfs dfs -ls /user/hduser
Copy files to hdfs from local system
$ bin/hdfs dfs -put /home/hduser/data input
$ bin/hdfs dfs -ls /user/hduser/input
$ bin/hdfs dfs -ls -R /user/hduser
Run the job
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
## since $HADOOP_HOME/bin is on the PATH, this form also works when run from /usr/local/hadoop/
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
See the output
$ bin/hdfs dfs -cat output/*
$ bin/hdfs dfs -get /user/hduser/output /home/hduser/outputdata
$ cat /home/hduser/outputdata/part-r-00000
Remove the output directory
$ bin/hdfs dfs -rm -r output
Commands for starting and stopping individual daemon processes
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ sbin/stop-dfs.sh
$ sbin/stop-yarn.sh
Commands for starting and stopping all daemon processes
$ sbin/start-all.sh
$ sbin/stop-all.sh
Tracking which data block is in which data node in hadoop
$ hadoop fsck / -files -blocks -locations
$ hadoop fsck /user/hduser/input-wordcount -files -blocks -locations -racks
Check default block size
$ hdfs getconf -confKey dfs.blocksize
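The block size determines how many blocks a file is split into, which is what the fsck output above lists. As a rough sketch of that relationship (block_count is a hypothetical helper name; the sizes are example values):

```shell
# block_count is a hypothetical helper: the number of HDFS blocks a file of
# the given size occupies, rounding up because the last block may be partial.
block_count() {
    file_bytes=$1
    block_bytes=$2
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

# A 300 MB file with a 128 MB (134217728 byte) block size
block_count $(( 300 * 1024 * 1024 )) 134217728
```

With dfs.replication set to 3, each of these blocks is stored three times across the DataNodes.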
Take the NameNode out of safe mode
$ hadoop dfsadmin -safemode leave   [older, deprecated form]
$ hdfs dfsadmin -safemode leave
Check the health of the file system
$ hadoop fsck /



