Single Node Cluster
$ sudo gedit /etc/apt/apt.conf
Add the following lines (needed only if apt has to go through a proxy)
Acquire::http::Proxy "http://user-id:password@proxy-address:port";
Acquire::https::Proxy "https://user-id:password@proxy-address:port";
Acquire::ftp::Proxy "ftp://user-id:password@proxy-address:port";
Update the Ubuntu package lists
$ sudo apt-get update
Install JDK
$ sudo apt-get install default-jdk
Or, alternatively
$ sudo apt-get install openjdk-7-jdk
Check the version if Java is already installed.
$ java -version
Select a particular Java as the default on your machine.
$ sudo update-alternatives --config java
$ sudo update-java-alternatives -s java-6-sun
$ update-alternatives --config java
[copy the path /usr/lib/jvm/java-7-openjdk-amd64]
Adding a dedicated Hadoop system user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Configure the sudo permissions for 'hduser'
$ sudo visudo
[The default Ubuntu text editor is nano; CTRL + O writes the file to disk.]
[Add the permissions to sudoers, i.e. add this line]
hduser ALL=(ALL) ALL
[Use the CTRL + X keyboard shortcut to exit. Enter Y to save the file.]
Install ssh
$ sudo apt-get install ssh
$ sudo apt-get install rsync
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Generate an SSH key for the hduser user
$ su - hduser
$ ssh-keygen -t rsa -P ""
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
hduser@ubuntu:~$
Second, you have to enable SSH access to your local machine with this newly created key
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
$ ssh localhost
Download and Install Hadoop-2.6.0
$ wget -c [copied link of hadoop-2.6.0.tar.gz from mirror]
$ sudo mv hadoop-2.6.0.tar.gz Desktop/
$ cd Desktop/
$ sudo tar -zxvf hadoop-2.6.0.tar.gz
$ sudo mv hadoop-2.6.0 /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Edit bashrc file
$ sudo gedit ~/.bashrc
Add the following lines at the end of the file
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
Source the variables using the source command
$ source ~/.bashrc
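After sourcing the file it can help to confirm that the variables are really set in the current shell. The following is a minimal sketch, not part of the original steps; the function name check_env is made up for illustration, and it only tests that each named variable is non-empty, not that the path it holds exists.

```shell
# Report which of the named environment variables are unset or empty.
# Prints one "MISSING <name>" line per problem and returns non-zero if any is missing.
# check_env is a hypothetical helper name, not a Hadoop command.
check_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING $name"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_env JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR prints nothing when all three are set; if it prints a name, re-check ~/.bashrc and source it again.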
Edit hadoop-env.sh file
$ cd /usr/local/hadoop/etc/hadoop/
$ sudo gedit hadoop-env.sh
Add path for JAVA_HOME
# The java implementation to use.
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
Add some more settings, collected from other sites, at the end of the file
# To set the Hadoop installation directory
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
# To set Hadoop native library directory
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
# To disable IPv6 only for Hadoop
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
# To set the library directory
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib"
Edit yarn-env.sh file
$ cd /usr/local/hadoop/etc/hadoop/
$ sudo gedit yarn-env.sh
Add path for JAVA_HOME
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64"
Create Temporary Directory
By default, Hadoop uses hadoop.tmp.dir as the base temporary directory for both the local file system and HDFS. To use a different directory, create it and set the required ownership and permissions. If you do this, you must also set hadoop.tmp.dir in core-site.xml.
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown -R hduser:hadoop /app/hadoop/tmp
If you want to tighten up security, chmod from 755 to 750
#$ sudo chmod 0750 /app/hadoop/tmp
$ sudo chmod -R 755 /app/hadoop/tmp
Edit core-site.xml, yarn-site.xml, mapred-site.xml, hdfs-site.xml
$ cd /usr/local/hadoop/etc/hadoop/
core-site.xml
$ sudo gedit core-site.xml
Add the following configuration
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
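If the default base temporary directory was changed, also add a hadoop.tmp.dir property to this configuration, with the new directory (/app/hadoop/tmp) as its value.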
yarn-site.xml
$ sudo gedit yarn-site.xml
Add the following configurations
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
mapred-site.xml
$ sudo cp mapred-site.xml.template mapred-site.xml
$ sudo gedit mapred-site.xml
Add the following configuration
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
$ sudo gedit hdfs-site.xml
Add the following configurations. The dfs.blocksize value is written in bytes: 134217728 is the default 128 MB; for 64 MB change it to 67108864.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
</configuration>
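The byte values used for dfs.blocksize are simply megabytes multiplied by 1024 twice. As a small sketch (mb_to_bytes is a helper name invented here, not a Hadoop tool), shell arithmetic reproduces both numbers quoted above:

```shell
# Convert a size in megabytes to the byte count written in dfs.blocksize.
# mb_to_bytes is a hypothetical helper name used only for this illustration.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 128   # prints 134217728
mb_to_bytes 64    # prints 67108864
```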
Make the directories for the namenode and datanode and change the ownership of the hadoop directory
$ cd
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
$ sudo chown hduser:hadoop -R /usr/local/hadoop
Format the namenode
$ hdfs namenode -format
[if there is an error, check the bashrc file and run the command: source ~/.bashrc]
Start the daemon processes and check them
$ start-all.sh
$ jps
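On a single node started this way, jps would normally be expected to list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager. The sketch below reads jps output on standard input and prints whichever of those names are absent; the function name missing_daemons is made up for illustration, and the list of daemons is an assumption about a default Hadoop 2.x single-node setup.

```shell
# Read `jps` output on stdin and print each expected daemon that is not listed.
# missing_daemons is a hypothetical helper; the daemon list assumes a default
# single-node Hadoop 2.x setup started with start-all.sh.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <name>"; compare the whole second field so that
    # "NameNode" does not match "SecondaryNameNode"
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}
```

Run it as jps | missing_daemons; no output means all five daemons were found.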
Access the Web UIs
Verify the Hadoop installation with the web interfaces. Open the browser with the following URLs
HDFS UI: http://localhost:50070
ResourceManager UI: http://localhost:8088
Job History UI: http://localhost:19888
NodeManager
Multi Node Cluster
Clone the single-node Hadoop installation as a master and as slave1, slave2 and so on, and put the clones in the VM directory. To do this, first clone the single node cluster as the master node and configure it, then clone this master node as the various slave nodes and configure them.
Create Master Node
Create the master node from the single node cluster by cloning, power on the master node and open a terminal.
@Master Node
Find the inet address
$ ifconfig
Let:
Namenode > hadoopmaster > 192.168.23.130
Datanodes > hadoopslave1 > 192.168.23.131
hadoopslave2 > 192.168.23.132
Edit hosts file
$ sudo gedit /etc/hosts
Add the name and address of the nodes in the file
192.168.23.130 hadoopmaster
192.168.23.131 hadoopslave1
192.168.23.132 hadoopslave2
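To confirm that a name was entered correctly, the file can be read back. This is a rough sketch only: lookup_host is a name invented here, it handles the simple "address name" layout shown above, and it does not consult DNS.

```shell
# Print the address mapped to a host name in a hosts-format file.
# Usage: lookup_host NAME [FILE]   (FILE defaults to /etc/hosts)
# lookup_host is a hypothetical helper, not a system command.
lookup_host() {
  awk -v name="$1" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
  ' "${2:-/etc/hosts}"
}
```

With the entries above in place, lookup_host hadoopslave1 should print 192.168.23.131; an empty result means the name is missing from the file.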
Edit hostname file
$ sudo gedit /etc/hostname
[replace the contents with the master's host name, then reboot]
hadoopmaster
Edit core-site.xml, yarn-site.xml, mapred-site.xml, hdfs-site.xml
$ cd /usr/local/hadoop/etc/hadoop
core-site.xml
$ sudo gedit core-site.xml
Replace localhost with hadoopmaster in the configuration
hdfs-site.xml
$ sudo gedit hdfs-site.xml
Replace the dfs.replication value 1 with 3 in the configuration (this is the number of copies kept of each block, so it should not exceed the number of DataNodes)
yarn-site.xml
$ sudo gedit yarn-site.xml
Add the following configuration without modifying the existing properties
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopmaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopmaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopmaster:8050</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopmaster</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
mapred-site.xml
$ sudo gedit mapred-site.xml
Add the following properties inside the existing configuration
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoopmaster:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoopmaster:19888</value>
</property>
Manage directory for NameNode and DataNode
If you want the Master Node to act as both NameNode and DataNode, create dedicated directories inside hadoop for both. If you want the Master Node dedicated to the NameNode only, create only the NameNode directory on it, and create only the DataNode directory on each of the Slave Nodes.
We keep the Master Node as a NameNode only, so remove the existing data directories. The NameNode directory is created again when the Master Node is configured.
$ sudo rm -rf /usr/local/hadoop/hadoop_data
Create Slave Nodes
Shut down the master node and create the Slave Nodes slave1, slave2, etc. by cloning the Master Node. Clone hadoopmaster as hadoopslave1, hadoopslave2, hadoopslave3 … for as many Slave Nodes as required. Power on the Master Node and all the Slave Nodes.
Configure Slave Nodes
@ Slave Node [for each slave node]
Change host name
$ sudo gedit /etc/hostname
hadoopslave<nodenumberhere> e.g. hadoopslave1, hadoopslave2
Create Directory for DataNode and Change the Ownership
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Edit hdfs-site.xml
$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Remove the dfs.namenode.name.dir property section.
Reboot all Slave Nodes
$ sudo reboot
Configure Master Node
@ Master Node
Edit masters file [optional; I have not used this file]
$ sudo gedit /usr/local/hadoop/etc/hadoop/masters
Remove any existing contents and write
hadoopmaster
Edit slaves file
$ sudo gedit /usr/local/hadoop/etc/hadoop/slaves
Remove localhost and add the following, one host name per line [include hadoopmaster only if you want to keep the master as a slave too]
hadoopmaster
hadoopslave1
hadoopslave2
hadoopslave3
Edit hdfs-site.xml
$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Remove the dfs.datanode.data.dir property section in the configuration, since the Master Node is dedicated to the NameNode only.
Create Directory for NameNode and Change the Ownership
$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ sudo chown -R hduser:hadoop /usr/local/hadoop
Copy ssh rsa/dsa keys to Master Node and Slave Nodes
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopmaster
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave1
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave2
$ ssh-copy-id -i ~/.ssh/id_dsa.pub hduser@hadoopslave3
[for rsa key]
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopslave3
Login to Each Node for Verification
$ ssh hadoopmaster
$ exit
$ ssh hadoopslave1
$ exit
$ ssh hadoopslave2
$ exit
$ ssh hadoopslave3
$ exit
Format the NameNode
$ hdfs namenode -format
[hadoop namenode -format is the older, deprecated form of the same command]
[DataNodes have no format step; they initialise their storage directories when they first start]
Start the Daemon Processes and Check them
$ start-all.sh
Or
$ start-dfs.sh && start-yarn.sh
$ jps [log in to each node and run jps to check]
Starting Job History Server
$ mr-jobhistory-daemon.sh start historyserver
Open the Web UIs
http://hadoopmaster:8088/
http://hadoopmaster:50070/
http://hadoopmaster:50090/
http://hadoopmaster:50075/
http://hadoopmaster:19888/
Eclipse Installation
$ sudo tar xzf eclipse-standard-kepler-SR2-linux-gtk-x86_64.tar.gz
$ sudo mv eclipse /opt/
$ sudo gedit /usr/share/applications/eclipse.desktop
Add these lines
[Desktop Entry]
Name=Eclipse
Type=Application
Exec=/opt/eclipse/eclipse
Terminal=false
Icon=/opt/eclipse/icon.xpm
Comment=Integrated Development Environment
NoDisplay=false
Categories=Development;IDE;
Name[en]=eclipse.desktop
Create a symlink
$ cd /usr/local/bin
$ sudo ln -s /opt/eclipse/eclipse
Launch Eclipse
$ /opt/eclipse/eclipse -clean &
Optional:
If there is a problem, check the executable file permission
$ sudo chmod +x /opt/eclipse/eclipse
$ uname -i
Hadoop-2.4-Eclipse Plugin Build
Install the ant tool to build this plugin
$ sudo apt-get install ant
Download and Extract
Download the eclipse plugin for hadoop 2.x.x from the link
https://github.com/winghc/hadoop2x-eclipse-plugin
Extract to a local directory
$ sudo unzip hadoop2x-eclipse-plugin-master.zip
Build using ant
$ cd hadoop2x-eclipse-plugin-master/src/contrib/eclipse-plugin/
$ sudo ant jar -Dversion=2.6.0 -Declipse.home=/opt/eclipse -Dhadoop.home=/usr/local/hadoop
On a successful build, the jar file named "hadoop-eclipse-plugin-2.6.0.jar" is in build/contrib/eclipse-plugin/ under hadoop2x-eclipse-plugin-master/
$ cd ../../../build/contrib/eclipse-plugin/
Copy the jar file to /opt/eclipse/plugins
$ sudo cp hadoop-eclipse-plugin-2.6.0.jar /opt/eclipse/plugins/
Configure in Eclipse IDE
Go to Window --> Open Perspective --> Other and select the 'Map/Reduce' perspective
Add the server. In the Map/Reduce Locations panel, click on the elephant logo in the upper-right corner to add a new server to Eclipse.
Define Hadoop Location:
Location Name: localhost [in a multi-node cluster use hadoopmaster instead of localhost]
Map/Reduce(V2) Master: Host: localhost; Port: 50020 (default) changed to 9001
DFS Master: Host: localhost; Port: 50040 (default) changed to 9000
Running MapReduce job on Hadoop Cluster
$ start-all.sh
$ jps
$ cd
$ cd Desktop/
$ mkdir www
$ cd www
$ jps >> example.txt
$ cd /usr/local/hadoop/
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser
$ bin/hdfs dfs -put /home/hduser/Desktop/www input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
$ bin/hdfs dfs -cat output/*
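For a small input it can be reassuring to compare the job's result with a count done locally. The pipeline below is only an approximation of what the wordcount example computes, assuming words are separated by whitespace and printed as word, tab, count in byte order; local_wordcount is a name made up for this sketch.

```shell
# Count whitespace-separated words read from stdin and print "word<TAB>count",
# sorted by word. local_wordcount is a hypothetical helper that approximates
# the wordcount example; it is not part of Hadoop.
local_wordcount() {
  tr -s '[:space:]' '\n' |      # one word per line
    grep -v '^$' |              # drop empty lines
    LC_ALL=C sort |             # byte-order sort so equal words are adjacent
    uniq -c |                   # prefix each distinct word with its count
    awk '{ print $2 "\t" $1 }'  # reorder to word, tab, count
}
```

For example, cat /home/hduser/Desktop/www/example.txt | local_wordcount gives a listing that can be set side by side with the output of bin/hdfs dfs -cat output/*.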
Some Commands to Play with Hadoop
To check the architecture of the system
[If the output is x86_64 the system is 64 bit, and if i386 then 32 bit]
$ uname -a | awk '{print $12}'
Checking version of Java and Hadoop
$ which java
$ hadoop version
$ hdfs version
Create the user's home directory in hdfs
$ cd /usr/local/hadoop/
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hduser
List the files in hdfs
$ bin/hdfs dfs -ls /
$ bin/hdfs dfs -ls /user
$ bin/hdfs dfs -ls /user/hduser
Copy files to hdfs from the local system
$ bin/hdfs dfs -put /home/hduser/data input
$ bin/hdfs dfs -ls /user/hduser/input
$ bin/hdfs dfs -ls -R /user/hduser
Run the job
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
## if we are in /usr/local/hadoop/ then this command also runs
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount input output
See the output
$ bin/hdfs dfs -cat output/*
$ bin/hdfs dfs -get /user/hduser/output /home/hduser/outputdata
$ cat /home/hduser/outputdata/part-r-00000
Remove the output directory
$ bin/hdfs dfs -rm -r output
Commands for starting and stopping individual daemon processes
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
$ sbin/stop-dfs.sh
$ sbin/stop-yarn.sh
Commands for starting and stopping all daemon processes
$ sbin/start-all.sh
$ sbin/stop-all.sh
Tracking which data block is in which data node in hadoop
$ hadoop fsck / -files -blocks -locations
$ hadoop fsck /user/hduser/input-wordcount -files -blocks -locations -racks
Check default block size
$ hdfs getconf -confKey dfs.blocksize
Leave NameNode Safe Mode
$ hadoop dfsadmin -safemode leave
$ hdfs dfsadmin -safemode leave
Check the DFS with hadoop fsck /
$ hadoop fsck /