
Building a Distributed Hadoop Cluster with HBase on Amazon EC2 from Scratch

If you want to run a distributed Hadoop cluster with HBase on AWS EC2, the easiest option is AWS EMR. But if you are like me and want to build the cluster from scratch, you are in the right place. There can be many reasons to build your own cluster; for me, it is to understand how the connections between master and slaves happen and to dig deeper into the system. You may also want to tweak some config or code that is not exposed in EMR and still run a production load.

Hadoop Prerequisites

First things first, let us launch 4 EC2 instances (1 master and 3 slaves). For simplicity, I am not using a bastion host, I have enabled public IPs, and I have restricted the security group to my IP only. As soon as the instances are ready, label them as master, slave1, slave2 and slave3. I have used the Ubuntu 18.04 AMI, and the following is my ~/.ssh/config file.
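
If you prefer the command line over the console, the instances can also be launched with the AWS CLI. The below is only an illustrative sketch: the AMI ID, instance type, key name, security group and subnet are placeholders you need to replace with your own values.

#OnLocal

# Launch 4 Ubuntu 18.04 instances in one go (placeholder IDs -- substitute your own)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t3.medium \
  --count 4 \
  --key-name key \
  --security-group-ids sg-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx \
  --associate-public-ip-address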

#OnLocal

Host master
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@ec2-x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x

Host slave1
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x

Host slave2
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x

Host slave3
  User ubuntu
  IdentityFile ~/.aws/key.pem
  ProxyCommand ssh -q -W %h:%p ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com -i ~/.aws/key.pem
  StrictHostKeyChecking no
  UserKnownHostsFile=/dev/null
  HostName 10.0.x.x

Remember to replace ubuntu@x-x-x-x.us-west-2.compute.amazonaws.com with the actual public DNS of each EC2 instance, and change the HostName entries to the instances' private IPv4 addresses.

Now, with ~/.ssh/config configured, I can simply ssh into a box using ssh master, ssh slave1 and so on.
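
With that in place, a quick loop from the local machine confirms that the ProxyCommand and keys work for every node; for example:

#OnLocal

# Should print the current hostname of each instance without asking for a password
for host in master slave1 slave2 slave3; do ssh $host hostname; done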

Let us update the packages before we proceed with the installation of Hadoop.

#OnAll

sudo apt update && sudo apt dist-upgrade -y

Next, set the static hostname for the master.

#OnMaster

sudo hostnamectl set-hostname --static master

Similarly, set the static hostnames for the slaves (run each command on the corresponding slave).

#OnSlaves

sudo hostnamectl set-hostname --static slave1
sudo hostnamectl set-hostname --static slave2
sudo hostnamectl set-hostname --static slave3

Now open /etc/cloud/cloud.cfg (sudo vim /etc/cloud/cloud.cfg) and set the below property so that the hostname is preserved across reboots.

#OnAll

preserve_hostname: true
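
If you would rather not edit the file by hand, the same change can be made non-interactively. This is only a sketch: it assumes your cloud.cfg already contains a preserve_hostname: false line; if the line is missing, simply append preserve_hostname: true to the file.

#OnAll

# Flip preserve_hostname from false to true in place, then confirm the change
sudo sed -i 's/^preserve_hostname: false/preserve_hostname: true/' /etc/cloud/cloud.cfg
grep preserve_hostname /etc/cloud/cloud.cfg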

Now update the /etc/hosts file with the private IPs of the master and slave instances. Here is the list of my IPs. Remember to remove the first line, 127.0.0.1 localhost, from the /etc/hosts file.

#OnAll

sudo vim /etc/hosts
10.0.6.80     master
10.0.6.174    slave1
10.0.6.252    slave2
10.0.6.35     slave3
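
Once the entries are in place on every node, it is worth confirming that each name resolves to the private IP you expect. getent reads /etc/hosts directly, so this is a purely local check:

#OnAll

# Each line should print the private IP followed by the hostname
for h in master slave1 slave2 slave3; do getent hosts $h; done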

Next, let’s install OpenJDK 8 on all the instances.

#OnAll

sudo apt install openjdk-8-jdk openjdk-8-jre -y

Now we will reboot the instances.

#OnAll

sudo reboot

Now let us enable password authentication and set a password on the instances. This is done to make it easier to copy keys and files between the master and slave nodes; we can always disable it later (a quick sketch for that is included after the commands below).

Now, make the below changes on all the nodes.

#OnAll

sudo vim /etc/ssh/sshd_config

# Set the below value in the file
PasswordAuthentication yes

Then restart sshd and set a password for the ubuntu user.

#OnAll

sudo service ssh restart
sudo passwd ubuntu
# Enter the password and remember it for future use
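
For reference, once the keys have been copied and everything works over key-based SSH, password authentication can be turned off again by reverting the same setting. A minimal sketch:

#OnAll

# Revert to key-only SSH once the cluster setup is done
sudo sed -i 's/^PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo service ssh restart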

Execute the next instructions only on the master node. We will generate an SSH key pair on the master and copy the public key to all the slaves.

#OnMaster

ssh-keygen -b 4096

ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave2
ssh-copy-id -i $HOME/.ssh/id_rsa.pub ubuntu@slave3

A note: if the ssh-copy-id command above hangs, take a look at the security group to make sure that you can ssh from the master to the slaves. One way to do this is to attach a common security group (SG) to all the instances and add that SG to itself as an inbound SSH rule.
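
If you manage the security group from the AWS CLI, the self-referencing SSH rule looks roughly like the below; sg-xxxxxxxx is a placeholder for the common SG attached to all four instances.

#OnLocal

# Allow SSH between all instances that share this security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol tcp \
  --port 22 \
  --source-group sg-xxxxxxxx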

Let’s install and set up Hadoop!

On Master Node,

Run the below commands to download and untar Hadoop, move it to /usr/local/hadoop, and give ownership of that directory to the ubuntu user.

#OnMaster

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo mkdir /usr/local/hadoop
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3/* /usr/local/hadoop/
ls -ltr /usr/local/hadoop/
sudo chown -R ubuntu:ubuntu /usr/local/hadoop

Now let us add some environment variables and paths to the ~/.bashrc file; having these on the PATH makes everything easier going forward. Append the below config to your ~/.bashrc file.

#OnMaster

# Java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

# Hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"

Once saved, run the command source ~/.bashrc to update it in the current session.

Now let us make the changes to the Hadoop configuration that enable the fully distributed mode. There is a long list of changes coming ahead.

First, let’s go into the configuration directory with cd /usr/local/hadoop/etc/hadoop.

  • Set the JAVA_HOME in hadoop-env.sh.
      #OnMaster
        
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    
  • Add the YARN config in yarn-site.xml.
      #OnMaster
       
      <property>
          <name>yarn.acl.enable</name>
          <value>0</value>
      </property>
      <property>
          <name>yarn.resourcemanager.hostname</name>
          <value>master</value>
      </property>
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
      </property>
      <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>2826</value>
      </property>
      <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>2726</value>
      </property>
      <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>128</value>
      </property>
      <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
      </property>
      <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>2</value>
      </property>
    

    Make sure that the above properties are added between <configuration> and </configuration>.

  • Add the below config to core-site.xml.
      #OnMaster
       
      <property>
          <name>fs.default.name</name>
          <value>hdfs://master:9000</value>
      </property>
    
  • Add HDFS config to hdfs-site.xml.
      #OnMaster
       
      <property>
          <name>dfs.replication</name>
          <value>3</value>
          <description>
              Default block replication.
              The actual number of replications can be specified when the file is created.
              The default is used if replication is not specified in create time.
          </description>
      </property>
      <property>
          <name>dfs.namenode.name.dir</name>
          <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
      </property>
      <property>
          <name>dfs.datanode.data.dir</name>
          <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
      </property>
    
  • Now, let us add Map-Reduce config to mapred-site.xml. But first, we need to copy the template. This can be done by
      #OnMaster
       
      cp mapred-site.xml.template mapred-site.xml
    

    Then open the file mapred-site.xml and add the below config

      #OnMaster
       
      <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
      </property>
      <property>
              <name>yarn.app.mapreduce.am.resource.mb</name>
              <value>1024</value>
      </property>
      <property>
              <name>mapreduce.map.memory.mb</name>
              <value>1024</value>
      </property>
      <property>
              <name>mapreduce.reduce.memory.mb</name>
              <value>1024</value>
      </property>
    

    If you are interested in understanding what each property does, please refer to the official Hadoop documentation.

  • Let us create the directories for the HDFS store.
      #OnMaster
       
      sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
      sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
      sudo chown -R ubuntu:ubuntu /usr/local/hadoop_store/
    
  • Now that we have made all the configuration changes we need on the master, let us make the changes on the slave nodes. But first, let us copy hadoop-2.7.3.tar.gz to the slave nodes.
      #OnMaster
       
      cd ~
      scp hadoop-2.7.3.tar.gz ubuntu@slave1:/home/ubuntu/
      scp hadoop-2.7.3.tar.gz ubuntu@slave2:/home/ubuntu/
      scp hadoop-2.7.3.tar.gz ubuntu@slave3:/home/ubuntu/
    
  • Once we have copied the files to slave nodes, it is time to make similar changes there as well.

    On Slaves,

      #OnSlaves
       
      sudo mkdir /usr/local/hadoop
      tar -xzf hadoop-2.7.3.tar.gz
      sudo mv hadoop-2.7.3/* /usr/local/hadoop/
      ls -ltr /usr/local/hadoop/
      sudo chown -R ubuntu:ubuntu /usr/local/hadoop
    
  • Now let us add the same paths to the ~/.bashrc file on the slaves as well.

      #OnSlaves
       
      # Java
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
      export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
       
      # Hadoop
      export HADOOP_INSTALL=/usr/local/hadoop
      export PATH=$PATH:$HADOOP_INSTALL/bin
      export PATH=$PATH:$HADOOP_INSTALL/sbin
      export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_HOME=$HADOOP_INSTALL
      export HADOOP_HDFS_HOME=$HADOOP_INSTALL
      export YARN_HOME=$HADOOP_INSTALL
      export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
    

    Once saved, run the command source ~/.bashrc to update it in the current session.

  • Let us create the directories for the HDFS store on the slaves as well.

      #OnSlaves
       
      sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
      sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
      sudo chown -R ubuntu:ubuntu /usr/local/hadoop_store/
    

    Now that we have configured slaves, let us move back to the master node.

    On Master,

  • Let us copy the Hadoop configuration changes we made to the slaves as well.

     #OnMaster
        
     scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave1:/usr/local/hadoop/etc/hadoop/
     scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave2:/usr/local/hadoop/etc/hadoop/
     scp /usr/local/hadoop/etc/hadoop/* ubuntu@slave3:/usr/local/hadoop/etc/hadoop/
    
  • Let us add the slave names to the /usr/local/hadoop/etc/hadoop/slaves file.
      #OnMaster
        
      slave1
      slave2
      slave3
    

    Now we are done with Hadoop Setup.

Let us start Hadoop and test it!

On Master again,

  • Let us format the namenode.

      #OnMaster
       
      hdfs namenode -format
    
  • Then we will start the DFS
      #OnMaster
       
      start-dfs.sh
    
  • To check if everything is configured properly, let us check jps
      #OnMaster
       
      jps
    
  • On the Master node, the output should be
      #OnMaster
       
      4448 SecondaryNameNode
      4572 Jps
      4175 NameNode
    
  • And on the slave nodes
      #OnSlaves
       
      2632 DataNode
      2713 Jps
    

    Of course, the PIDs will be different.

  • Now let us start YARN as well
      #OnMaster
       
      start-yarn.sh
    
  • Now when we run jps, we should see ResourceManager added to the list on the master and NodeManager on the slaves.

      #OnMaster
       
      4448 SecondaryNameNode
      4631 ResourceManager
      4895 Jps
      4175 NameNode
       
     #OnSlaves
       
     2832 NodeManager
     2632 DataNode
     2943 Jps
    

    You can also check that everything is running fine in the UIs. First, find the public IP of the master node. The DFS UI is on port 50070 and the YARN UI is on port 8088.

DFS: http://<master-public-ip>:50070
DFS Dashboard
YARN: http://<master-public-ip>:8088/cluster
YARN Dashboard
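
Besides the web UIs, you can confirm from the command line that all three datanodes and NodeManagers have registered with the master; for example:

#OnMaster

# Should report 3 live datanodes and list 3 node managers
hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'
yarn node -list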

Now we can run a small MapReduce job that counts the words in a text file. Execute the below commands on the master (of course) to run the job.

#OnMaster

cd ~
mkdir sample_data
cd sample_data/
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
cd ~
hadoop fs -copyFromLocal sample_data/alice.txt hdfs://master:9000/
hdfs dfs -ls hdfs://master:9000/
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount "hdfs://master:9000/alice.txt" hdfs://master:9000/output/

While the job is running, you can also visit the YARN UI to track it.
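
Once the job finishes, the counts are written to the output directory we passed on the command line. With the default single reducer the result lands in part-r-00000, so something like the below lets you peek at it:

#OnMaster

hdfs dfs -ls hdfs://master:9000/output/
hdfs dfs -cat hdfs://master:9000/output/part-r-00000 | head -20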

HBase

  • Let us download HBase, untar it and copy the files to the /usr/local/hbase directory. The official HBase downloads page lists the available mirrors, so feel free to replace the link below based on your location. I have tested this with HBase version 1.4.13.
      #OnMaster
      
      wget https://downloads.apache.org/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
      tar -zxvf hbase-1.4.13-bin.tar.gz
      sudo mv hbase-1.4.13 /usr/local/hbase
      sudo chown -R ubuntu:ubuntu /usr/local/hbase
    
  • Add the below commands to ~/.bashrc file
      #OnMaster
      
      export HBASE_HOME=/usr/local/hbase
      export PATH=$PATH:$HBASE_HOME/bin
    

    Then, as usual, source the file: source ~/.bashrc.

  • Update the HBase Config files
      #OnMaster
      
      cd /usr/local/hbase/conf/
    
  • Set the JAVA_HOME in hbase-env.sh file.
      #OnMaster
      
      export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    
  • Then, proceed to add the properties to hbase-site.xml file.
      #OnMaster
      
      <property>
          <name>hbase.rootdir</name>
          <value>hdfs://master:9000/hbase</value>
      </property>
      <property>
          <name>hbase.zookeeper.property.dataDir</name>
          <value>hdfs://master:9000/zookeeper</value>
      </property>
      <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
      </property>
      <property>
          <name>hbase.zookeeper.quorum</name>
          <value>slave1,slave2,slave3</value>
      </property>
    
  • Then add the slave names to the regionservers file.
      #OnMaster
      
      slave1
      slave2
      slave3
    
  • Now, on the slave nodes, create the /usr/local/hbase directory and give ownership to the ubuntu user so that we can copy the files over from the master. Make the following changes on the slaves.
      #OnSlaves
      
      sudo mkdir -p /usr/local/hbase
      sudo chown -R ubuntu:ubuntu /usr/local/hbase
    
  • Now, from the master again, we need to copy the HBase files to the slaves. Run the below commands to do that.
      #OnMaster
      
      scp -rp /usr/local/hbase/* ubuntu@slave1:/usr/local/hbase/
      scp -rp /usr/local/hbase/* ubuntu@slave2:/usr/local/hbase/
      scp -rp /usr/local/hbase/* ubuntu@slave3:/usr/local/hbase/
    
  • Now it is time to start HBase. Make sure that Hadoop is running before you start HBase.
      #OnMaster
      
      start-hbase.sh
    
  • When you run the jps command, you should see something like this:
      #OnMaster
      
      7616 NameNode
      7891 SecondaryNameNode
      8851 Jps
      8581 HMaster
      8056 ResourceManager
      
      #OnSlaves
      
      4741 DataNode
      5270 HQuorumPeer
      5614 Jps
      5438 HRegionServer
      4927 NodeManager
    

    HBase also has a UI where you can see the RegionServers’ information among other things.

HBase UI: http://<master-public-ip>:16010/master-status
HBase Dashboard
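
As a final smoke test, you can open the HBase shell on the master and push a value through the region servers; a quick session like the one below should succeed end to end.

#OnMaster

hbase shell

# Inside the HBase shell:
status 'simple'
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
exit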

Hadoop and HBase are now up and running in distributed mode with a replication factor of 3 on AWS EC2 instances!

References
  • https://medium.com/@zaman.nuces/setting-up-fully-distributed-hadoop-cluster-series-2-of-3-27b9831c25ae
  • https://aws.amazon.com/premiumsupport/knowledge-center/linux-static-hostname-rhel7-centos7/
  • https://www.digitalocean.com/community/questions/ssh-copy-id-not-working-permission-denied-publickey
  • https://aws.amazon.com/premiumsupport/knowledge-center/ec2-password-login/
  • https://medium.com/@yzhong.cs/hbase-installation-step-by-step-guide-cb73381a7a4c
  • https://unsplash.com/photos/gpjvRZyavZc
  • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html
  • https://hbase.apache.org/book.html#configuration