Components of Hadoop

Hadoop has several daemon programs:

  • Hadoop File System (HDFS)
    • NameNode
    • SecondaryNameNode
    • DataNode
  • YARN resource manager
    • ResourceManager
    • NodeManager
    • WebAppProxy
  • Map-Reduce
    • Job History Server

For large installations, these daemons generally run on separate hosts.

Setting Up Hadoop

We set up a cluster of virtual machines and install and configure Hadoop on the cluster. In this tutorial, we use Oracle VirtualBox and Debian Linux systems.

  1. Since we are using virtual machines, to save disk space and effort, we first create a base Linux virtual machine from which we can make clones.
    1. Create a virtual machine. Although we use virtual machines, the steps below are also applicable to physical machines.
    2. Set up Linux systems on the machine.
      • When setting up the Linux system, select a minimal installation without a GUI. This keeps the machine's footprint small (little RAM and disk).
      • The author uses the Debian Linux distribution and has tested the rest of this guide only on Debian Linux systems.
      • The author also installed the sudo package and added the user to the sudo group. To do this, open a terminal and run:
        su -c "apt install sudo"
        su -c "/sbin/usermod -aG sudo $(whoami)"
        

        After that, exit the terminal.

    3. To simplify security management, we run Hadoop under a dedicated user. In this guide, we run Hadoop under the user hadoop.
      # add the hadoop user
      sudo adduser hadoop
      
    4. We also run the rest of the setup as user hadoop. To become user hadoop and change to that user's home directory, run:
      sudo -i -u hadoop
      cd
      
    5. Install the necessary packages and download the Hadoop binary. As user hadoop, do the following:
      # ensure the Linux system is up-to-date
      sudo apt update && sudo apt upgrade -y
      # install JDK
      sudo apt install -y default-jdk
      # install the OpenSSH server and client
      sudo apt install openssh-server openssh-client -y
      # ensure openssh server is running
      sudo systemctl start ssh && sudo systemctl enable ssh
      # ensure shell access without password
      #   i.e., ensure ssh public/private key access between machines in the cluster 
      ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
      ssh-copy-id -i ~/.ssh/id_rsa $(whoami)@localhost
      # test the key-based login
      ssh -i ~/.ssh/id_rsa $(whoami)@localhost ls -a
      chmod 700 ~/.ssh
      chmod 600 ~/.ssh/authorized_keys
      # Download hadoop binary: https://hadoop.apache.org/releases.html
      wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
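      
      Before moving on, it helps to confirm that Java is installed and that passwordless SSH to localhost works. A quick sanity check (a sketch; the exact Java version output will differ):
      
      # confirm the JDK is installed
      java -version
      # should run the remote command without prompting for a password
      ssh $(whoami)@localhost true && echo "passwordless SSH OK"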
      
  2. Plan for cluster network
    1. The cluster consists of the following hosts:
      • master
      • worker1
      • worker2
      • worker3

      To resolve these names, we update the /etc/hosts file. In this tutorial, the content of the file after the update is:

      $ cat /etc/hosts
      127.0.0.1       localhost
      
      # The following lines are desirable for IPv6 capable hosts
      ::1     localhost ip6-localhost ip6-loopback
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters
      
      192.168.56.3 hdbase
      192.168.56.4 master
      192.168.56.5 worker1
      192.168.56.6 worker2
      192.168.56.7 worker3
      192.168.56.8 worker4
      192.168.56.9 worker5
      192.168.56.10 worker6
      192.168.56.11 worker7
      192.168.56.12 worker8
      $
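      
      After the update, a quick way to confirm that each planned hostname resolves to the intended address (a sketch; run it on any machine that has the updated file):
      
      # each line should print the planned IP address and hostname
      for h in master worker1 worker2 worker3; do
        getent hosts "$h"
      done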
      
  3. Configure for NameNode and DataNode
    1. Create setting files
      # make the HDFS directory for the NameNode - only needed on the NameNode host
      sudo mkdir -p /opt/hadoop_data/hdfs/namenode
      # make the HDFS directory for the DataNode - only needed on DataNode hosts
      sudo mkdir -p /opt/hadoop_data/hdfs/datanode
      # change ownership to hadoop
      sudo chown -R hadoop:hadoop /opt/hadoop_data/
      
      # load hadoop configuration
      cat >> ~/.bashrc <<END
      
      # load hadoop settings
      . ~/.hadoop_settings
      END
      

      Create the `~/.hadoop_settings` file with the following content:

      export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
      export HADOOP_HOME=/opt/hadoop
      export HADOOP_INSTALL=$HADOOP_HOME
      export YARN_HOME=$HADOOP_HOME
      export PATH=$PATH:$HADOOP_INSTALL/bin
      export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
      export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
      
    2. Install the Hadoop software
      # extract hadoop binary
      sudo tar -xzvf hadoop-3.4.2.tar.gz -C /opt
      sudo chown -R hadoop:hadoop /opt/hadoop-3.4.2
      sudo ln -s /opt/hadoop-3.4.2 /opt/hadoop
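      
      To confirm the installation and the environment settings, reload the settings and print the Hadoop version (a quick check; the exact output depends on the release):
      
      # reload the environment and confirm hadoop is on the PATH
      . ~/.hadoop_settings
      hadoop version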
      
    3. Edit /opt/hadoop/etc/hadoop/core-site.xml, replace the <configuration></configuration> with:
      <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000/</value>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000/</value>
        </property>
      </configuration>
      

      Here we configure the default file system URI, so the NameNode daemon runs at host master on port 9000 and clients know where to reach it.
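
      Once the file is saved, a quick way to confirm that Hadoop picks up the setting (assuming ~/.hadoop_settings has been sourced):

      # print the effective default file system URI; it should show hdfs://master:9000/
      hdfs getconf -confKey fs.defaultFS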

    4. Edit /opt/hadoop/etc/hadoop/hdfs-site.xml, replace the <configuration></configuration> with:
      <configuration>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/opt/hadoop_data/hdfs/datanode</value>
            <final>true</final>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/opt/hadoop_data/hdfs/namenode</value>
            <final>true</final>
        </property>
        <property>
            <name>dfs.namenode.http-address</name>
            <value>master:50070</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
      </configuration>
      

      Here we configure where the NameNode keeps its metadata and where each DataNode keeps its block data, set the NameNode web UI address to master:50070, and set the replication factor to 3.

    5. Edit /opt/hadoop/etc/hadoop/yarn-site.xml, replace the <configuration></configuration> with:
      <configuration>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>master:8025</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>master:8035</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address</name>
            <value>master:8050</value>
        </property>
      </configuration>
      

      We configure YARN daemons to run at the master host.

    6. Edit /opt/hadoop/etc/hadoop/mapred-site.xml, replace the <configuration></configuration> with:
      <configuration>
        <property>
            <name>mapreduce.job.tracker</name>
            <value>master:5431</value>
        </property>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
      </configuration>
      
    7. Edit /opt/hadoop/etc/hadoop/hadoop-env.sh, and ensure you have the following in the file:
      export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
      
    8. Edit /opt/hadoop/etc/hadoop/workers so that it is empty (worker hostnames will be added later as DataNodes are brought up), e.g.:
      $ > /opt/hadoop/etc/hadoop/workers
      $ cat /opt/hadoop/etc/hadoop/workers
      $
      
    9. Allocate minimal RAM to the base virtual machine, e.g., reduce it to 256 MB, so that multiple virtual machines can run simultaneously.
  4. Create master node (running NameNode)
    1. If using virtual machines, create a linked clone of the base machine
    2. Change the hostname to master by updating the /etc/hostname file. The content of the file after the update:
      $ cat /etc/hostname
      master
      $
      

      Also ensure the IP address matches the entry for master in the /etc/hosts file. This can be achieved by updating the /etc/network/interfaces file. After the update, the content is:

      $ cat /etc/network/interfaces
      source /etc/network/interfaces.d/*
      
      # The loopback network interface
      auto lo
      iface lo inet loopback
      
      # The primary network interface
      allow-hotplug enp0s3
      iface enp0s3 inet dhcp
      # This is an autoconfigured IPv6 interface
      iface enp0s3 inet6 auto
      
      allow-hotplug enp0s8
      iface enp0s8 inet static
      address 192.168.56.4
      netmask 255.255.255.0
      network 192.168.56.0
      broadcast 192.168.56.255
      $
      

      Reboot the machine after these changes.
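
      After the reboot, it is worth confirming that the new hostname and the static address took effect (a sketch; enp0s8 is the host-only adapter name used above and may differ on your system):

      # confirm the hostname and the static IP on the host-only interface
      hostname
      ip -brief addr show enp0s8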

    3. Make sure you are working as user hadoop:
      sudo -i -u hadoop
      
    4. Format HDFS on the NameNode:
      hdfs namenode -format
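      
      Formatting populates the NameNode metadata directory configured in hdfs-site.xml. A quick check that it succeeded (a sketch):
      
      # the directory should now contain VERSION, fsimage_* and seen_txid files
      ls /opt/hadoop_data/hdfs/namenode/current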
      
    5. Start the HDFS daemons:
      /opt/hadoop/sbin/start-dfs.sh
      
    6. To check, run jps, e.g.,
      $ jps
      1121 SecondaryNameNode
      977 NameNode
      1228 Jps
      $
      

      which shows that both the NameNode and SecondaryNameNode daemons are running.

    7. To check, also run ss -tlpn, e.g.,
      $ ss -tlpn
      ...                                       Port                 Process
      ...
      tcp  LISTEN  0  256   192.168.56.4:9000    0.0.0.0:*   users:(("java",pid=977,fd=329))
      tcp  LISTEN  0  500   192.168.56.4:50070   0.0.0.0:*   users:(("java",pid=977,fd=323))
      tcp  LISTEN  0  500        0.0.0.0:9868    0.0.0.0:*   users:(("java",pid=1121,fd=324))
      ...
      $
      

      Correlating with the output of jps, we can see that the NameNode is listening on ports 9000 and 50070, while the SecondaryNameNode is listening on port 9868. Port 50070 is the NameNode web UI port configured in hdfs-site.xml, while 9868 is the default web UI port for the SecondaryNameNode.

      You can use a Web browser to browse these two ports, e.g., open the URLs:

      • http://192.168.56.4:50070
      • http://192.168.56.4:9868
    8. You can also run hdfs dfsadmin -report to check the status:
      Configured Capacity: 0 (0 B)
      Present Capacity: 0 (0 B)
      DFS Remaining: 0 (0 B)
      DFS Used: 0 (0 B)
      DFS Used%: 0.00%
      Replicated Blocks:
            Under replicated blocks: 0
            Blocks with corrupt replicas: 0
            Missing blocks: 0
            Missing blocks (with replication factor 1): 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
      Erasure Coded Block Groups:
            Low redundancy block groups: 0
            Block groups with corrupt internal blocks: 0
            Missing block groups: 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
      

      Since we have not configured any DataNodes yet, the capacity is 0.

  5. Create and Add a DataNode
    1. Acquire a machine: create a linked clone of the base virtual machine
    2. Set up IP network
      1. Update the hostname to worker1 in /etc/hostname
      2. Change the IP address to 192.168.56.5 (per the plan above) by editing /etc/network/interfaces
    3. Since it is a linked clone, there isn’t a need to install any additional software
    4. Switch to the planned hadoop user
      sudo -i -u hadoop
      
    5. Start the DataNode daemon:
      /opt/hadoop/sbin/hadoop-daemon.sh start datanode
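      
      On Hadoop 3.x this script still works but prints a deprecation warning; if you prefer, the equivalent newer form is:
      
      # same effect as hadoop-daemon.sh start datanode on Hadoop 3.x
      hdfs --daemon start datanode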
      
    6. To check the health of the DataNode daemon:
      1. At worker1, run jps, e.g.,
        $ jps
        837 DataNode
        919 Jps
        $
        
      2. At the master, query the DFS report, e.g.,
        $ hdfs dfsadmin -report
        Configured Capacity: 7853862912 (7.31 GB)
        Present Capacity: 2349953024 (2.19 GB)
        DFS Remaining: 2349928448 (2.19 GB)
        DFS Used: 24576 (24 KB)
        DFS Used%: 0.00%
        Replicated Blocks:
              Under replicated blocks: 0
              Blocks with corrupt replicas: 0
              Missing blocks: 0
              Missing blocks (with replication factor 1): 0
              Low redundancy blocks with highest priority to recover: 0
              Pending deletion blocks: 0
        Erasure Coded Block Groups:
              Low redundancy block groups: 0
              Block groups with corrupt internal blocks: 0
              Missing block groups: 0
              Low redundancy blocks with highest priority to recover: 0
              Pending deletion blocks: 0
        
        Live datanodes (1):
        
        Name: 192.168.56.5:9866 (worker1)
        Hostname: worker1
        Decommission Status : Normal
        Configured Capacity: 7853862912 (7.31 GB)
        DFS Used: 24576 (24 KB)
        Non DFS Used: 5082959872 (4.73 GB)
        DFS Remaining: 2349928448 (2.19 GB)
        DFS Used%: 0.00%
        DFS Remaining%: 29.92%
        Configured Cache Capacity: 0 (0 B)
        Cache Used: 0 (0 B)
        Cache Remaining: 0 (0 B)
        Cache Used%: 100.00%
        Cache Remaining%: 0.00%
        Xceivers: 0
        Last contact: Tue Nov 18 12:56:43 EST 2025
        Last Block Report: Tue Nov 18 12:56:30 EST 2025
        Num of Blocks: 0
        
        $
        

        which shows a DataNode process coming online.

    7. To make it easy to start and stop the DataNode processes on all worker nodes from the master, add worker1 to the Hadoop configuration file /opt/hadoop/etc/hadoop/workers. The content of the file should then look as follows:
      $ cat /opt/hadoop/etc/hadoop/workers
      worker1
      $
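      
      With the workers file populated, the helper scripts on the master manage the DataNodes over SSH, so the whole file system can be restarted from one place; for example (a sketch, run as user hadoop on master):
      
      # stop and restart HDFS across the cluster from the master
      /opt/hadoop/sbin/stop-dfs.sh
      /opt/hadoop/sbin/start-dfs.sh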
      
    8. Repeat the process for the rest of the worker nodes (worker2 and worker3).
    9. After finishing adding the three worker nodes, we can view the DFS report:
      $ hdfs dfsadmin -report
      Configured Capacity: 23561588736 (21.94 GB)
      Present Capacity: 7049863168 (6.57 GB)
      DFS Remaining: 7049781248 (6.57 GB)
      DFS Used: 81920 (80 KB)
      DFS Used%: 0.00%
      Replicated Blocks:
            Under replicated blocks: 0
            Blocks with corrupt replicas: 0
            Missing blocks: 0
            Missing blocks (with replication factor 1): 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
      Erasure Coded Block Groups:
            Low redundancy block groups: 0
            Block groups with corrupt internal blocks: 0
            Missing block groups: 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
      
      Live datanodes (3):
      
      Name: 192.168.56.5:9866 (worker1)
      Hostname: worker1
      Decommission Status : Normal
      Configured Capacity: 7853862912 (7.31 GB)
      DFS Used: 28672 (28 KB)
      Non DFS Used: 5082955776 (4.73 GB)
      DFS Remaining: 2349928448 (2.19 GB)
      DFS Used%: 0.00%
      DFS Remaining%: 29.92%
      Configured Cache Capacity: 0 (0 B)
      Cache Used: 0 (0 B)
      Cache Remaining: 0 (0 B)
      Cache Used%: 100.00%
      Cache Remaining%: 0.00%
      Xceivers: 0
      Last contact: Tue Nov 18 13:23:26 EST 2025
      Last Block Report: Tue Nov 18 12:56:30 EST 2025
      Num of Blocks: 0
      
      Name: 192.168.56.6:9866 (worker2)
      Hostname: worker2
      Decommission Status : Normal
      Configured Capacity: 7853862912 (7.31 GB)
      DFS Used: 28672 (28 KB)
      Non DFS Used: 5082959872 (4.73 GB)
      DFS Remaining: 2349924352 (2.19 GB)
      DFS Used%: 0.00%
      DFS Remaining%: 29.92%
      Configured Cache Capacity: 0 (0 B)
      Cache Used: 0 (0 B)
      Cache Remaining: 0 (0 B)
      Cache Used%: 100.00%
      Cache Remaining%: 0.00%
      Xceivers: 0
      Last contact: Tue Nov 18 13:23:25 EST 2025
      Last Block Report: Tue Nov 18 13:13:01 EST 2025
      Num of Blocks: 0
      
      Name: 192.168.56.7:9866 (worker3)
      Hostname: worker3
      Decommission Status : Normal
      Configured Capacity: 7853862912 (7.31 GB)
      DFS Used: 24576 (24 KB)
      Non DFS Used: 5082959872 (4.73 GB)
      DFS Remaining: 2349928448 (2.19 GB)
      DFS Used%: 0.00%
      DFS Remaining%: 29.92%
      Configured Cache Capacity: 0 (0 B)
      Cache Used: 0 (0 B)
      Cache Remaining: 0 (0 B)
      Cache Used%: 100.00%
      Cache Remaining%: 0.00%
      Xceivers: 0
      Last contact: Tue Nov 18 13:23:25 EST 2025
      Last Block Report: Tue Nov 18 13:19:34 EST 2025
      Num of Blocks: 0
      
      $

  6. Using Hadoop tools
    1. Using the dfs tool
      1. Create a directory on the DFS
        hdfs dfs -mkdir /demo
        
      2. Add a file to the DFS
        echo "Hello, World!" > helloworld.txt
        hdfs dfs -put helloworld.txt /demo/helloworld.txt
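        
        To confirm the upload, list the directory and read the file back; the cat output should be the "Hello, World!" line written above:
        
        # list the directory and read the file back from HDFS
        hdfs dfs -ls /demo
        hdfs dfs -cat /demo/helloworld.txt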
        
    2. Using the fsck tool
      1. Query file block locations
        hdfs fsck /demo/helloworld.txt -files -blocks -locations
        
  7. Running MapReduce tasks
    yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /demo /output
    yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 3 300000000
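    
    Note that these example jobs need the YARN daemons, which the steps above configure but never start. A minimal sketch, run as user hadoop on the master (the NodeManagers on the workers are started over SSH using the workers file):
    
    # start the ResourceManager here and the NodeManagers on the hosts listed in the workers file
    /opt/hadoop/sbin/start-yarn.sh
    # ResourceManager should now appear in jps output on the master
    jps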
    
