Hadoop: Setting Up and Experimenting
Components of Hadoop
Hadoop has several daemon programs:
- Hadoop File System (HDFS)
  - NameNode
  - SecondaryNameNode
  - DataNode
- YARN resource manager
  - ResourceManager
  - NodeManager
  - WebAppProxy
- MapReduce
  - Job History Server
For large installations, these daemon programs are generally arranged to run on separate hosts.
Setting Up Hadoop
We set up a cluster of virtual machines and install and configure Hadoop on the cluster. In this tutorial, we use Oracle VirtualBox and Debian Linux systems.
- Since we are using virtual machines, to save disk space and effort, let’s create a Linux system on a virtual machine from which we can make clones.
- Create a virtual machine. Although we use virtual machines, the steps below are also applicable to physical machines.
- Set up a Linux system on the machine.
- When setting up the Linux system, select a minimal installation without a GUI. In this way, we can configure machines with a small footprint (small RAM and disk).
- The author uses the Debian Linux distribution and has tested the rest of this guide only on Debian Linux systems.
- The author also installed the `sudo` package and added the user to the `sudo` group. To achieve this, open a terminal and run:

```
su -c "apt install sudo"
su -c "/sbin/usermod -aG sudo $(whoami)"
```

After that, exit the terminal.
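As an optional check (not part of the original steps), after logging back in, the `groups` command should now list `sudo` for the user:

```
# verify the current user's group membership; the output should include "sudo"
groups
```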
- To make securing the system easier, we run Hadoop under a dedicated user. In this guide, we run Hadoop under the user `hadoop`:

```
# add the hadoop user
sudo adduser hadoop
# allow the hadoop user to use sudo (needed for the installation steps below)
sudo usermod -aG sudo hadoop
```

- We also run the rest of the setup as user `hadoop`. To become user `hadoop` and go to its home directory, we run:

```
sudo -i -u hadoop
cd
```

- Install the necessary packages and download the Hadoop binary. As user `hadoop`, do the following:

```
# ensure the Linux system is up-to-date
sudo apt update && sudo apt upgrade -y
# install the JDK
sudo apt install -y default-jdk
# install the OpenSSH server and client
sudo apt install openssh-server openssh-client -y
# ensure the OpenSSH server is running
sudo systemctl start ssh && sudo systemctl enable ssh
# ensure shell access without a password, i.e., ssh public/private key
# access between machines in the cluster
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa $(whoami)@localhost
ssh -i ~/.ssh/id_rsa $(whoami)@localhost ls -a
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# download the Hadoop binary: https://hadoop.apache.org/releases.html
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
```
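As an optional sanity check (not part of the original steps), the following commands, run as user `hadoop`, confirm that the JDK is installed and that key-based SSH to localhost works without a password prompt:

```
# print the installed Java version
java -version
# BatchMode makes ssh fail instead of prompting; it should print the hostname
ssh -o BatchMode=yes localhost hostname
```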
- Plan for cluster network
- The cluster consists of the following hosts:
  - master
  - worker1
  - worker2
  - worker3
For that, we update the `/etc/hosts` file; for instance, in this tutorial, the content of the file after the update will be:

```
$ cat /etc/hosts
127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.56.3    hdbase
192.168.56.4    master
192.168.56.5    worker1
192.168.56.6    worker2
192.168.56.7    worker3
192.168.56.8    worker4
192.168.56.9    worker5
192.168.56.10   worker6
192.168.56.11   worker7
192.168.56.12   worker8
$
```
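As an optional sanity check of the network plan (assuming the `/etc/hosts` entries above and that the corresponding machines are up), each planned host name can be pinged from any cluster member:

```
# ping each planned cluster host once; machines not yet created simply time out
for h in master worker1 worker2 worker3; do
  ping -c 1 -W 1 "$h" >/dev/null && echo "$h reachable" || echo "$h not reachable (yet)"
done
```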
- Configure for NameNode and DataNode
- Create setting files
```
# make the HDFS directory for the NameNode - only needed on the NameNode
sudo mkdir -p /opt/hadoop_data/hdfs/namenode
# make the HDFS directory for the DataNode - only needed on a DataNode
sudo mkdir -p /opt/hadoop_data/hdfs/datanode
# change ownership to hadoop
sudo chown -R hadoop:hadoop /opt/hadoop_data/
# load the hadoop configuration from ~/.bashrc
cat >> ~/.bashrc <<END
# load hadoop settings
. ~/.hadoop_settings
END
```

Create the `~/.hadoop_settings` file with the following content:
```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```

- Install the NameNode software
```
# extract the hadoop binary
sudo tar -xzvf hadoop-3.4.2.tar.gz -C /opt
sudo chown -R hadoop:hadoop /opt/hadoop-3.4.2
sudo ln -s /opt/hadoop-3.4.2 /opt/hadoop
```
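An optional check that the environment and the installation line up: reload the shell settings and ask Hadoop for its version (assuming `~/.hadoop_settings` was created as above):

```
# reload the shell configuration so PATH and HADOOP_HOME take effect
. ~/.bashrc
# print the installed Hadoop version; it should report 3.4.2
hadoop version
```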
- Edit `/opt/hadoop/etc/hadoop/core-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
```

Here we configure Hadoop to run the NameNode daemon at host master.
- Edit `/opt/hadoop/etc/hadoop/hdfs-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_data/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_data/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Here, we configure where the NameNode metadata and the DataNode data are stored. In addition, we configure a replication factor of 3.
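As an optional check, `hdfs getconf` reads these settings back from the configuration files (no daemon needs to be running yet):

```
# should print 3
hdfs getconf -confKey dfs.replication
# should print hdfs://master:9000/
hdfs getconf -confKey fs.defaultFS
```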
- Edit `/opt/hadoop/etc/hadoop/yarn-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
```

We configure the YARN daemons to run at the master host.
- Edit `/opt/hadoop/etc/hadoop/mapred-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
- Edit `/opt/hadoop/etc/hadoop/hadoop-env.sh`, and ensure you have the following in the file:

```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
```
- Edit `/opt/hadoop/etc/hadoop/workers` so that it becomes empty, e.g.,

```
$ > /opt/hadoop/etc/hadoop/workers
$ cat /opt/hadoop/etc/hadoop/workers
$
```

- Ensure the base virtual machine is allocated minimal RAM, e.g., change it to 384 MB. This will allow us to run multiple virtual machines simultaneously. For the master node, you might need more RAM.
- Create the master node (running `NameNode`)
- If implementing with a virtual machine, create a linked clone of the base machine
- Change the hostname to `master`. This requires updating the `/etc/hostname` file. The content of the file after the update:

```
$ cat /etc/hostname
master
$
```

Also ensure the IP address matches the `/etc/hosts` file. This can be achieved by updating the `/etc/network/interfaces` file. After the update, we can view the content:

```
$ cat /etc/network/interfaces
source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug enp0s3
iface enp0s3 inet dhcp
# This is an autoconfigured IPv6 interface
iface enp0s3 inet6 auto

allow-hotplug enp0s8
iface enp0s8 inet static
    address 192.168.56.4
    netmask 255.255.255.0
    network 192.168.56.0
    broadcast 192.168.56.255
$
```

Reboot the machine after these changes.
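After the reboot, an optional quick check confirms the new identity and address took effect (using `enp0s8` as the host-only interface, as in the interfaces file above):

```
# should print "master"
hostname
# should show 192.168.56.4 on the host-only interface
ip -4 addr show enp0s8
```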
- Ensure you are logged in as user `hadoop`:

```
sudo -i -u hadoop
```

- Prepare HDFS at the NameNode:

```
hdfs namenode -format
```

- Start the HDFS daemons:

```
/opt/hadoop/sbin/start-dfs.sh
```
- To check, run `jps`, e.g.,

```
$ jps
1121 SecondaryNameNode
977 NameNode
1228 Jps
$
```

which shows that both the `NameNode` and `SecondaryNameNode` daemons are running.
- To check, also run `ss -tlpn`, e.g.,

```
$ ss -tlpn
...                              Port                  Process
...
tcp   LISTEN  0  256  192.168.56.4:9000   0.0.0.0:*  users:(("java",pid=977,fd=329))
tcp   LISTEN  0  500  192.168.56.4:50070  0.0.0.0:*  users:(("java",pid=977,fd=323))
tcp   LISTEN  0  500       0.0.0.0:9868   0.0.0.0:*  users:(("java",pid=1121,fd=324))
$
```

Correlating this with the output of `jps`, we can see that `NameNode` is listening on ports 9000 and 50070 while `SecondaryNameNode` is listening on port 9868. TCP port 50070 is the NameNode Web UI port we configured in `hdfs-site.xml`, while 9868 is the default Web UI port for `SecondaryNameNode`. You can use a Web browser to browse these two ports, e.g., open the URLs `http://192.168.56.4:50070` and `http://192.168.56.4:9868`.
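If no graphical browser is available (likely, given the minimal installation), an optional equivalent check from the command line uses `wget`, which is already installed:

```
# request the NameNode Web UI and show the HTTP response headers
wget --spider -S http://192.168.56.4:50070/ 2>&1 | head -n 5
# request the SecondaryNameNode Web UI
wget --spider -S http://192.168.56.4:9868/ 2>&1 | head -n 5
```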
- You can also run `hdfs dfsadmin -report` to check the status:

```
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
```

Since we have not configured any DataNode yet, the capacity should be 0.
- Create and Add a DataNode
- Acquire a machine: create a virtual machine by making a linked clone of the base virtual machine
- Set up the IP network
  - Update the hostname to `worker1`
  - Change the IP address by editing `/etc/network/interfaces`
- Since it is a linked clone, there is no need to install any additional software.
- Switch to the planned `hadoop` user:

```
sudo -i -u hadoop
```

- Start the `DataNode` daemon:

```
/opt/hadoop/sbin/hadoop-daemon.sh start datanode
```
- To check the health of the `DataNode` daemon:
  - At `worker1`, run `jps`, e.g.,

```
$ jps
837 DataNode
919 Jps
$
```

  - At `master`, query the DFS report, e.g.,

```
$ hdfs dfsadmin -report
Configured Capacity: 7853862912 (7.31 GB)
Present Capacity: 2349953024 (2.19 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 192.168.56.5:9866 (worker1)
Hostname: worker1
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 12:56:43 EST 2025
Last Block Report: Tue Nov 18 12:56:30 EST 2025
Num of Blocks: 0
$
```

which shows a `DataNode` process coming online.
- To allow restarting all `DataNode` processes on all worker nodes easily, add `worker1` to the Hadoop configuration file `/opt/hadoop/etc/hadoop/workers`, e.g., the content of the file should look as follows:

```
$ cat /opt/hadoop/etc/hadoop/workers
worker1
$
```

- We shall repeat the process for the rest of the worker nodes.
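With the workers file populated, the stock scripts on `master` can stop and start HDFS across the whole cluster; this is an optional alternative to starting each DataNode by hand and relies on the passwordless SSH set up earlier:

```
# stop and restart the HDFS daemons cluster-wide from master;
# start-dfs.sh reads /opt/hadoop/etc/hadoop/workers and starts a DataNode on each listed host
/opt/hadoop/sbin/stop-dfs.sh
/opt/hadoop/sbin/start-dfs.sh
```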
- After adding the three worker nodes, we can view the DFS report:
```
$ hdfs dfsadmin -report
Configured Capacity: 23561588736 (21.94 GB)
Present Capacity: 7049863168 (6.57 GB)
DFS Remaining: 7049781248 (6.57 GB)
DFS Used: 81920 (80 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.56.5:9866 (worker1)
Hostname: worker1
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5082955776 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:26 EST 2025
Last Block Report: Tue Nov 18 12:56:30 EST 2025
Num of Blocks: 0


Name: 192.168.56.6:9866 (worker2)
Hostname: worker2
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349924352 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:25 EST 2025
Last Block Report: Tue Nov 18 13:13:01 EST 2025
Num of Blocks: 0


Name: 192.168.56.7:9866 (worker3)
Hostname: worker3
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:25 EST 2025
Last Block Report: Tue Nov 18 13:19:34 EST 2025
Num of Blocks: 0

$
```
- Using Hadoop tools
- Using the `dfs` tool
  - Create a directory on the DFS:

```
hdfs dfs -mkdir /demo
```

  - Add a file on the DFS:

```
echo "Hello, World!" > helloworld.txt
hdfs dfs -put helloworld.txt /demo/helloworld.txt
```
- Using the `fsck` tool
  - Query file block locations:

```
hdfs fsck /demo/helloworld.txt -files -blocks -locations
```
- Running MapReduce tasks
```
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /demo /output
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 3 300000000
```
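These example jobs assume the YARN daemons are running (a ResourceManager on `master` and a NodeManager on each worker). A minimal sketch of starting them with the stock scripts and reading the word count results back, assuming the wordcount example writes its output to the standard `part-r-00000` file under `/output`:

```
# start YARN across the cluster from master (uses the workers file, like start-dfs.sh)
/opt/hadoop/sbin/start-yarn.sh
# list the NodeManagers that registered with the ResourceManager
yarn node -list
# after the wordcount job finishes, list and read its output from HDFS
# (part-r-00000 is the usual reducer output name; adjust if your output differs)
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
```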
Reference
- “Hadoop Cluster Setup”, https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html, retrieved November 1, 2025.