Hadoop: Setting Up and Experimenting
Components of Hadoop
Hadoop has several daemon programs:
- Hadoop File System (HDFS)
  - NameNode
  - SecondaryNameNode
  - DataNode
- YARN resource manager
  - ResourceManager
  - NodeManager
  - WebAppProxy
- MapReduce
  - Job History Server
For large installations, these daemon programs are generally arranged to run on separate hosts.
Setting Up Hadoop
We set up a cluster of virtual machines and install and configure Hadoop on the cluster. In this tutorial, we use Oracle VirtualBox and Debian Linux systems.
- Since we are using virtual machines, to save disk space and effort, let’s create a Linux system on a virtual machine from which we can make clones.
- Create a virtual machine. Although we use virtual machines, the steps below are also applicable to physical machines.
- Set up a Linux system on the machine.
- When setting up the Linux system, select a minimal installation without a GUI. In this way, we can configure machines with a small footprint (small RAM and disk).
- The author uses the Debian Linux distribution and has tested the rest of this guide only on Debian Linux systems.
- The author also installed the `sudo` package and added the user to the `sudo` group. To achieve this, open a terminal and run:

```
su -c "apt install sudo"
su -c "/sbin/usermod -aG sudo $(whoami)"
```

After that, exit the terminal.
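As an optional check (not part of the original steps), after logging back in, the `groups` command should now list `sudo` for the user:

```
# verify the current user's group membership; the output should include "sudo"
groups
```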
- To make securing the system easier, we run Hadoop under a dedicated user. In this guide, we run Hadoop under the user `hadoop`:

```
# add the hadoop user
sudo adduser hadoop
# allow the hadoop user to use sudo (needed for the installation steps below)
sudo usermod -aG sudo hadoop
```

- We also run the rest of the setup as user `hadoop`. To become user `hadoop` and go to its home directory, we run:

```
sudo -i -u hadoop
cd
```

- Install the necessary packages and download the Hadoop binary. As user `hadoop`, do the following:

```
# ensure the Linux system is up-to-date
sudo apt update && sudo apt upgrade -y
# install the JDK
sudo apt install -y default-jdk
# install the OpenSSH server and client
sudo apt install openssh-server openssh-client -y
# ensure the OpenSSH server is running
sudo systemctl start ssh && sudo systemctl enable ssh
# ensure shell access without a password, i.e., ssh public/private key
# access between machines in the cluster
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa $(whoami)@localhost
ssh -i ~/.ssh/id_rsa $(whoami)@localhost ls -a
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# download the Hadoop binary: https://hadoop.apache.org/releases.html
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
```
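As an optional sanity check (not part of the original steps), the following commands, run as user `hadoop`, confirm that the JDK is installed and that key-based SSH to localhost works without a password prompt:

```
# print the installed Java version
java -version
# BatchMode makes ssh fail instead of prompting; it should print the hostname
ssh -o BatchMode=yes localhost hostname
```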
- Plan for cluster network
- The cluster consists of the following hosts:
  - master
  - worker1
  - worker2
  - worker3
For that, we update the `/etc/hosts` file; for instance, in this tutorial, the content of the file after the update will be:

```
$ cat /etc/hosts
127.0.0.1       localhost

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.56.3    hdbase
192.168.56.4    master
192.168.56.5    worker1
192.168.56.6    worker2
192.168.56.7    worker3
192.168.56.8    worker4
192.168.56.9    worker5
192.168.56.10   worker6
192.168.56.11   worker7
192.168.56.12   worker8
$
```
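As an optional sanity check of the network plan (assuming the `/etc/hosts` entries above and that the corresponding machines are up), each planned host name can be pinged from any cluster member:

```
# ping each planned cluster host once; machines not yet created simply time out
for h in master worker1 worker2 worker3; do
  ping -c 1 -W 1 "$h" >/dev/null && echo "$h reachable" || echo "$h not reachable (yet)"
done
```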
- Configure for NameNode and DataNode
- Create setting files
```
# make the HDFS directory for the NameNode - only needed on the NameNode
sudo mkdir -p /opt/hadoop_data/hdfs/namenode
# make the HDFS directory for the DataNode - only needed on a DataNode
sudo mkdir -p /opt/hadoop_data/hdfs/datanode
# change ownership to hadoop
sudo chown -R hadoop:hadoop /opt/hadoop_data/
# load the hadoop configuration from ~/.bashrc
cat >> ~/.bashrc <<END
# load hadoop settings
. ~/.hadoop_settings
END
```

Create the `~/.hadoop_settings` file with the following content:
```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```

- Install the NameNode software
```
# extract the hadoop binary
sudo tar -xzvf hadoop-3.4.2.tar.gz -C /opt
sudo chown -R hadoop:hadoop /opt/hadoop-3.4.2
sudo ln -s /opt/hadoop-3.4.2 /opt/hadoop
```
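An optional check that the environment and the installation line up: reload the shell settings and ask Hadoop for its version (assuming `~/.hadoop_settings` was created as above):

```
# reload the shell configuration so PATH and HADOOP_HOME take effect
. ~/.bashrc
# print the installed Hadoop version; it should report 3.4.2
hadoop version
```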
- Edit `/opt/hadoop/etc/hadoop/core-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
  </property>
</configuration>
```

Here we configure Hadoop to run the NameNode daemon at host master.
- Edit `/opt/hadoop/etc/hadoop/hdfs-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop_data/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop_data/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Here, we configure where the NameNode metadata and the DataNode data are stored. In addition, we configure a replication factor of 3.
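As an optional check, `hdfs getconf` reads these settings back from the configuration files (no daemon needs to be running yet):

```
# should print 3
hdfs getconf -confKey dfs.replication
# should print hdfs://master:9000/
hdfs getconf -confKey fs.defaultFS
```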
- Edit `/opt/hadoop/etc/hadoop/yarn-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
  </property>
</configuration>
```

We configure the YARN daemons to run at the master host.
- Edit `/opt/hadoop/etc/hadoop/mapred-site.xml`, replacing the `<configuration></configuration>` element with:

```
<configuration>
  <property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
- Edit `/opt/hadoop/etc/hadoop/hadoop-env.sh`, and ensure you have the following in the file:

```
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
```
- Edit `/opt/hadoop/etc/hadoop/workers` so that it becomes empty, e.g.,

```
$ > /opt/hadoop/etc/hadoop/workers
$ cat /opt/hadoop/etc/hadoop/workers
$
```

- Ensure the base virtual machine is allocated minimal RAM, e.g., change it to 384 MB. This will allow us to run multiple virtual machines simultaneously. For the master node, you might need more RAM.
- Create the master node (running `NameNode`)
- If implementing with a virtual machine, create a linked clone of the base machine
- Change the hostname to `master`. This requires updating the `/etc/hostname` file. The content of the file after the update:

```
$ cat /etc/hostname
master
$
```

Also ensure the IP address matches the `/etc/hosts` file. This can be achieved by updating the `/etc/network/interfaces` file. After the update, we can view the content:

```
$ cat /etc/network/interfaces
source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug enp0s3
iface enp0s3 inet dhcp
# This is an autoconfigured IPv6 interface
iface enp0s3 inet6 auto

allow-hotplug enp0s8
iface enp0s8 inet static
    address 192.168.56.4
    netmask 255.255.255.0
    network 192.168.56.0
    broadcast 192.168.56.255
$
```

Reboot the machine after these changes.
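After the reboot, an optional quick check confirms the new identity and address took effect (using `enp0s8` as the host-only interface, as in the interfaces file above):

```
# should print "master"
hostname
# should show 192.168.56.4 on the host-only interface
ip -4 addr show enp0s8
```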
- Ensure you are logged in as user `hadoop`:

```
sudo -i -u hadoop
```

- Prepare HDFS at the NameNode:

```
hdfs namenode -format
```

- Start the HDFS daemons:

```
/opt/hadoop/sbin/start-dfs.sh
```
- To check, run `jps`, e.g.,

```
$ jps
1121 SecondaryNameNode
977 NameNode
1228 Jps
$
```

which shows that both the `NameNode` and `SecondaryNameNode` daemons are running.
- To check, also run `ss -tlpn`, e.g.,

```
$ ss -tlpn
...                              Port                  Process
...
tcp   LISTEN  0  256  192.168.56.4:9000   0.0.0.0:*  users:(("java",pid=977,fd=329))
tcp   LISTEN  0  500  192.168.56.4:50070  0.0.0.0:*  users:(("java",pid=977,fd=323))
tcp   LISTEN  0  500       0.0.0.0:9868   0.0.0.0:*  users:(("java",pid=1121,fd=324))
$
```

Correlating this with the output of `jps`, we can see that `NameNode` is listening on ports 9000 and 50070 while `SecondaryNameNode` is listening on port 9868. TCP port 50070 is the NameNode Web UI port we configured in `hdfs-site.xml`, while 9868 is the default Web UI port for `SecondaryNameNode`. You can use a Web browser to browse these two ports, e.g., open the URLs `http://192.168.56.4:50070` and `http://192.168.56.4:9868`.
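If no graphical browser is available (likely, given the minimal installation), an optional equivalent check from the command line uses `wget`, which is already installed:

```
# request the NameNode Web UI and show the HTTP response headers
wget --spider -S http://192.168.56.4:50070/ 2>&1 | head -n 5
# request the SecondaryNameNode Web UI
wget --spider -S http://192.168.56.4:9868/ 2>&1 | head -n 5
```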
- You can also run `hdfs dfsadmin -report` to check the status:

```
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
```

Since we have not configured any DataNode yet, the capacity should be 0.
- Create and Add a DataNode
- Acquire a machine: create a virtual machine by making a linked clone of the base virtual machine
- Set up the IP network
  - Update the hostname to `worker1`
  - Change the IP address by editing `/etc/network/interfaces`
- Since it is a linked clone, there is no need to install any additional software.
- Switch to the planned `hadoop` user:

```
sudo -i -u hadoop
```

- Start the `DataNode` daemon:

```
/opt/hadoop/sbin/hadoop-daemon.sh start datanode
```
- To check the health of the `DataNode` daemon:
  - At `worker1`, run `jps`, e.g.,

```
$ jps
837 DataNode
919 Jps
$
```

  - At `master`, query the DFS report, e.g.,

```
$ hdfs dfsadmin -report
Configured Capacity: 7853862912 (7.31 GB)
Present Capacity: 2349953024 (2.19 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 192.168.56.5:9866 (worker1)
Hostname: worker1
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 12:56:43 EST 2025
Last Block Report: Tue Nov 18 12:56:30 EST 2025
Num of Blocks: 0
$
```

which shows a `DataNode` process coming online.
- To allow restarting all `DataNode` processes on all worker nodes easily, add `worker1` to the Hadoop configuration file `/opt/hadoop/etc/hadoop/workers`, e.g., the content of the file should look as follows:

```
$ cat /opt/hadoop/etc/hadoop/workers
worker1
$
```

- We shall repeat the process for the rest of the worker nodes.
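With the workers file populated, the stock scripts on `master` can stop and start HDFS across the whole cluster; this is an optional alternative to starting each DataNode by hand and relies on the passwordless SSH set up earlier:

```
# stop and restart the HDFS daemons cluster-wide from master;
# start-dfs.sh reads /opt/hadoop/etc/hadoop/workers and starts a DataNode on each listed host
/opt/hadoop/sbin/stop-dfs.sh
/opt/hadoop/sbin/start-dfs.sh
```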
- After adding the three worker nodes, we can view the DFS report:
```
$ hdfs dfsadmin -report
Configured Capacity: 23561588736 (21.94 GB)
Present Capacity: 7049863168 (6.57 GB)
DFS Remaining: 7049781248 (6.57 GB)
DFS Used: 81920 (80 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.56.5:9866 (worker1)
Hostname: worker1
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5082955776 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:26 EST 2025
Last Block Report: Tue Nov 18 12:56:30 EST 2025
Num of Blocks: 0


Name: 192.168.56.6:9866 (worker2)
Hostname: worker2
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349924352 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:25 EST 2025
Last Block Report: Tue Nov 18 13:13:01 EST 2025
Num of Blocks: 0


Name: 192.168.56.7:9866 (worker3)
Hostname: worker3
Decommission Status : Normal
Configured Capacity: 7853862912 (7.31 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 5082959872 (4.73 GB)
DFS Remaining: 2349928448 (2.19 GB)
DFS Used%: 0.00%
DFS Remaining%: 29.92%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Nov 18 13:23:25 EST 2025
Last Block Report: Tue Nov 18 13:19:34 EST 2025
Num of Blocks: 0

$
```
- Using Hadoop tools
- Using the `dfs` tool
  - Create a directory on the DFS:

```
hdfs dfs -mkdir /demo
```

  - Add a file on the DFS:

```
echo "Hello, World!" > helloworld.txt
hdfs dfs -put helloworld.txt /demo/helloworld.txt
```
- Using the `fsck` tool
  - Query file block locations:

```
hdfs fsck /demo/helloworld.txt -files -blocks -locations
```
- Running MapReduce tasks
```
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /demo /output
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 3 300000000
```
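These example jobs assume the YARN daemons are running (a ResourceManager on `master` and a NodeManager on each worker). A minimal sketch of starting them with the stock scripts and reading the word count results back, assuming the wordcount example writes its output to the standard `part-r-00000` file under `/output`:

```
# start YARN across the cluster from master (uses the workers file, like start-dfs.sh)
/opt/hadoop/sbin/start-yarn.sh
# list the NodeManagers that registered with the ResourceManager
yarn node -list
# after the wordcount job finishes, list and read its output from HDFS
# (part-r-00000 is the usual reducer output name; adjust if your output differs)
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
```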
Reference
- “Hadoop Cluster Setup”, https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html, retrieved November 1, 2025.