Hadoop: Setting Up and Experimenting
Components of Hadoop
Hadoop has several daemon programs:
- Hadoop File System (HDFS)
  - NameNode
  - SecondaryNameNode
  - DataNode
- YARN resource manager
  - ResourceManager
  - NodeManager
  - WebAppProxy
- MapReduce
  - Job History Server
For large installations, these daemon programs are generally arranged to run on separate hosts.
Setting Up Hadoop
We set up a cluster of virtual machines and install and configure Hadoop on the cluster. In this tutorial, we use Oracle VirtualBox and Debian Linux systems.
- Since we are using virtual machines, to save disk space and effort, let's create a base Linux system on a virtual machine from which we can make clones.
- Create a virtual machine. Although we use virtual machines, the steps below are also applicable to physical machines.
- Set up a Linux system on the machine.
- When setting up the Linux system, select a minimal installation without a GUI. In this way, we can configure machines with a small footprint (little RAM and disk).
- The author uses the Debian Linux distribution and tested the rest of this guide only on Debian Linux systems.
- The author also installed the `sudo` package and added the user to the `sudo` group. To do this, open a terminal and run:

  ```sh
  su -c "apt install sudo"
  su -c "/sbin/usermod -aG sudo $(whoami)"
  ```

  After that, exit the terminal and log in again so that the new group membership takes effect.
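  To confirm the change took effect after logging back in, a quick check (a minimal sketch; nothing here is Hadoop-specific):

  ```sh
  # the output should list "sudo" among the user's groups
  groups
  # ask sudo to validate credentials; it should prompt for the password and succeed
  sudo -v
  ```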
- To make security assurance easier, we run Hadoop under a dedicated user. In this guide, we run Hadoop under the user `hadoop`:

  ```sh
  # add the hadoop user
  sudo adduser hadoop
  ```

- We also run the rest of the setup as user `hadoop`. To become user `hadoop` and go to that user's home directory, run:

  ```sh
  sudo -i -u hadoop
  cd
  ```

- Install the necessary packages and download the Hadoop binary. As user `hadoop`, do the following:

  ```sh
  # ensure the Linux system is up-to-date
  sudo apt update && sudo apt upgrade -y
  # install the JDK
  sudo apt install -y default-jdk
  # install the openssh server and client
  sudo apt install -y openssh-server openssh-client
  # ensure the openssh server is running
  sudo systemctl start ssh && sudo systemctl enable ssh
  # ensure shell access without a password, i.e., ssh public/private key
  # access between machines in the cluster
  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  ssh-copy-id -i ~/.ssh/id_rsa $(whoami)@localhost
  ssh -i ~/.ssh/id_rsa $(whoami)@localhost ls -a
  chmod 700 ~/.ssh
  chmod 600 ~/.ssh/authorized_keys
  # download the Hadoop binary: https://hadoop.apache.org/releases.html
  wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz
  ```
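  Optionally, verify the download before unpacking it. This is a sketch; it assumes the checksum file `hadoop-3.4.2.tar.gz.sha512` is published alongside the tarball on the same mirror path:

  ```sh
  # fetch the published checksum file from the mirror
  wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.2/hadoop-3.4.2.tar.gz.sha512
  # compute the local checksum and compare it against the published value
  sha512sum hadoop-3.4.2.tar.gz
  cat hadoop-3.4.2.tar.gz.sha512
  ```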
- Plan the cluster network.
  - The cluster consists of the following hosts:
    - master
    - worker1
    - worker2
    - worker3
  - For that, we update the `/etc/hosts` file. In this tutorial, the content of the file after the update is:

    ```sh
    $ cat /etc/hosts
    127.0.0.1       localhost

    # The following lines are desirable for IPv6 capable hosts
    ::1     localhost ip6-localhost ip6-loopback
    ff02::1 ip6-allnodes
    ff02::2 ip6-allrouters

    192.168.56.3    hdbase
    192.168.56.4    master
    192.168.56.5    worker1
    192.168.56.6    worker2
    192.168.56.7    worker3
    192.168.56.8    worker4
    192.168.56.9    worker5
    192.168.56.10   worker6
    192.168.56.11   worker7
    192.168.56.12   worker8
    $
    ```
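    Once `/etc/hosts` is in place on a machine, a quick sanity check that the names resolve and the hosts are reachable (a minimal sketch; only the hosts actually brought up so far will answer the ping):

    ```sh
    # resolve each planned hostname and try one ping; hosts not yet created simply report an error
    for h in master worker1 worker2 worker3; do
        getent hosts "$h"
        ping -c 1 -W 1 "$h"
    done
    ```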
- Configure for NameNode and DataNode
  - Create the settings files:

    ```sh
    # make the HDFS directory for the NameNode - only needed on the NameNode
    sudo mkdir -p /opt/hadoop_data/hdfs/namenode
    # make the HDFS directory for the DataNode - only needed on DataNodes
    sudo mkdir -p /opt/hadoop_data/hdfs/datanode
    # change ownership to the hadoop user
    sudo chown -R hadoop:hadoop /opt/hadoop_data/
    # load the hadoop settings from ~/.bashrc
    cat >> ~/.bashrc <<END
    # load hadoop settings
    . .hadoop_settings
    END
    ```

    Create the `~/.hadoop_settings` file with the following content:

    ```sh
    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    export HADOOP_HOME=/opt/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
    ```
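    To confirm the environment is picked up, a quick check in a fresh shell as user `hadoop` (a minimal sketch; `hadoop version` only works once the binary has been installed under `/opt/hadoop` in the next step):

    ```sh
    # load the settings in the current shell and confirm the variables point at the install
    . ~/.hadoop_settings
    echo "$JAVA_HOME"
    echo "$HADOOP_HOME"
    # after the next step installs /opt/hadoop, this prints the Hadoop release in use
    hadoop version
    ```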
  - Install the NameNode software:

    ```sh
    # extract the hadoop binary
    sudo tar -xzvf hadoop-3.4.2.tar.gz -C /opt
    sudo chown -R hadoop:hadoop /opt/hadoop-3.4.2
    sudo ln -s /opt/hadoop-3.4.2 /opt/hadoop
    ```

  - Edit `/opt/hadoop/etc/hadoop/core-site.xml` and replace the empty `<configuration></configuration>` element with:

    ```xml
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000/</value>
      </property>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000/</value>
      </property>
    </configuration>
    ```

    Here we configure Hadoop to run the NameNode daemon on the host master.
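    Once the file is saved, the effective setting can be read back with `hdfs getconf` (a quick sanity check; it should echo the `hdfs://master:9000/` value configured above):

    ```sh
    # print the effective default filesystem as Hadoop sees it
    hdfs getconf -confKey fs.defaultFS
    ```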
  - Edit `/opt/hadoop/etc/hadoop/hdfs-site.xml` and replace the empty `<configuration></configuration>` element with:

    ```xml
    <configuration>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop_data/hdfs/datanode</value>
        <final>true</final>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop_data/hdfs/namenode</value>
        <final>true</final>
      </property>
      <property>
        <name>dfs.namenode.http-address</name>
        <value>master:50070</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>
    ```

    Here we configure where the NameNode metadata and the DataNode data are stored, where the NameNode Web UI listens, and a replication factor of 3.
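    As with `core-site.xml`, the values can be read back once saved (a minimal check using the same `hdfs getconf` tool):

    ```sh
    # confirm the directory and replication settings took effect
    hdfs getconf -confKey dfs.namenode.name.dir
    hdfs getconf -confKey dfs.replication
    ```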
  - Edit `/opt/hadoop/etc/hadoop/yarn-site.xml` and replace the empty `<configuration></configuration>` element with:

    ```xml
    <configuration>
      <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8035</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8050</value>
      </property>
    </configuration>
    ```

    We configure the YARN daemons to run on the master host.
  - Edit `/opt/hadoop/etc/hadoop/mapred-site.xml` and replace the empty `<configuration></configuration>` element with:

    ```xml
    <configuration>
      <property>
        <name>mapreduce.job.tracker</name>
        <value>master:5431</value>
      </property>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
    ```

    Here we configure MapReduce jobs to run on YARN.
  - Edit `/opt/hadoop/etc/hadoop/hadoop-env.sh` and ensure it contains the following line:

    ```sh
    export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
    ```

  - Edit `/opt/hadoop/etc/hadoop/workers` so that it becomes empty, e.g.,

    ```sh
    $ > /opt/hadoop/etc/hadoop/workers
    $ cat /opt/hadoop/etc/hadoop/workers
    $
    ```

  - Ensure the base virtual machine is allocated minimal RAM, e.g., change it to 256 MB. This will allow us to run multiple virtual machines simultaneously.
- Create the master node (running the NameNode)
  - If using a virtual machine, create a linked clone of the base machine.
  - Change the hostname to `master`. This requires updating the `/etc/hostname` file. The content of the file after the update:

    ```sh
    $ cat /etc/hostname
    master
    $
    ```

    Also ensure the IP address matches the `/etc/hosts` file. This can be achieved by updating the `/etc/network/interfaces` file. After the update, we can view the content:

    ```sh
    $ cat /etc/network/interfaces
    source /etc/network/interfaces.d/*

    # The loopback network interface
    auto lo
    iface lo inet loopback

    # The primary network interface
    allow-hotplug enp0s3
    iface enp0s3 inet dhcp
    # This is an autoconfigured IPv6 interface
    iface enp0s3 inet6 auto

    allow-hotplug enp0s8
    iface enp0s8 inet static
        address 192.168.56.4
        netmask 255.255.255.0
        network 192.168.56.0
        broadcast 192.168.56.255
    $
    ```

    Reboot the machine after these changes.
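    After the reboot, it is worth confirming the machine's identity before starting any Hadoop daemons (a minimal sketch; `enp0s8` is the host-only interface name from the example above and may differ on other systems):

    ```sh
    # the hostname should now be "master"
    hostname
    # the static address 192.168.56.4 should appear on the host-only interface
    ip addr show enp0s8
    # the name should resolve to the same address via /etc/hosts
    getent hosts master
    ```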
  - Ensure you are logged in as user `hadoop`:

    ```sh
    sudo -i -u hadoop
    ```

  - Prepare HDFS at the NameNode:

    ```sh
    hdfs namenode -format
    ```

  - Start the HDFS daemons:

    ```sh
    /opt/hadoop/sbin/start-dfs.sh
    ```

  - To check, run `jps`, e.g.,

    ```sh
    $ jps
    1121 SecondaryNameNode
    977 NameNode
    1228 Jps
    $
    ```

    which shows that both the NameNode and SecondaryNameNode daemons are running.

  - To check, also run `ss -tlpn`, e.g.,

    ```sh
    $ ss -tlpn
    ...                                      Port              ...     Process
    ...
    tcp  LISTEN  0  256  192.168.56.4:9000    0.0.0.0:*   users:(("java",pid=977,fd=329))
    tcp  LISTEN  0  500  192.168.56.4:50070   0.0.0.0:*   users:(("java",pid=977,fd=323))
    tcp  LISTEN  0  500       0.0.0.0:9868    0.0.0.0:*   users:(("java",pid=1121,fd=324))
    ...
    $
    ```

    Correlating with the output of `jps`, we can see that the NameNode is listening on ports 9000 and 50070, while the SecondaryNameNode is listening on port 9868. TCP port 50070 is the NameNode Web UI port we configured in `hdfs-site.xml`, while 9868 is the default Web UI port for the SecondaryNameNode. You can use a Web browser to browse these two ports, e.g., open the URLs http://192.168.56.4:50070 and http://192.168.56.4:9868.
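    Since the master was installed without a GUI, the Web UIs can also be probed from the command line, or browsed from the VirtualBox host if the host-only network is reachable there. A minimal sketch using `wget`, which is already installed:

    ```sh
    # --spider checks the URL without downloading; a "200 OK" response means the web UI is up
    wget --spider -S http://192.168.56.4:50070/ 2>&1 | grep "HTTP/"
    wget --spider -S http://192.168.56.4:9868/ 2>&1 | grep "HTTP/"
    ```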
  - You can also run `hdfs dfsadmin -report` to check the status:

    ```sh
    Configured Capacity: 0 (0 B)
    Present Capacity: 0 (0 B)
    DFS Remaining: 0 (0 B)
    DFS Used: 0 (0 B)
    DFS Used%: 0.00%
    Replicated Blocks:
            Under replicated blocks: 0
            Blocks with corrupt replicas: 0
            Missing blocks: 0
            Missing blocks (with replication factor 1): 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
    Erasure Coded Block Groups:
            Low redundancy block groups: 0
            Block groups with corrupt internal blocks: 0
            Missing block groups: 0
            Low redundancy blocks with highest priority to recover: 0
            Pending deletion blocks: 0
    ```

    Since we have not configured any DataNodes yet, the capacity should be 0.
- Create and Add a DataNode
  - Acquire a machine: create a virtual machine by making a linked clone of the base virtual machine.
  - Set up the IP network:
    - update the hostname to `worker1`
    - change the IP address by editing `/etc/network/interfaces`
  - Since it is a linked clone, there is no need to install any additional software.
  - Switch to the planned `hadoop` user:

    ```sh
    sudo -i -u hadoop
    ```

  - Start the DataNode daemon:

    ```sh
    /opt/hadoop/sbin/hadoop-daemon.sh start datanode
    ```

  - To check the health of the DataNode daemon:
    - At `worker1`, run `jps`, e.g.,

      ```sh
      $ jps
      837 DataNode
      919 Jps
      $
      ```

    - At `master`, query the DFS report, e.g.,

      ```sh
      $ hdfs dfsadmin -report
      Configured Capacity: 7853862912 (7.31 GB)
      Present Capacity: 2349953024 (2.19 GB)
      DFS Remaining: 2349928448 (2.19 GB)
      DFS Used: 24576 (24 KB)
      DFS Used%: 0.00%
      Replicated Blocks:
              Under replicated blocks: 0
              Blocks with corrupt replicas: 0
              Missing blocks: 0
              Missing blocks (with replication factor 1): 0
              Low redundancy blocks with highest priority to recover: 0
              Pending deletion blocks: 0
      Erasure Coded Block Groups:
              Low redundancy block groups: 0
              Block groups with corrupt internal blocks: 0
              Missing block groups: 0
              Low redundancy blocks with highest priority to recover: 0
              Pending deletion blocks: 0

      Live datanodes (1):

      Name: 192.168.56.5:9866 (worker1)
      Hostname: worker1
      Decommission Status : Normal
      Configured Capacity: 7853862912 (7.31 GB)
      DFS Used: 24576 (24 KB)
      Non DFS Used: 5082959872 (4.73 GB)
      DFS Remaining: 2349928448 (2.19 GB)
      DFS Used%: 0.00%
      DFS Remaining%: 29.92%
      Configured Cache Capacity: 0 (0 B)
      Cache Used: 0 (0 B)
      Cache Remaining: 0 (0 B)
      Cache Used%: 100.00%
      Cache Remaining%: 0.00%
      Xceivers: 0
      Last contact: Tue Nov 18 12:56:43 EST 2025
      Last Block Report: Tue Nov 18 12:56:30 EST 2025
      Num of Blocks: 0

      $
      ```

      which shows a DataNode process coming online.
  - To allow restarting all DataNode processes on all worker nodes easily, add `worker1` to the Hadoop configuration file `/opt/hadoop/etc/hadoop/workers` (see the restart sketch below). The content of the file should look as follows:

    ```sh
    $ cat /opt/hadoop/etc/hadoop/workers
    worker1
    $
    ```

  - We repeat the process for the rest of the worker nodes.
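    Once the `workers` file lists every worker, the stock start/stop scripts on the master reach all of them over ssh. A minimal sketch of a cluster-wide HDFS restart (assuming passwordless ssh from `master` to each worker as user `hadoop`, as set up earlier):

    ```sh
    # stop the NameNode, SecondaryNameNode, and every DataNode listed in the workers file
    /opt/hadoop/sbin/stop-dfs.sh
    # start them all again; DataNodes are launched on each host named in the workers file
    /opt/hadoop/sbin/start-dfs.sh
    ```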
- After finishing adding the three worker nodes, we can view the DFS report:

  ```sh
  $ hdfs dfsadmin -report
  Configured Capacity: 23561588736 (21.94 GB)
  Present Capacity: 7049863168 (6.57 GB)
  DFS Remaining: 7049781248 (6.57 GB)
  DFS Used: 81920 (80 KB)
  DFS Used%: 0.00%
  Replicated Blocks:
          Under replicated blocks: 0
          Blocks with corrupt replicas: 0
          Missing blocks: 0
          Missing blocks (with replication factor 1): 0
          Low redundancy blocks with highest priority to recover: 0
          Pending deletion blocks: 0
  Erasure Coded Block Groups:
          Low redundancy block groups: 0
          Block groups with corrupt internal blocks: 0
          Missing block groups: 0
          Low redundancy blocks with highest priority to recover: 0
          Pending deletion blocks: 0

  Live datanodes (3):

  Name: 192.168.56.5:9866 (worker1)
  Hostname: worker1
  Decommission Status : Normal
  Configured Capacity: 7853862912 (7.31 GB)
  DFS Used: 28672 (28 KB)
  Non DFS Used: 5082955776 (4.73 GB)
  DFS Remaining: 2349928448 (2.19 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 29.92%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 0
  Last contact: Tue Nov 18 13:23:26 EST 2025
  Last Block Report: Tue Nov 18 12:56:30 EST 2025
  Num of Blocks: 0

  Name: 192.168.56.6:9866 (worker2)
  Hostname: worker2
  Decommission Status : Normal
  Configured Capacity: 7853862912 (7.31 GB)
  DFS Used: 28672 (28 KB)
  Non DFS Used: 5082959872 (4.73 GB)
  DFS Remaining: 2349924352 (2.19 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 29.92%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 0
  Last contact: Tue Nov 18 13:23:25 EST 2025
  Last Block Report: Tue Nov 18 13:13:01 EST 2025
  Num of Blocks: 0

  Name: 192.168.56.7:9866 (worker3)
  Hostname: worker3
  Decommission Status : Normal
  Configured Capacity: 7853862912 (7.31 GB)
  DFS Used: 24576 (24 KB)
  Non DFS Used: 5082959872 (4.73 GB)
  DFS Remaining: 2349928448 (2.19 GB)
  DFS Used%: 0.00%
  DFS Remaining%: 29.92%
  Configured Cache Capacity: 0 (0 B)
  Cache Used: 0 (0 B)
  Cache Remaining: 0 (0 B)
  Cache Used%: 100.00%
  Cache Remaining%: 0.00%
  Xceivers: 0
  Last contact: Tue Nov 18 13:23:25 EST 2025
  Last Block Report: Tue Nov 18 13:19:34 EST 2025
  Num of Blocks: 0

  $
  ```
- Using Hadoop tools
  - Using the `dfs` tool
    - Create a directory on the DFS:

      ```sh
      hdfs dfs -mkdir /demo
      ```

    - Add a file to the DFS:

      ```sh
      echo "Hello, World!" > helloworld.txt
      hdfs dfs -put helloworld.txt /demo/helloworld.txt
      ```

  - Using the `fsck` tool
    - Query file block locations:

      ```sh
      hdfs fsck /demo/helloworld.txt -files -blocks -locations
      ```
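    To confirm the file really landed in HDFS, a couple of quick follow-up commands (a minimal sketch using the same `/demo` path as above; the local destination path is chosen here just for illustration):

    ```sh
    # list the directory and print the file's content straight from HDFS
    hdfs dfs -ls /demo
    hdfs dfs -cat /demo/helloworld.txt
    # copy the file back out of HDFS to a local path
    hdfs dfs -get /demo/helloworld.txt /tmp/helloworld.copy.txt
    ```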
- Running MapReduce tasks

  ```sh
  yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /demo /output
  yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 3 300000000
  ```
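  The examples above assume the YARN daemons are running; this guide only starts the HDFS daemons explicitly. A minimal sketch for bringing up YARN from the master and checking the wordcount result (the `/output` path matches the wordcount command above):

  ```sh
  # start the ResourceManager on this host and a NodeManager on every host in the workers file
  /opt/hadoop/sbin/start-yarn.sh
  # confirm the NodeManagers have registered
  yarn node -list
  # after the wordcount job completes, inspect its result files in HDFS
  hdfs dfs -ls /output
  hdfs dfs -cat /output/part-r-00000
  ```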
Reference
- “Hadoop Cluster Setup”, https://apache.github.io/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html, retrieved November 1, 2025.