Prerequisites
Before configuring Hadoop on Ubuntu, make sure you have a user account with sudo privileges and a working internet connection. The remaining requirements, Java and passwordless SSH, are installed and configured in the steps below.
Step 1: Install Java (Required for Hadoop Runtime)
Hadoop 3.3.x runs on Java 8 or Java 11 (OpenJDK or Oracle JDK). Run the following commands to install OpenJDK 11, a compatible LTS release:
sudo apt update
sudo apt install -y openjdk-11-jdk
Verify the installation:
java -version
You should see output indicating OpenJDK 11 is installed (e.g., openjdk version "11.0.xx").
Step 2: Download and Extract Hadoop
Download a stable Hadoop binary release from the Apache Hadoop website. This guide uses Hadoop 3.3.6; check the downloads page for the current stable release and adjust the version in the commands below if needed. Use wget to fetch the tarball and extract it to /usr/local:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /usr/local/
Rename the extracted directory for simplicity:
sudo mv /usr/local/hadoop-3.3.6 /usr/local/hadoop
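Because the archive was extracted with sudo, the files under /usr/local/hadoop are owned by root. Assuming you will run the Hadoop daemons as your regular (non-root) user, as this guide does, hand ownership of the directory to that user now to avoid permission problems later:
sudo chown -R $USER:$USER /usr/local/hadoop   # make the current user the owner of the Hadoop tree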
Step 3: Configure Environment Variables
Set up environment variables to make Hadoop commands accessible globally. Edit ~/.bashrc (for the current user) or /etc/profile (for all users) using a text editor like nano:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Save the file (Ctrl+O, Enter, Ctrl+X) and apply changes:
source ~/.bashrc
Verify the configuration by checking the Hadoop version:
hadoop version
You should see the installed Hadoop version (e.g., Hadoop 3.3.6).
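The Hadoop scripts also need to know where Java is installed. If hadoop version (or a later start script) reports that JAVA_HOME is not set, one minimal fix, assuming OpenJDK 11 at Ubuntu's default amd64 path, is to record it in hadoop-env.sh:
readlink -f $(which java)   # confirm the real Java path; drop the trailing /bin/java for JAVA_HOME
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh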
Step 4: Configure SSH for Secure Communication
Hadoop requires passwordless SSH between nodes (even for a single-node setup) to manage daemons. Generate an SSH key pair and configure it:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Test passwordless login to localhost:
ssh localhost
You should log in without entering a password. Exit the session with exit.
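If the connection is refused, the SSH server itself may not be installed (Ubuntu Desktop does not ship it by default). Assuming the standard Ubuntu packages, install and enable it with:
sudo apt install -y openssh-server
sudo systemctl enable --now ssh   # start the SSH daemon now and at every boot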
Step 5: Configure Hadoop Core Files
Navigate to the Hadoop configuration directory ($HADOOP_HOME/etc/hadoop) and edit the following four files to define cluster behavior:
core-site.xml
This file configures Hadoop’s core functionality, including the default file system. Replace the existing content with:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
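Setting fs.defaultFS is what lets HDFS commands omit the full URI. For example, once the cluster is running (Step 7), the following two commands are equivalent:
hdfs dfs -ls /                        # uses fs.defaultFS implicitly
hdfs dfs -ls hdfs://localhost:9000/   # spells out the same file system explicitly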
hdfs-site.xml
This file configures HDFS settings, such as the replication factor and data storage paths. Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Set to 3 for multi-node clusters -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
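Formatting in Step 6 creates the NameNode directory, but you can create both paths up front to confirm the permissions are correct (optional, and it assumes your user owns /usr/local/hadoop as set up in Step 2):
mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
ls -ld /usr/local/hadoop/data/*   # both directories should be writable by your user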
mapred-site.xml
This file configures MapReduce to use YARN as the execution framework. In Hadoop 3.x the file already exists in the configuration directory (older 2.x releases shipped only mapred-site.xml.template, which had to be copied to mapred-site.xml first). Edit it to include:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
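Note that on Hadoop 3.x, MapReduce jobs submitted to YARN can fail with a "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" error unless the MapReduce classpath is also defined. The upstream single-node guide adds a property along these lines (optional if you only need HDFS):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>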
yarn-site.xml
This file configures YARN (Yet Another Resource Negotiator) for resource management. Add:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 6: Format HDFS (First-Time Setup Only)
Before starting Hadoop for the first time, format the HDFS file system to initialize the NameNode metadata. Run:
hdfs namenode -format
This command initializes the NameNode metadata in the directory specified by dfs.namenode.name.dir; the DataNode directory (dfs.datanode.data.dir) is created automatically when the DataNode first starts. Look for a "successfully formatted" message near the end of the output.
Step 7: Start Hadoop Services
Start the HDFS and YARN services using the following commands:
start-dfs.sh # Starts NameNode, DataNode, and SecondaryNameNode
start-yarn.sh # Starts ResourceManager and NodeManager
Verify that the services are running by checking for their processes:
jps
You should see output similar to:
1234 NameNode
1256 DataNode
1278 SecondaryNameNode
1300 ResourceManager
1322 NodeManager
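jps also lists its own Jps process, which is normal. If any of the daemons above is missing, inspect its log under $HADOOP_HOME/logs; for example, assuming the default log naming of hadoop-<user>-<daemon>-<hostname>.log:
ls $HADOOP_HOME/logs                                    # one .log file per daemon
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log    # last lines of the NameNode log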
Step 8: Verify the Installation
Access the Hadoop web interfaces to confirm the cluster is running:
NameNode (HDFS) web UI: open http://localhost:9870 in a browser.
YARN ResourceManager web UI: open http://localhost:8088 in a browser.
You should see the cluster status, including live nodes and running applications.
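As an end-to-end check, you can also run one of the example MapReduce jobs bundled with Hadoop (this assumes mapreduce.application.classpath was set as noted in Step 5, and that the jar version matches your Hadoop release):
hdfs dfs -mkdir -p /user/$USER   # create your HDFS home directory
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5
The job should finish by printing an estimated value of Pi to the console.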
Step 9: Stop Hadoop Services (Optional)
To stop the cluster, run:
stop-yarn.sh
stop-dfs.sh
Troubleshooting Tips
If you see permission errors, run chmod -R 755 /usr/local/hadoop (or chown -R the directory to your user) to grant the necessary permissions to the Hadoop directory.
Check the logs in $HADOOP_HOME/logs for errors if services fail to start.
By following these steps, you’ll have a fully functional Hadoop environment on Ubuntu, ready for distributed data processing tasks.