Prerequisites
Before configuring HDFS on Ubuntu, ensure your system has a Java Development Kit installed (Hadoop 3.3.x runs on Java 8 or 11). Install OpenJDK 11 with:
sudo apt update
sudo apt install openjdk-11-jdk
Verify the installation with java -version.
Download the Hadoop release and extract it to /usr/local/:
wget https://downloads.apache.org/hadoop/core/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/
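Optionally, verify the integrity of the downloaded archive before using it. The checksum URL below is an assumption following the same download path as the tarball; compare the two hashes manually:
# Fetch the published checksum file (URL pattern assumed from the tarball link above)
wget https://downloads.apache.org/hadoop/core/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
# Compute the local hash and compare it with the published value
sha512sum hadoop-3.3.4.tar.gz
cat hadoop-3.3.4.tar.gz.sha512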
1. Configure Environment Variables
Set up Hadoop environment variables to access commands globally. Edit ~/.bashrc (or /etc/profile for system-wide access) and add:
export HADOOP_HOME=/usr/local/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply changes with source ~/.bashrc.
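Hadoop's control scripts also read $HADOOP_HOME/etc/hadoop/hadoop-env.sh. If start-dfs.sh later complains that JAVA_HOME is not set, export it there as well; the path below assumes the Ubuntu amd64 OpenJDK 11 package and may differ on your system:
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh (path assumes Ubuntu amd64 OpenJDK 11)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64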
2. Core HDFS Configuration Files
Navigate to the Hadoop configuration directory ($HADOOP_HOME/etc/hadoop) and edit the following files:
a. core-site.xml
Defines the default file system and temporary directory. Add:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value> <!-- For a single-node (pseudo-distributed) setup; use a nameservice URI such as 'hdfs://mycluster' for HA -->
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-3.3.4/tmp</value> <!-- Temporary directory for Hadoop data -->
</property>
</configuration>
b. hdfs-site.xml
Configures HDFS-specific settings such as replication and the NameNode/DataNode directories. Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Replication factor (1 for a single-node setup, 3 for production clusters) -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop-3.3.4/data/namenode</value> <!-- Directory for NameNode metadata -->
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop-3.3.4/data/datanode</value> <!-- Directory for DataNode data storage -->
</property>
</configuration>
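After saving both files, a quick sanity check confirms that Hadoop reads the values you expect; this only needs the configuration files, not running services:
hdfs getconf -confKey fs.defaultFS # Should print hdfs://localhost:9000
hdfs getconf -confKey dfs.replication # Should print 1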
3. Create HDFS Data Directories
Create the directories specified in hdfs-site.xml and set ownership to the current user (replace yourusername with your actual username):
sudo mkdir -p /usr/local/hadoop-3.3.4/data/namenode
sudo mkdir -p /usr/local/hadoop-3.3.4/data/datanode
sudo chown -R yourusername:yourusername /usr/local/hadoop-3.3.4/data
4. Format the NameNode
The NameNode must be formatted before first use to initialize its metadata. Run:
hdfs namenode -format
This command creates the required directory structure and files for the NameNode.
5. Start HDFS Services
Start the HDFS services (NameNode and DataNode) using:
start-dfs.sh
Verify the services are running by checking for Hadoop processes:
jps
You should see NameNode, DataNode, and other Hadoop processes listed.
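The exact output varies, but on a single-node setup it typically looks something like the following (process IDs are illustrative, and start-dfs.sh also launches a SecondaryNameNode):
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 Jps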
6. Verify HDFS Functionality
Open http://localhost:9870 in a browser (for Hadoop 3.x) to view the HDFS web interface. Then test basic file operations from the command line:
hdfs dfs -mkdir -p /user/yourusername # Create a directory (-p also creates /user if missing)
hdfs dfs -put ~/testfile.txt /user/yourusername # Upload a file
hdfs dfs -ls /user/yourusername # List directory contents
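To confirm the upload worked and that the DataNode is reporting in, you can also run (assuming testfile.txt from the step above exists):
hdfs dfs -cat /user/yourusername/testfile.txt # Print the uploaded file's contents
hdfs dfsadmin -report # Show live DataNodes and storage capacity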
7. Optional: Configure SSH for Cluster Nodes
If setting up a multi-node cluster, configure passwordless SSH login between nodes to enable secure communication. Generate an SSH key on the master node and copy it to all slave nodes:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id slave1
ssh-copy-id slave2
Test the connection with ssh slave1 (replace slave1 with the actual hostname/IP of the slave node).
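For a multi-node cluster, also list the worker hostnames in $HADOOP_HOME/etc/hadoop/workers (one per line) so that start-dfs.sh knows where to launch DataNodes; the hostnames below reuse the example names from this section:
# $HADOOP_HOME/etc/hadoop/workers
slave1
slave2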
8. Optional: High Availability (HA) Configuration
For production environments, configure HDFS HA to ensure fault tolerance. This involves: