Before deploying Hadoop, ensure the following prerequisites are met:
- Hostname resolution: list every node's IP address and hostname in /etc/hosts on all machines (to avoid DNS dependencies).
- A dedicated user (e.g., hadoop) with sudo privileges on every node, for security and consistent file ownership.
A minimal example of both is sketched just below.
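A minimal sketch, assuming three nodes named node1, node2, and node3 (the IP addresses are placeholders; substitute your own):
# /etc/hosts entries on every node (placeholder IPs -- adjust to your network)
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
# Create the hadoop user and grant it sudo via the wheel group (CentOS's default sudo group)
sudo useradd -m hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop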
Hadoop requires Java 8 (OpenJDK or Oracle JDK). Run the following commands on all nodes:
# Install OpenJDK 8
sudo yum install -y java-1.8.0-openjdk-devel
# Verify installation
java -version # Should show Java 1.8.x
# Set JAVA_HOME (replace path if using Oracle JDK)
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >> ~/.bashrc
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> ~/.bashrc
source ~/.bashrc
This ensures Hadoop can locate the Java runtime.
Download the latest stable Hadoop release from the Apache website. Extract it to a dedicated directory (e.g., /usr/local):
# Create a hadoop user-owned directory
sudo mkdir -p /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
# Download and extract (replace version as needed)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/hadoop --strip-components=1
This installs Hadoop in /usr/local/hadoop with proper ownership.
Set up environment variables to make Hadoop commands accessible globally. Edit ~/.bashrc (or /etc/profile for system-wide access):
echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> ~/.bashrc
source ~/.bashrc
Verify with hadoop version—it should display the installed version.
Hadoop's start scripts use SSH to launch daemons on the worker nodes, so the NameNode needs passwordless SSH access to every DataNode. On the NameNode (e.g., node1):
# Generate SSH key pair
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the key locally (allows passwordless SSH to localhost)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost # Test login (should not prompt for password)
# Copy key to all DataNodes (replace node2, node3 with actual IPs/hostnames)
ssh-copy-id hadoop@node2
ssh-copy-id hadoop@node3
Repeat ssh-copy-id for every worker node so the start scripts can reach them without a password prompt.
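A quick way to confirm the keys are in place (node2 and node3 are the example hostnames used above):
# Each command should print the remote hostname without asking for a password
ssh hadoop@node2 hostname
ssh hadoop@node3 hostname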
Edit Hadoop’s configuration files in $HADOOP_HOME/etc/hadoop to define cluster behavior:
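Before editing the XML files, it is worth pointing Hadoop's own environment file at Java: daemons started on remote nodes do not necessarily inherit your interactive shell environment, so the Apache docs recommend defining JAVA_HOME in etc/hadoop/hadoop-env.sh as well. A one-line sketch, assuming the OpenJDK 8 path used earlier:
# Point hadoop-env.sh at the JDK (path assumes the OpenJDK 8 install from earlier)
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh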
core-site.xml specifies the default file system (HDFS) and the NameNode address:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value> <!-- Replace with your NameNode's hostname/IP -->
</property>
</configuration>
hdfs-site.xml configures the HDFS replication factor (3 is typical for production; 1 is enough for single-node testing) and the metadata/data directories:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Change to 3 for multi-node clusters -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
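Hadoop creates these directories when the NameNode is formatted and the DataNodes first start, but pre-creating them is a simple way to make sure ownership and permissions are right. A short sketch, using the paths configured above:
# Pre-create the HDFS metadata/data directories with hadoop ownership
sudo mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
sudo chown -R hadoop:hadoop /usr/local/hadoop/data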
mapred-site.xml enables YARN as the MapReduce framework (Hadoop 3.x ships this file under etc/hadoop; on older Hadoop 2.x releases, create it by copying mapred-site.xml.template):
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
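On Hadoop 3.x, MapReduce jobs submitted to YARN can fail with class-not-found errors unless the MapReduce jars are on the container classpath. A commonly added property inside the same <configuration> block of mapred-site.xml, sketched here with the /usr/local/hadoop install path used in this guide (treat it as a starting point, not a definitive setting):
<property>
  <name>mapreduce.application.classpath</name>
  <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
</property>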
yarn-site.xml configures the ResourceManager address and YARN's MapReduce shuffle auxiliary service:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value> <!-- Replace with your ResourceManager's hostname/IP -->
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
These configurations define the cluster’s core structure.
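The HDFS and YARN start scripts also read $HADOOP_HOME/etc/hadoop/workers to learn which hosts should run DataNode and NodeManager daemons; list one worker hostname per line. Assuming node2 and node3 are the workers from the earlier example:
# $HADOOP_HOME/etc/hadoop/workers -- one worker hostname per line
node2
node3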
The NameNode must be formatted once before first use to initialize its metadata storage. Run this command on the NameNode:
hdfs namenode -format
This initializes the metadata directory specified by dfs.namenode.name.dir (the DataNode directories are created when the DataNodes first start). Format the NameNode only once; re-formatting an existing cluster erases HDFS metadata and leaves the DataNodes with a mismatched cluster ID.
Start HDFS and YARN services using the following commands on the NameNode:
# Start HDFS (NameNode and DataNodes)
start-dfs.sh
# Start YARN (ResourceManager and NodeManagers)
start-yarn.sh
To verify services are running, use jps on each node:
On the NameNode (node1) you should see NameNode, SecondaryNameNode, and ResourceManager; on each worker node, DataNode and NodeManager.
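A rough idea of what to expect (process IDs are illustrative, and the exact set depends on which daemons you run on which host; here node1 hosts both the NameNode and the ResourceManager):
# On node1 (master)
$ jps
2145 NameNode
2389 SecondaryNameNode
2650 ResourceManager
3012 Jps
# On node2/node3 (workers)
$ jps
1830 DataNode
1998 NodeManager
2204 Jps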
Check the health of your cluster using these commands:
# View HDFS status
hdfs dfsadmin -report # Shows DataNodes and storage usage
# Access Web UIs
# HDFS NameNode: http://node1:9870 (default port)
# YARN ResourceManager: http://node1:8088 (default port)
The Web UIs provide real-time insights into cluster metrics (e.g., node status, storage usage).
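As a quick end-to-end smoke test, you can write a file into HDFS and run one of the bundled example MapReduce jobs (the jar name below matches the 3.3.4 release downloaded earlier; adjust it if you installed a different version):
# Create a home directory in HDFS and upload a test file
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hadoop/
hdfs dfs -ls /user/hadoop
# Run the bundled pi estimator on YARN (small parameters keep it quick)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 10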
For production environments, configure Hadoop High Availability (HA) to eliminate single points of failure (SPOFs). This involves running an active/standby pair of NameNodes backed by a shared edit log (JournalNodes), a ZooKeeper quorum with ZKFC processes for automatic failover, and, on the YARN side, ResourceManager HA.
By following these steps, you can successfully deploy a Hadoop cluster on CentOS, enabling distributed storage and processing of large datasets.