Before deploying Hadoop, ensure the following prerequisites are met:
- Hostname resolution: list every node's IP address and hostname in /etc/hosts on all machines (to avoid DNS dependencies).
- A dedicated user (e.g., hadoop) with sudo privileges on every node, for security and consistent file ownership.
A minimal example of both is sketched just below.
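A minimal sketch, assuming three nodes named node1, node2, and node3 (the IP addresses are placeholders; substitute your own):
# /etc/hosts entries on every node (placeholder IPs -- adjust to your network)
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
# Create the hadoop user and grant it sudo via the wheel group (CentOS's default sudo group)
sudo useradd -m hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop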
Hadoop requires Java 8 (OpenJDK or Oracle JDK). Run the following commands on all nodes:
# Install OpenJDK 8
sudo yum install -y java-1.8.0-openjdk-devel
# Verify installation
java -version # Should show Java 1.8.x
# Set JAVA_HOME (replace path if using Oracle JDK)
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >> ~/.bashrc
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> ~/.bashrc
source ~/.bashrc
This ensures Hadoop can locate the Java runtime.
Download the latest stable Hadoop release from the Apache website. Extract it to a dedicated directory (e.g., /usr/local):
# Create a hadoop user-owned directory
sudo mkdir -p /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
# Download and extract (replace version as needed)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/hadoop --strip-components=1
This installs Hadoop in /usr/local/hadoop with proper ownership.
Set up environment variables to make Hadoop commands accessible globally. Edit ~/.bashrc (or /etc/profile for system-wide access):
echo "export HADOOP_HOME=/usr/local/hadoop" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> ~/.bashrc
source ~/.bashrc
Verify with hadoop version—it should display the installed version.
Hadoop's start scripts use SSH to launch daemons on the worker nodes, so the NameNode needs passwordless SSH access to every DataNode. On the NameNode (e.g., node1):
# Generate SSH key pair
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the key locally (allows passwordless SSH to localhost)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost # Test login (should not prompt for password)
# Copy key to all DataNodes (replace node2, node3 with actual IPs/hostnames)
ssh-copy-id hadoop@node2
ssh-copy-id hadoop@node3
Repeat ssh-copy-id for every worker node so the start scripts can reach them without a password prompt.
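A quick way to confirm the keys are in place (node2 and node3 are the example hostnames used above):
# Each command should print the remote hostname without asking for a password
ssh hadoop@node2 hostname
ssh hadoop@node3 hostname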
Edit Hadoop’s configuration files in $HADOOP_HOME/etc/hadoop to define cluster behavior:
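Before editing the XML files, it is worth pointing Hadoop's own environment file at Java: daemons started on remote nodes do not necessarily inherit your interactive shell environment, so the Apache docs recommend defining JAVA_HOME in etc/hadoop/hadoop-env.sh as well. A one-line sketch, assuming the OpenJDK 8 path used earlier:
# Point hadoop-env.sh at the JDK (path assumes the OpenJDK 8 install from earlier)
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh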
core-site.xml specifies the default file system (HDFS) and the NameNode address:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value> <!-- Replace with your NameNode's hostname/IP -->
</property>
</configuration>
hdfs-site.xml configures the HDFS replication factor (3 is typical for production; 1 is enough for single-node testing) and the metadata/data directories:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Change to 3 for multi-node clusters -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
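Hadoop creates these directories when the NameNode is formatted and the DataNodes first start, but pre-creating them is a simple way to make sure ownership and permissions are right. A short sketch, using the paths configured above:
# Pre-create the HDFS metadata/data directories with hadoop ownership
sudo mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
sudo chown -R hadoop:hadoop /usr/local/hadoop/data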
mapred-site.xml enables YARN as the MapReduce framework (Hadoop 3.x ships this file under etc/hadoop; on older Hadoop 2.x releases, create it by copying mapred-site.xml.template):
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
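On Hadoop 3.x, MapReduce jobs submitted to YARN can fail with class-not-found errors unless the MapReduce jars are on the container classpath. A commonly added property inside the same <configuration> block of mapred-site.xml, sketched here with the /usr/local/hadoop install path used in this guide (treat it as a starting point, not a definitive setting):
<property>
  <name>mapreduce.application.classpath</name>
  <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
</property>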
yarn-site.xml configures the ResourceManager address and YARN's MapReduce shuffle auxiliary service:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value> <!-- Replace with your ResourceManager's hostname/IP -->
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
These configurations define the cluster’s core structure.
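The HDFS and YARN start scripts also read $HADOOP_HOME/etc/hadoop/workers to learn which hosts should run DataNode and NodeManager daemons; list one worker hostname per line. Assuming node2 and node3 are the workers from the earlier example:
# $HADOOP_HOME/etc/hadoop/workers -- one worker hostname per line
node2
node3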
The NameNode must be formatted once before first use to initialize its metadata storage. Run this command on the NameNode:
hdfs namenode -format
This initializes the metadata directory specified by dfs.namenode.name.dir (the DataNode directories are created when the DataNodes first start). Format the NameNode only once; re-formatting an existing cluster erases HDFS metadata and leaves the DataNodes with a mismatched cluster ID.
Start HDFS and YARN services using the following commands on the NameNode:
# Start HDFS (NameNode and DataNodes)
start-dfs.sh
# Start YARN (ResourceManager and NodeManagers)
start-yarn.sh
To verify services are running, use jps on each node:
On the NameNode (node1) you should see NameNode, SecondaryNameNode, and ResourceManager; on each worker node, DataNode and NodeManager.
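A rough idea of what to expect (process IDs are illustrative, and the exact set depends on which daemons you run on which host; here node1 hosts both the NameNode and the ResourceManager):
# On node1 (master)
$ jps
2145 NameNode
2389 SecondaryNameNode
2650 ResourceManager
3012 Jps
# On node2/node3 (workers)
$ jps
1830 DataNode
1998 NodeManager
2204 Jps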
Check the health of your cluster using these commands:
# View HDFS status
hdfs dfsadmin -report # Shows DataNodes and storage usage
# Access Web UIs
# HDFS NameNode: http://node1:9870 (default port)
# YARN ResourceManager: http://node1:8088 (default port)
The Web UIs provide real-time insights into cluster metrics (e.g., node status, storage usage).
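As a quick end-to-end smoke test, you can write a file into HDFS and run one of the bundled example MapReduce jobs (the jar name below matches the 3.3.4 release downloaded earlier; adjust it if you installed a different version):
# Create a home directory in HDFS and upload a test file
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hadoop/
hdfs dfs -ls /user/hadoop
# Run the bundled pi estimator on YARN (small parameters keep it quick)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 10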
For production environments, configure Hadoop High Availability (HA) to eliminate single points of failure (SPOFs). This involves running an active/standby pair of NameNodes backed by a shared edit log (JournalNodes), a ZooKeeper quorum with ZKFC processes for automatic failover, and, on the YARN side, ResourceManager HA.
By following these steps, you can successfully deploy a Hadoop cluster on CentOS, enabling distributed storage and processing of large datasets.