
How to Configure Hadoop on Ubuntu


Prerequisites
Before configuring Hadoop on Ubuntu, ensure your system meets the following requirements (a quick way to check them from the terminal is shown after the list):

  • Ubuntu 22.04/24.04 LTS (or a stable release).
  • At least 4GB RAM (8GB+ recommended for production).
  • At least 10GB free disk space.
  • Basic familiarity with terminal commands and Linux file systems.
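
A quick way to confirm the release, memory, and free disk space (adjust the mount point if Hadoop will live somewhere other than the root filesystem):

lsb_release -a   # Ubuntu release
free -h          # total and available RAM
df -h /          # free space on the root filesystem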

Step 1: Install Java (Required for Hadoop Runtime)
Hadoop needs a JDK at runtime; the 3.3.x line is supported on Java 8 and Java 11 (OpenJDK or Oracle JDK). Run the following commands to install OpenJDK 11, an LTS release that works well with Hadoop 3.x:

sudo apt update
sudo apt install -y openjdk-11-jdk

Verify the installation:

java -version

You should see output indicating OpenJDK 11 is installed (e.g., openjdk version "11.0.xx").
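
The JDK's installation path is needed later for JAVA_HOME. On Ubuntu the openjdk-11-jdk package normally installs to /usr/lib/jvm/java-11-openjdk-amd64, but you can confirm the exact path on your machine with:

readlink -f "$(which javac)" | sed 's|/bin/javac||'   # prints the JDK home directory
update-alternatives --list java                       # lists installed Java alternatives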

Step 2: Download and Extract Hadoop
Download a stable Hadoop binary release from the Apache Hadoop website; version 3.3.6 is used in the examples below, so substitute the release number if you pick a newer one. Use wget to fetch the tarball and extract it to /usr/local:

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /usr/local/
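
Optionally, verify the tarball against the SHA-512 digest published alongside it (file names assume the 3.3.6 release used above; older releases may only be available from the Apache archive):

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
cat hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz   # the computed digest should match the published one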

Rename the extracted directory for simplicity:

sudo mv /usr/local/hadoop-3.3.6 /usr/local/hadoop
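
Because the archive was extracted with sudo, everything under /usr/local/hadoop is owned by root. Handing the tree to the user who will run Hadoop avoids the permission problems mentioned in the troubleshooting section (shown for the current user; use a dedicated hadoop account if you prefer):

sudo chown -R "$USER":"$USER" /usr/local/hadoop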

Step 3: Configure Environment Variables
Set up environment variables to make Hadoop commands accessible globally. Edit ~/.bashrc (for the current user) or /etc/profile (for all users) using a text editor like nano:

nano ~/.bashrc

Add the following lines at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Save the file (Ctrl+O, Enter, Ctrl+X) and apply changes:

source ~/.bashrc

Verify the configuration by checking the Hadoop version:

hadoop version

You should see the installed Hadoop version (e.g., Hadoop 3.3.6).
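
Hadoop's daemon scripts also need JAVA_HOME; they read it from $HADOOP_HOME/etc/hadoop/hadoop-env.sh and will refuse to start with a "JAVA_HOME is not set" error if it is missing there. Open the file with nano and set it (the path below assumes the OpenJDK 11 package from Step 1; adjust it if your JDK lives elsewhere):

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Inside hadoop-env.sh, uncomment or add this line:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64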

Step 4: Configure SSH for Secure Communication
Hadoop requires passwordless SSH between nodes (even for a single-node setup) to manage daemons. Generate an SSH key pair and configure it:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

Test passwordless login to localhost:

ssh localhost

You should log in without entering a password. Exit the session with exit.
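
If the connection is refused, the OpenSSH server may not be installed (Ubuntu Desktop ships without it). Install and enable it, then retry the login test:

sudo apt install -y openssh-server
sudo systemctl enable --now ssh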

Step 5: Configure Hadoop Core Files
The configuration files live in $HADOOP_HOME/etc/hadoop. Edit the following files there to define how the cluster behaves:

core-site.xml

This file configures Hadoop’s core functionality, including the default file system. Replace the existing content with:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

hdfs-site.xml

This file configures HDFS settings, such as replication factor and data storage paths. Add:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value> <!-- Set to 3 for multi-node clusters -->
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/data/datanode</value>
    </property>
</configuration>
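
The directories named in dfs.namenode.name.dir and dfs.datanode.data.dir are created automatically the first time the NameNode is formatted and the DataNode starts, but creating them up front (as the user who will run Hadoop) surfaces permission problems early:

mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode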

mapred-site.xml

This file configures MapReduce to use YARN as the execution framework. In Hadoop 3.x, mapred-site.xml already exists in the configuration directory (only Hadoop 2.x shipped a mapred-site.xml.template that had to be copied first), so edit it directly to include:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
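
On Hadoop 3.x, MapReduce jobs submitted to YARN can fail to find their classes unless the application classpath is spelled out. The single-node setup guide adds a property along these lines inside the same <configuration> block (treat the exact value as a starting point and adjust it to your install):

<property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>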

yarn-site.xml

This file configures YARN (Yet Another Resource Negotiator) for resource management. Add:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Step 6: Format HDFS (First-Time Setup Only)
Before starting Hadoop for the first time, format the HDFS file system to initialize the NameNode metadata. Run:

hdfs namenode -format

This command initializes the NameNode metadata under dfs.namenode.name.dir (the DataNode directory is populated when the DataNode first starts). Run it only once: reformatting assigns a new cluster ID, and a DataNode that still holds data tagged with the old ID will refuse to start until its data directory is cleared.

Step 7: Start Hadoop Services
Start the HDFS and YARN services using the following commands:

start-dfs.sh  # Starts NameNode, DataNode, and SecondaryNameNode
start-yarn.sh  # Starts ResourceManager and NodeManager

Verify that the services are running by checking for their processes:

jps

You should see output similar to:

1234 NameNode
1256 DataNode
1278 SecondaryNameNode
1300 ResourceManager
1322 NodeManager

Step 8: Verify the Installation
Access the Hadoop web interfaces to confirm the cluster is running:

  • HDFS NameNode: Open http://localhost:9870 in a browser.
  • YARN ResourceManager: Open http://localhost:8088 in a browser.

You should see the cluster status, including live nodes and running applications.
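
As a further smoke test, create a home directory in HDFS, copy a file into it, and run one of the bundled MapReduce examples (the jar name below assumes the 3.3.6 build used earlier; adjust it to your version):

hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/$USER/
hdfs dfs -ls /user/$USER
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 10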

Step 9: Stop Hadoop Services (Optional)
To stop the cluster, run:

stop-yarn.sh
stop-dfs.sh

Troubleshooting Tips

  • Port Conflicts: Ensure ports such as 9000 (HDFS NameNode RPC), 9870 (HDFS web UI), and 8088 (YARN web UI) are not blocked by a firewall or already in use by another process.
  • Permission Issues: Make sure the user running Hadoop owns the installation and data directories, e.g., sudo chown -R $USER:$USER /usr/local/hadoop.
  • Logs: Check the logs under $HADOOP_HOME/logs if services fail to start (a quick check sketch follows this list).
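
For example, to see whether the expected ports are listening and to inspect the most recent NameNode log (the log file name includes your username and hostname):

ss -tlnp | grep -E '9000|9870|8088'
tail -n 50 $HADOOP_HOME/logs/hadoop-$USER-namenode-$(hostname).log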

By following these steps, you’ll have a fully functional Hadoop environment on Ubuntu, ready for distributed data processing tasks.
