Prerequisites
Before configuring HDFS on Ubuntu, ensure your system has a Java Development Kit installed (Hadoop 3.3.x runs on Java 8 or 11). Install OpenJDK 11 with:
sudo apt update
sudo apt install openjdk-11-jdk
Verify the installation with java -version.
Download the Hadoop release and extract it to /usr/local/:
wget https://downloads.apache.org/hadoop/core/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/
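Optionally, verify the integrity of the downloaded archive before using it. The checksum URL below is an assumption following the same download path as the tarball; compare the two hashes manually:
# Fetch the published checksum file (URL pattern assumed from the tarball link above)
wget https://downloads.apache.org/hadoop/core/hadoop-3.3.4/hadoop-3.3.4.tar.gz.sha512
# Compute the local hash and compare it with the published value
sha512sum hadoop-3.3.4.tar.gz
cat hadoop-3.3.4.tar.gz.sha512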
1. Configure Environment Variables
Set up Hadoop environment variables to access commands globally. Edit ~/.bashrc (or /etc/profile for system-wide access) and add:
export HADOOP_HOME=/usr/local/hadoop-3.3.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply changes with source ~/.bashrc.
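Hadoop's control scripts also read $HADOOP_HOME/etc/hadoop/hadoop-env.sh. If start-dfs.sh later complains that JAVA_HOME is not set, export it there as well; the path below assumes the Ubuntu amd64 OpenJDK 11 package and may differ on your system:
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh (path assumes Ubuntu amd64 OpenJDK 11)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64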
2. Core HDFS Configuration Files
Navigate to the Hadoop configuration directory ($HADOOP_HOME/etc/hadoop) and edit the following files:
a. core-site.xml
Defines the default file system and temporary directory. Add:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value> <!-- For a single-node (pseudo-distributed) setup; use a nameservice URI such as 'hdfs://mycluster' for HA -->
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-3.3.4/tmp</value> <!-- Temporary directory for Hadoop data -->
</property>
</configuration>
b. hdfs-site.xml
Configures HDFS-specific settings such as replication and the NameNode/DataNode directories. Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value> <!-- Replication factor (1 for a single-node setup, 3 for production clusters) -->
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop-3.3.4/data/namenode</value> <!-- Directory for NameNode metadata -->
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop-3.3.4/data/datanode</value> <!-- Directory for DataNode data storage -->
</property>
</configuration>
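After saving both files, a quick sanity check confirms that Hadoop reads the values you expect; this only needs the configuration files, not running services:
hdfs getconf -confKey fs.defaultFS # Should print hdfs://localhost:9000
hdfs getconf -confKey dfs.replication # Should print 1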
3. Create HDFS Data Directories
Create the directories specified in hdfs-site.xml and set ownership to the current user (replace yourusername with your actual username):
sudo mkdir -p /usr/local/hadoop-3.3.4/data/namenode
sudo mkdir -p /usr/local/hadoop-3.3.4/data/datanode
sudo chown -R yourusername:yourusername /usr/local/hadoop-3.3.4/data
4. Format the NameNode
The NameNode must be formatted before first use to initialize its metadata. Run:
hdfs namenode -format
This command creates the required directory structure and files for the NameNode.
5. Start HDFS Services
Start the HDFS services (NameNode and DataNode) using:
start-dfs.sh
Verify the services are running by checking for Hadoop processes:
jps
You should see NameNode, DataNode, and other Hadoop processes listed.
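The exact output varies, but on a single-node setup it typically looks something like the following (process IDs are illustrative, and start-dfs.sh also launches a SecondaryNameNode):
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 Jps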
6. Verify HDFS Functionality
Open http://localhost:9870 in a browser (for Hadoop 3.x) to view the HDFS web interface. Then test basic file operations from the command line:
hdfs dfs -mkdir -p /user/yourusername # Create a directory (-p also creates /user if missing)
hdfs dfs -put ~/testfile.txt /user/yourusername # Upload a file
hdfs dfs -ls /user/yourusername # List directory contents
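To confirm the upload worked and that the DataNode is reporting in, you can also run (assuming testfile.txt from the step above exists):
hdfs dfs -cat /user/yourusername/testfile.txt # Print the uploaded file's contents
hdfs dfsadmin -report # Show live DataNodes and storage capacity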
7. Optional: Configure SSH for Cluster Nodes
If setting up a multi-node cluster, configure passwordless SSH login between nodes to enable secure communication. Generate an SSH key on the master node and copy it to all slave nodes:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id slave1
ssh-copy-id slave2
Test the connection with ssh slave1 (replace slave1 with the actual hostname/IP of the slave node).
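For a multi-node cluster, also list the worker hostnames in $HADOOP_HOME/etc/hadoop/workers (one per line) so that start-dfs.sh knows where to launch DataNodes; the hostnames below reuse the example names from this section:
# $HADOOP_HOME/etc/hadoop/workers
slave1
slave2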
8. Optional: High Availability (HA) Configuration
For production environments, configure HDFS HA to ensure fault tolerance. This involves: