Prerequisites for HDFS High Availability (HA) on Debian
Before configuring HDFS HA, ensure the following prerequisites are met:
- Install Java (e.g., sudo apt install openjdk-11-jdk).
- Install Hadoop and set environment variables in ~/.bashrc (e.g., export HADOOP_HOME=/usr/local/hadoop, export PATH=$PATH:$HADOOP_HOME/bin).
- Assign hostnames to all nodes (e.g., namenode1, namenode2, journalnode1) and update /etc/hosts with IP-hostname mappings for all nodes.
- Configure passwordless SSH between all nodes (generate a key pair with ssh-keygen -t rsa and copy it to the other nodes using ssh-copy-id).
Step 1: Configure JournalNode Nodes
JournalNodes store edit logs (transaction records for HDFS metadata) and ensure consistency between Active and Standby NameNodes.
Create the JournalNode data directory on each JournalNode host:
sudo mkdir -p /usr/local/hadoop/journalnode/data
sudo chown -R $USER:$USER /usr/local/hadoop/journalnode
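If the same directory setup is needed on several JournalNode hosts, the two commands above can be pushed out over the passwordless SSH configured in the prerequisites. A minimal sketch (the host names and paths are this guide's examples; this helper is not part of Hadoop):

```python
import subprocess

JOURNALNODE_HOSTS = ["journalnode1", "journalnode2", "journalnode3"]  # example hosts
SETUP_CMD = (
    "sudo mkdir -p /usr/local/hadoop/journalnode/data && "
    "sudo chown -R $USER:$USER /usr/local/hadoop/journalnode"
)

def build_ssh_commands(hosts, remote_cmd):
    """Return one ssh argv per host, ready to pass to subprocess.run()."""
    return [["ssh", host, remote_cmd] for host in hosts]

if __name__ == "__main__":
    # Requires passwordless SSH (and passwordless sudo) on each host.
    for argv in build_ssh_commands(JOURNALNODE_HOSTS, SETUP_CMD):
        subprocess.run(argv, check=True)
```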
Add the following property to $HADOOP_HOME/etc/hadoop/hdfs-site.xml on all nodes:
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/journalnode/data</value>
</property>
Start the JournalNode daemon on each JournalNode host:
hadoop-daemon.sh start journalnode
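Besides checking jps, each JournalNode's HTTP endpoint (default port 8480, per dfs.journalnode.http-address) serves a /jmx page that can be polled from the admin host. A sketch, assuming Python 3 and this guide's example host names:

```python
import json
import urllib.request

def has_live_beans(jmx_text):
    """Return True if a /jmx response body contains at least one metrics bean."""
    data = json.loads(jmx_text)
    return bool(data.get("beans"))

def check_journalnode(host, port=8480, timeout=5):
    """Fetch the JournalNode's /jmx page and report whether it answered sanely."""
    url = f"http://{host}:{port}/jmx"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return has_live_beans(resp.read().decode("utf-8"))

if __name__ == "__main__":
    for host in ["journalnode1", "journalnode2", "journalnode3"]:  # example hosts
        print(host, "up" if check_journalnode(host) else "DOWN")
```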
Verify status with jps (each JournalNode host should show a JournalNode process).
Step 2: Configure NameNode High Availability
This step enables two NameNodes (Active/Standby) to share metadata via JournalNodes.
Edit $HADOOP_HOME/etc/hadoop/core-site.xml to define the HDFS namespace and the ZooKeeper address:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value> <!-- Logical name for the HDFS cluster -->
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1:2181,zk2:2181,zk3:2181</value> <!-- ZooKeeper ensemble addresses -->
</property>
Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml to configure NameNode roles, RPC/HTTP addresses, shared edits, and failover:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value> <!-- Must match fs.defaultFS -->
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value> <!-- Names of NameNodes -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1:8020</value> <!-- RPC address for nn1 -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2:8020</value> <!-- RPC address for nn2 -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>namenode1:9870</value> <!-- HTTP address for nn1 -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>namenode2:9870</value> <!-- HTTP address for nn2 -->
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://journalnode1:8485;journalnode2:8485;journalnode3:8485/mycluster</value> <!-- JournalNode quorum for shared edits -->
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> <!-- Client-side failover proxy -->
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value> <!-- Fencing method to prevent split-brain: SSH to the formerly active NameNode and kill its process -->
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/$USER/.ssh/id_rsa</value> <!-- Absolute path to the private key for SSH fencing; replace $USER with the real username, as variables are not expanded here -->
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value> <!-- Enable automatic failover -->
</property>
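Note that all of the <property> blocks above belong inside the file's single <configuration> root element; a truncated hdfs-site.xml outline:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- ... remaining properties from above ... -->
</configuration>
```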
Format the first NameNode (nn1):
hdfs namenode -format
Start the NameNode on nn1:
hadoop-daemon.sh start namenode
On nn2, bootstrap the standby NameNode (this copies metadata from nn1 to nn2):
hdfs namenode -bootstrapStandby
Start the NameNode on nn2:
hadoop-daemon.sh start namenode
Because automatic failover is enabled, also initialize the HA state in ZooKeeper once, from either NameNode:
hdfs zkfc -formatZK
Verify both NameNodes are running with hdfs haadmin -getServiceState nn1 (should return "active") and hdfs haadmin -getServiceState nn2 (should return "standby").
Step 3: Start HDFS Services
Start all HDFS components in the correct order:
start-dfs.sh # Starts JournalNodes, NameNodes, DataNodes, and (with automatic failover enabled) ZKFC daemons
Check cluster status with:
hdfs dfsadmin -report # Lists DataNodes and their health
hdfs haadmin -getAllServiceStates # Shows NameNode states (active/standby)
Access NameNode Web UIs (e.g., http://namenode1:9870, http://namenode2:9870) to confirm HA status.
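These checks can be scripted. Below is a sketch that runs hdfs haadmin -getAllServiceStates and fails unless exactly one NameNode is active and one is standby (the two-column host/state output layout is an assumption; verify it against your Hadoop version):

```python
import subprocess

def count_states(output):
    """Tally NameNode states from `hdfs haadmin -getAllServiceStates` output.

    Assumes one NameNode per line, with the state as the last whitespace-
    separated token (e.g. "namenode1:8020    active").
    """
    counts = {"active": 0, "standby": 0}
    for line in output.splitlines():
        parts = line.split()
        if parts and parts[-1] in counts:
            counts[parts[-1]] += 1
    return counts

if __name__ == "__main__":
    out = subprocess.run(
        ["hdfs", "haadmin", "-getAllServiceStates"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = count_states(out)
    assert states == {"active": 1, "standby": 1}, f"unexpected HA states: {states}"
```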
Step 4: Test Automatic Failover
Simulate a failure to verify automatic failover works:
On nn1, find the NameNode PID (jps | grep NameNode) and kill it:
kill -9 <NameNode_PID>
On nn2, check its state:
hdfs haadmin -getServiceState nn2 # Should return "active"
Restart the NameNode on nn1 and verify it becomes standby:
hadoop-daemon.sh start namenode
hdfs haadmin -getServiceState nn1 # Should return "standby"
Run a client read/write to confirm the cluster still serves requests:
hdfs dfs -put /local/file.txt /test/
hdfs dfs -get /test/file.txt /local/ # Should succeed after failover
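During the failover window itself, client calls can fail transiently before the standby takes over, so test scripts often wrap them in a short retry loop. A generic sketch in plain Python (not a Hadoop API):

```python
import time

def with_retries(fn, attempts=5, delay=2.0):
    """Call fn(); on failure, retry up to `attempts` times with a fixed delay."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:  # in practice, catch the client's specific error
            last_err = err
            if i < attempts - 1:
                time.sleep(delay)
    raise last_err

# Example: retry an HDFS download via the CLI during a failover test.
# with_retries(lambda: subprocess.run(
#     ["hdfs", "dfs", "-get", "/test/file.txt", "/local/"], check=True))
```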
Step 5: Monitor and Maintain
Set up monitoring to detect issues early:
Check NameNode logs (e.g., $HADOOP_HOME/logs/hadoop-*-namenode-*.log) for errors.
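A simple way to catch issues early is to scan those NameNode logs for ERROR/FATAL lines, e.g. from cron. A sketch using the install path from this guide (the standard Hadoop log line puts the level as a space-delimited token, which this assumes):

```python
import glob

def find_problems(text, levels=("ERROR", "FATAL")):
    """Return log lines whose level token matches one of `levels`."""
    return [ln for ln in text.splitlines()
            if any(f" {lvl} " in ln for lvl in levels)]

if __name__ == "__main__":
    for path in glob.glob("/usr/local/hadoop/logs/hadoop-*-namenode-*.log"):
        with open(path, errors="replace") as fh:
            for line in find_problems(fh.read()):
                print(f"{path}: {line}")
```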