Prerequisites for Debian Hadoop High Availability (HA)
Before configuring HA, ensure you have:
- A running Hadoop cluster on Debian with Java installed (e.g., sudo apt install openjdk-11-jdk).
- Passwordless SSH between all nodes (set up with ssh-keygen and ssh-copy-id) for seamless communication.
- At least three hosts for the ZooKeeper ensemble and JournalNode quorum (as configured below).
1. Configure ZooKeeper Cluster (Coordination Service)
ZooKeeper is essential for monitoring NameNode/ResourceManager health and triggering automatic failover.
Install ZooKeeper on all three ZooKeeper nodes:
sudo apt install zookeeper zookeeperd
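If you also want ZooKeeper to come back after a reboot, enable the service (service name as provided by the Debian zookeeperd package):
sudo systemctl enable zookeeper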
Edit /etc/zookeeper/conf/zoo.cfg on all nodes to include the cluster members:
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zoo1:2888:3888 # Replace with your node hostnames
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
Create a myid file in /var/lib/zookeeper on each node with a unique ID (e.g., 1 for zoo1, 2 for zoo2, 3 for zoo3).
Run sudo systemctl start zookeeper on all nodes and verify status with sudo systemctl status zookeeper.
2. Configure HDFS High Availability (NameNode HA)
HDFS HA eliminates the single point of failure (SPOF) of the NameNode using Active/Standby nodes and JournalNodes for metadata synchronization.
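For orientation, the examples below assume this layout (all hostnames are placeholders for your own): namenode1 and namenode2 run the Active/Standby NameNodes and their ZKFC failover controllers; journalnode1, journalnode2, and journalnode3 run JournalNodes; zoo1, zoo2, and zoo3 form the ZooKeeper ensemble.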
Modify core-site.xml: Define the HDFS namespace and the ZooKeeper quorum used by the ZKFailoverController (ZKFC):
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value> <!-- Logical cluster name -->
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zoo1:2181,zoo2:2181,zoo3:2181</value> <!-- ZooKeeper ensemble -->
</property>
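Because fs.defaultFS points at the logical name, clients stop targeting a single NameNode host. Once the cluster is running (see step 4), a quick smoke test (the path is illustrative):
hdfs dfs -ls hdfs://mycluster/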
Modify hdfs-site.xml: Configure NameNode roles, shared storage (JournalNodes), and failover settings:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value> <!-- Must match fs.defaultFS -->
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value> <!-- Active and Standby NameNode IDs -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1:8020</value> <!-- RPC address for nn1 -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2:8020</value> <!-- RPC address for nn2 -->
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://journalnode1:8485;journalnode2:8485;journalnode3:8485/mycluster</value> <!-- JournalNode quorum -->
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value> <!-- Enable automatic failover -->
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> <!-- Client-side proxy for failover -->
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value> <!-- Prevent split-brain (e.g., kill old Active process) -->
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value> <!-- SSH key for fencing -->
</property>
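Two related properties are often needed as well: the JournalNodes require a local directory for their edit logs, and the standby uses the NameNodes' HTTP addresses for checkpointing. A minimal sketch (the /var/hadoop/journal path is an assumption; port 9870 applies to Hadoop 3, use 50070 on Hadoop 2):
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/var/hadoop/journal</value> <!-- Local storage for JournalNode edits (path is an assumption) -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>namenode1:9870</value> <!-- HTTP address for nn1 -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>namenode2:9870</value> <!-- HTTP address for nn2 -->
</property>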
Start JournalNodes: On each JournalNode host, run:
hadoop-daemon.sh start journalnode
Verify with jps (look for JournalNode processes).
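Note that hadoop-daemon.sh is deprecated in Hadoop 3; the equivalent command there is:
hdfs --daemon start journalnode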
Format and Start NameNodes: Format only the first NameNode (formatting both would create two divergent namespaces), initialize the failover state in ZooKeeper, bootstrap the Standby from the Active, then start HDFS:
hdfs namenode -format # On namenode1 only
hdfs zkfc -formatZK # On namenode1: create the HA state znode in ZooKeeper
hadoop-daemon.sh start namenode # On namenode1, so the Standby can copy its metadata
hdfs namenode -bootstrapStandby # On namenode2
start-dfs.sh # Starts the remaining daemons, including ZKFCs (automatic failover is enabled)
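If startup succeeded, jps on each NameNode host should list NameNode and DFSZKFailoverController, and each JournalNode host should still show a JournalNode process.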
Verify the roles with hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2 (one should report active, the other standby).
3. Configure YARN High Availability (ResourceManager HA)
YARN HA ensures the ResourceManager (which schedules jobs) remains available even if one instance fails.
Modify yarn-site.xml: Configure ResourceManager roles and ZooKeeper for state storage:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarn-cluster</value> <!-- Unique cluster ID -->
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value> <!-- Active and Standby ResourceManager IDs -->
</property>
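<!-- The rm-ids above also need host mappings; the hostnames below match the examples later in this guide -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>resourcemanager1</value> <!-- Host running rm1 -->
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>resourcemanager2</value> <!-- Host running rm2 -->
</property>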
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>zoo1:2181,zoo2:2181,zoo3:2181</value> <!-- ZooKeeper ensemble -->
</property>
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm1</value> <!-- Per-host identity: rm1 on the first RM host, rm2 on the second -->
</property>
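For the Standby to resume running applications after a failover, ResourceManager state must be persisted. A minimal sketch enabling ZooKeeper-backed recovery (optional, but commonly paired with RM HA):
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value> <!-- Persist and recover RM state -->
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value> <!-- Store state in ZooKeeper -->
</property>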
Start YARN: On the Active ResourceManager (e.g., resourcemanager1), run:
start-yarn.sh
If start-yarn.sh does not start the second ResourceManager (it does not on Hadoop 2), start it on the Standby (e.g., resourcemanager2) with yarn-daemon.sh start resourcemanager. The two ResourceManagers then coordinate through ZooKeeper, which elects the Active instance and, with recovery enabled (see above), restores application state after failover.
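To check which ResourceManager is currently Active (rm-ids as configured above):
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2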
4. Validate High Availability
- Run hdfs haadmin -getServiceState nn1 (and nn2) to confirm one Active and one Standby NameNode.
- Simulate a failure (e.g., kill -9 the NameNode process on the active node).
- Check the service state again; the Standby should become Active.
- Run yarn node -list to verify the Active ResourceManager is handling requests.
- Submit a test job (e.g., hadoop jar hadoop-mapreduce-examples.jar pi 10 100) to ensure the cluster functions during failover.
Key Notes for Production