Prerequisites for Debian Hadoop High Availability (HA)
Before configuring HA, ensure you have:
- A running Hadoop cluster on Debian with Java installed (e.g., sudo apt install openjdk-11-jdk).
- Passwordless SSH between all nodes (set up with ssh-keygen and ssh-copy-id) for seamless communication.
- At least three hosts for the ZooKeeper ensemble and JournalNode quorum (as configured below).
1. Configure ZooKeeper Cluster (Coordination Service)
ZooKeeper is essential for monitoring NameNode/ResourceManager health and triggering automatic failover.
Install ZooKeeper on all three ZooKeeper nodes:
sudo apt install zookeeper zookeeperd
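If you also want ZooKeeper to come back after a reboot, enable the service (service name as provided by the Debian zookeeperd package):
sudo systemctl enable zookeeper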
Edit /etc/zookeeper/conf/zoo.cfg on all nodes to include the cluster members:
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zoo1:2888:3888 # Replace with your node hostnames
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
Create a myid file in /var/lib/zookeeper on each node with a unique ID (e.g., 1 for zoo1, 2 for zoo2, 3 for zoo3).
Run sudo systemctl start zookeeper on all nodes and verify status with sudo systemctl status zookeeper.
2. Configure HDFS High Availability (NameNode HA)
HDFS HA eliminates the single point of failure (SPOF) of the NameNode using Active/Standby nodes and JournalNodes for metadata synchronization.
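For orientation, the examples below assume this layout (all hostnames are placeholders for your own): namenode1 and namenode2 run the Active/Standby NameNodes and their ZKFC failover controllers; journalnode1, journalnode2, and journalnode3 run JournalNodes; zoo1, zoo2, and zoo3 form the ZooKeeper ensemble.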
Modify core-site.xml: Define the HDFS namespace and the ZooKeeper quorum used by the ZKFailoverController (ZKFC):
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value> <!-- Logical cluster name -->
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zoo1:2181,zoo2:2181,zoo3:2181</value> <!-- ZooKeeper ensemble -->
</property>
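Because fs.defaultFS points at the logical name, clients stop targeting a single NameNode host. Once the cluster is running (see step 4), a quick smoke test (the path is illustrative):
hdfs dfs -ls hdfs://mycluster/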
Modify hdfs-site.xml: Configure NameNode roles, shared storage (JournalNodes), and failover settings:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value> <!-- Must match fs.defaultFS -->
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value> <!-- Active and Standby NameNode IDs -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1:8020</value> <!-- RPC address for nn1 -->
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2:8020</value> <!-- RPC address for nn2 -->
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://journalnode1:8485;journalnode2:8485;journalnode3:8485/mycluster</value> <!-- JournalNode quorum -->
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value> <!-- Enable automatic failover -->
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> <!-- Client-side proxy for failover -->
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value> <!-- Prevent split-brain (e.g., kill old Active process) -->
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value> <!-- SSH key for fencing -->
</property>
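Two related properties are often needed as well: the JournalNodes require a local directory for their edit logs, and the standby uses the NameNodes' HTTP addresses for checkpointing. A minimal sketch (the /var/hadoop/journal path is an assumption; port 9870 applies to Hadoop 3, use 50070 on Hadoop 2):
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/var/hadoop/journal</value> <!-- Local storage for JournalNode edits (path is an assumption) -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>namenode1:9870</value> <!-- HTTP address for nn1 -->
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>namenode2:9870</value> <!-- HTTP address for nn2 -->
</property>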
Start JournalNodes: On each JournalNode host, run:
hadoop-daemon.sh start journalnode
Verify with jps (look for JournalNode processes).
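Note that hadoop-daemon.sh is deprecated in Hadoop 3; the equivalent command there is:
hdfs --daemon start journalnode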
Format and Start NameNodes: Format only the first NameNode (formatting both would create two divergent namespaces), initialize the failover state in ZooKeeper, bootstrap the Standby from the Active, then start HDFS:
hdfs namenode -format # On namenode1 only
hdfs zkfc -formatZK # On namenode1: create the HA state znode in ZooKeeper
hadoop-daemon.sh start namenode # On namenode1, so the Standby can copy its metadata
hdfs namenode -bootstrapStandby # On namenode2
start-dfs.sh # Starts the remaining daemons, including ZKFCs (automatic failover is enabled)
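If startup succeeded, jps on each NameNode host should list NameNode and DFSZKFailoverController, and each JournalNode host should still show a JournalNode process.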
Verify the roles with hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2 (one should report active, the other standby).
3. Configure YARN High Availability (ResourceManager HA)
YARN HA ensures the ResourceManager (which schedules jobs) remains available even if one instance fails.
Modify yarn-site.xml: Configure ResourceManager roles and ZooKeeper for state storage:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarn-cluster</value> <!-- Unique cluster ID -->
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value> <!-- Active and Standby ResourceManager IDs -->
</property>
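<!-- The rm-ids above also need host mappings; the hostnames below match the examples later in this guide -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>resourcemanager1</value> <!-- Host running rm1 -->
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>resourcemanager2</value> <!-- Host running rm2 -->
</property>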
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>zoo1:2181,zoo2:2181,zoo3:2181</value> <!-- ZooKeeper ensemble -->
</property>
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm1</value> <!-- Per-host identity: rm1 on the first RM host, rm2 on the second -->
</property>
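For the Standby to resume running applications after a failover, ResourceManager state must be persisted. A minimal sketch enabling ZooKeeper-backed recovery (optional, but commonly paired with RM HA):
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value> <!-- Persist and recover RM state -->
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value> <!-- Store state in ZooKeeper -->
</property>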
Start YARN: On the Active ResourceManager (e.g., resourcemanager1), run:
start-yarn.sh
If start-yarn.sh does not start the second ResourceManager (it does not on Hadoop 2), start it on the Standby (e.g., resourcemanager2) with yarn-daemon.sh start resourcemanager. The two ResourceManagers then coordinate through ZooKeeper, which elects the Active instance and, with recovery enabled (see above), restores application state after failover.
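To check which ResourceManager is currently Active (rm-ids as configured above):
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2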
4. Validate High Availability
- Run hdfs haadmin -getServiceState nn1 (and nn2) to confirm one Active and one Standby NameNode.
- Simulate a failure (e.g., kill -9 the NameNode process on the active node).
- Check the service state again; the Standby should become Active.
- Run yarn node -list to verify the Active ResourceManager is handling requests.
- Submit a test job (e.g., hadoop jar hadoop-mapreduce-examples.jar pi 10 100) to ensure the cluster functions during failover.
Key Notes for Production