1. Prerequisites
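As a rough sketch of the ZooKeeper part of this section, the script below writes a zoo.cfg containing the three server.N entries, plus the myid file that each ZooKeeper server reads from its dataDir. The directory paths, the MY_ID variable, and the tickTime, initLimit, syncLimit and clientPort values are assumptions chosen for illustration (2181 matches the quorum address used in the HDFS and YARN settings); adjust them to the actual installation.

```shell
#!/bin/sh
# Sketch: generate zoo.cfg and myid for one ZooKeeper node.
# ZK_CONF_DIR, ZK_DATA_DIR and MY_ID are illustrative assumptions.
set -eu

ZK_CONF_DIR="${ZK_CONF_DIR:-$PWD/zk-sketch/conf}"
ZK_DATA_DIR="${ZK_DATA_DIR:-$PWD/zk-sketch/data}"
MY_ID="${MY_ID:-1}"   # 1 on node1, 2 on node2, 3 on node3

mkdir -p "$ZK_CONF_DIR" "$ZK_DATA_DIR"

# Timing values and clientPort are assumed defaults, not taken from the guide.
cat > "$ZK_CONF_DIR/zoo.cfg" <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=$ZK_DATA_DIR
clientPort=2181
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
EOF

# The number in myid must match this host's server.N line in zoo.cfg.
echo "$MY_ID" > "$ZK_DATA_DIR/myid"
```

Running it with MY_ID=2 on node2 and MY_ID=3 on node3 gives each server the identity that its server.N line expects.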
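- Set up passwordless SSH between the nodes (generate a key pair with ssh-keygen -t rsa and distribute the public key) and keep the clocks synchronized (ntpdate or chrony).
- List the ZooKeeper ensemble members in zoo.cfg (server.1=node1:2888:3888, server.2=node2:2888:3888, server.3=node3:2888:3888), start ZooKeeper on each node (zkServer.sh start), and verify its status (zkServer.sh status).

2. Configure HDFS High Availability (NameNode HA)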
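The HDFS settings in this section could be written out as core-site.xml and hdfs-site.xml roughly as follows. This is a minimal sketch rather than a complete configuration: the OUT_DIR location and the prop helper are hypothetical names introduced here, and only the properties named in this section are emitted.

```shell
#!/bin/sh
# Sketch: write the HA-related properties into core-site.xml and hdfs-site.xml.
# OUT_DIR and the prop helper are illustrative assumptions.
set -eu

OUT_DIR="${OUT_DIR:-$PWD/ha-conf-sketch}"
mkdir -p "$OUT_DIR"

# Print one Hadoop <property> element: prop NAME VALUE
prop() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

{
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  prop fs.defaultFS 'hdfs://mycluster'
  prop ha.zookeeper.quorum 'node1:2181,node2:2181,node3:2181'
  echo '</configuration>'
} > "$OUT_DIR/core-site.xml"

{
  echo '<?xml version="1.0"?>'
  echo '<configuration>'
  # Nameservice and its two NameNodes
  prop dfs.nameservices 'mycluster'
  prop dfs.ha.namenodes.mycluster 'nn1,nn2'
  prop dfs.namenode.rpc-address.mycluster.nn1 'node1:9000'
  prop dfs.namenode.rpc-address.mycluster.nn2 'node2:9000'
  prop dfs.namenode.http-address.mycluster.nn1 'node1:9870'
  prop dfs.namenode.http-address.mycluster.nn2 'node2:9870'
  # Shared edit log on the JournalNode quorum
  prop dfs.namenode.shared.edits.dir 'qjournal://node1:8485;node2:8485;node3:8485/mycluster'
  # Automatic failover, client-side failover, and fencing
  prop dfs.ha.automatic-failover.enabled 'true'
  prop dfs.client.failover.proxy.provider.mycluster \
    'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
  prop dfs.ha.fencing.methods 'sshfence'
  prop dfs.ha.fencing.ssh.private-key-files '/root/.ssh/id_rsa'
  echo '</configuration>'
} > "$OUT_DIR/hdfs-site.xml"
```

The generated files would still need to be merged with any existing settings and copied into the Hadoop configuration directory on every node.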
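- In core-site.xml, set the default file system to the HA nameservice (fs.defaultFS=hdfs://mycluster) and specify the ZooKeeper ensemble address (ha.zookeeper.quorum=node1:2181,node2:2181,node3:2181).
- In hdfs-site.xml, configure:
  - the nameservice (dfs.nameservices=mycluster), the NameNode list (dfs.ha.namenodes.mycluster=nn1,nn2), the RPC addresses (dfs.namenode.rpc-address.mycluster.nn1=node1:9000, dfs.namenode.rpc-address.mycluster.nn2=node2:9000), and the HTTP addresses (dfs.namenode.http-address.mycluster.nn1=node1:9870, dfs.namenode.http-address.mycluster.nn2=node2:9870);
  - the JournalNode quorum that holds the shared edit log (dfs.namenode.shared.edits.dir=qjournal://node1:8485;node2:8485;node3:8485/mycluster);
  - automatic failover (dfs.ha.automatic-failover.enabled=true);
  - the client failover proxy provider (dfs.client.failover.proxy.provider.mycluster=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider);
  - the fencing method (dfs.ha.fencing.methods=sshfence, dfs.ha.fencing.ssh.private-key-files=/root/.ssh/id_rsa).
- Start the JournalNodes: on every JournalNode host run hdfs --daemon start journalnode, then use jps to confirm that the process exists.
- Initialize the primary NameNode: run hdfs namenode -format on it to initialize the NameNode metadata, then start it (hdfs --daemon start namenode).
- Initialize the standby NameNode: run hdfs namenode -bootstrapStandby on it to copy the metadata from the primary NameNode, then start it (hdfs --daemon start namenode).
- Start ZKFC: run hdfs zkfc -formatZK once, on one NameNode host, to initialize the HA state in ZooKeeper, then start the failover controller on both NameNode hosts (hdfs --daemon start zkfc) to enable automatic failover.

3. Configure YARN High Availability (ResourceManager HA)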
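For the YARN side, the properties this section names would sit in yarn-site.xml along these lines. It is a sketch under the same naming used in this guide (yarn1, rm1, rm2, node1 to node3); the OUT_DIR path is an assumption, and a real yarn-site.xml would normally carry further, unrelated settings.

```shell
#!/bin/sh
# Sketch: write the ResourceManager HA properties into yarn-site.xml.
# OUT_DIR is an illustrative assumption.
set -eu

OUT_DIR="${OUT_DIR:-$PWD/ha-conf-sketch}"
mkdir -p "$OUT_DIR"

cat > "$OUT_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Turn on ResourceManager HA -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- Cluster ID and the logical IDs of the two ResourceManagers -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <!-- ZooKeeper ensemble used for leader election -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <!-- Host of each ResourceManager ID -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>node1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>node2</value>
  </property>
</configuration>
EOF
```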
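- In yarn-site.xml, enable ResourceManager HA (yarn.resourcemanager.ha.enabled=true);
- set the cluster ID (yarn.resourcemanager.cluster-id=yarn1), the ResourceManager list (yarn.resourcemanager.ha.rm-ids=rm1,rm2), and the ZooKeeper address (yarn.resourcemanager.zk-address=node1:2181,node2:2181,node3:2181);
- map each ResourceManager ID to its host (yarn.resourcemanager.hostname.rm1=node1, yarn.resourcemanager.hostname.rm2=node2).
- Run start-yarn.sh to start the ResourceManager processes, then verify the active/standby state with yarn rmadmin -getServiceState rm1.

4. DataNode and NodeManager Configuration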
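- DataNode: in hdfs-site.xml, set the data storage directory (dfs.datanode.data.dir=/export/data/hadoop/datanode), then start the DataNode (hdfs --daemon start datanode).
- NodeManager: in yarn-site.xml, set the NodeManager address (yarn.nodemanager.hostname=nodeX, where nodeX stands for that node's hostname), then start the NodeManager (yarn --daemon start nodemanager).

5. Verify High Availability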
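The failover check in this section can also be scripted instead of watching the Web UI. The sketch below defines two helper functions around hdfs haadmin -getServiceState, which prints active or standby for a NameNode ID. The function names, the NN_IDS variable, and the timeout and polling interval are hypothetical choices made for this example.

```shell
#!/bin/sh
# Sketch: poll the HDFS HA state after the active NameNode has been killed.
# NN_IDS, the function names, and the default timings are illustrative assumptions.

NN_IDS="${NN_IDS:-nn1 nn2}"

# Print the ID of the first NameNode that reports "active"; return 1 if none does.
active_namenode() {
  for nn_id in $NN_IDS; do
    # haadmin fails for a NameNode that is down; treat that as "unreachable".
    nn_state="$(hdfs haadmin -getServiceState "$nn_id" 2>/dev/null || echo unreachable)"
    if [ "$nn_state" = "active" ]; then
      echo "$nn_id"
      return 0
    fi
  done
  return 1
}

# wait_for_failover OLD_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Succeeds once a NameNode other than OLD_ID has become active.
wait_for_failover() {
  old_id="$1"
  timeout="${2:-60}"
  interval="${3:-5}"
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    new_id="$(active_namenode || true)"
    if [ -n "$new_id" ] && [ "$new_id" != "$old_id" ]; then
      echo "failover complete: $old_id -> $new_id after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "no new active NameNode within ${timeout}s" >&2
  return 1
}
```

One possible use: record the current active ID with old=$(active_namenode), kill that NameNode process on its host, then call wait_for_failover "$old" 60 5 from any node that has the HDFS client configuration.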
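- Run jps on each node and confirm that the expected processes are running (NameNode hosts have NameNode and ZKFC, which jps lists as DFSZKFailoverController; ResourceManager hosts have ResourceManager; JournalNode hosts have JournalNode; DataNode hosts have DataNode; NodeManager hosts have NodeManager).
- Open the NameNode Web UI (for example http://node1:9870) and check the NameNode state (it should show "Active" or "Standby"); open the ResourceManager Web UI (for example http://node1:8088) and check the ResourceManager state.
- Simulate a NameNode failure: kill the active NameNode process (kill -9 <NameNode_PID>) and wait 10-30 seconds; the standby NameNode should switch to Active automatically;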