Deploying an HDFS Cluster on CentOS
1. Environment Preparation and Planning
2. Installation and Basic Configuration
Install JDK 8 and verify the installation:

sudo yum install -y java-1.8.0-openjdk-devel
java -version

Download and unpack Hadoop 3.3.1 (writing to /usr/local requires root):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
sudo tar -xzvf hadoop-3.3.1.tar.gz -C /usr/local/
sudo mv /usr/local/hadoop-3.3.1 /usr/local/hadoop

Add the environment variables in /etc/profile.d/hadoop.sh:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Load the profile, set up passwordless SSH between the nodes, and create the data directories:

source /etc/profile.d/hadoop.sh
ssh-keygen -t rsa -b 2048
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2
sudo mkdir -p /usr/local/hadoop/data/{namenode,datanode}
sudo chown -R $(whoami):$(whoami) /usr/local/hadoop

3. Core Configuration
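The ssh-copy-id steps above assume the hostnames master, slave1, and slave2 already resolve on every node. If no DNS is available, a static mapping in /etc/hosts on each machine works; the IP addresses below are placeholders for illustration, not values from the original setup:

```
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2
```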
All of the following files live in $HADOOP_HOME/etc/hadoop/. First, set the JDK path explicitly in hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/datanode</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Sync the entire $HADOOP_HOME/etc/hadoop/ directory to every slave node so that all machines share identical configuration.

4. Startup and Verification
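One file worth including in that sync: start-dfs.sh launches DataNodes on the hosts listed in $HADOOP_HOME/etc/hadoop/workers (the file was named slaves in Hadoop 2.x). For the two-DataNode layout assumed in this guide it would contain:

```
slave1
slave2
```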
On master, format the NameNode (first run only; re-formatting destroys existing metadata) and start HDFS:

hdfs namenode -format
start-dfs.sh

Verify the daemons with jps on each node, then check the cluster and run a smoke test:

hdfs dfsadmin -report    (cluster summary and DataNode list)
hdfs dfs -mkdir -p /user/test
hdfs dfs -put /etc/hosts /user/test
hdfs dfs -ls /user/test

5. Key Parameters and Operations Tips
dfs.replication: replica count, default 3; 2 is reasonable for a two-node test cluster
dfs.blocksize: block size, default 128 MB; affects the number of map tasks and NameNode memory usage
dfs.namenode.name.dir: NameNode metadata directories (redundant paths across multiple disks recommended)
dfs.datanode.data.dir: DataNode data directories (multiple disks significantly improve I/O)
dfs.datanode.max.transfer.threads: default 4096; 8192 or higher recommended for heavy concurrent transfers
dfs.namenode.handler.count: default 10; tune by cluster size, a common rule of thumb is 20 × ln(N)
dfs.datanode.balance.bandwidthPerSec: balancer bandwidth, default 1 MB/s; raise it to match your network
dfs.datanode.failed.volumes.tolerated: number of failed disks tolerated before a DataNode stops; 1–2 for multi-disk nodes
dfs.permissions.enabled: enable permission checking in production
dfs.hosts / dfs.hosts.exclude: include/exclude lists, convenient for decommissioning nodes during maintenance
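As a quick illustration of the handler-count rule of thumb above, the value for a hypothetical 8-node cluster can be computed with awk (N=8 is an assumed example size, not a figure from this guide):

```shell
# 20 * ln(N), rounded to the nearest integer; N=8 is an assumed cluster size
N=8
awk -v n="$N" 'BEGIN { printf "%.0f\n", 20 * log(n) }'
# prints 42
```

That suggests a dfs.namenode.handler.count of roughly 42 for eight nodes; treat it as a starting point and validate under real RPC load.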