A Practical Guide to Running Hadoop Jobs Efficiently on Ubuntu
1. Environment and Basic Setup
Install Java 8 and SSH, then enable passwordless login to localhost:

sudo apt update && sudo apt install -y openjdk-8-jdk
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
sudo apt install -y openssh-server
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys

Set the Hadoop environment variables:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Minimal configuration for a single-node / pseudo-distributed cluster:

core-site.xml: fs.defaultFS=hdfs://localhost:9000
hdfs-site.xml: dfs.replication=1; point dfs.namenode.name.dir and dfs.datanode.data.dir at local storage directories
mapred-site.xml: mapreduce.framework.name=yarn
yarn-site.xml: yarn.nodemanager.aux-services=mapreduce_shuffle

Format the NameNode and start the daemons:

hdfs namenode -format
start-dfs.sh && start-yarn.sh

jps should now list NameNode, DataNode, ResourceManager, and NodeManager; the web UIs are at http://localhost:9870 (HDFS) and http://localhost:8088 (YARN).

2. Key Configuration for Efficient Execution
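Before setting the parameters that follow, it can help to sanity-check the container arithmetic. A minimal sketch (the numbers are illustrative, and the ~80% heap ratio is a common rule of thumb, not a Hadoop default):

```shell
#!/bin/sh
# Illustrative container sizing: given the total memory and vcores handed to
# a NodeManager, derive a per-container memory size (assuming one container
# per vcore) and a JVM heap that leaves off-heap headroom.
NM_MEMORY_MB=8192      # yarn.nodemanager.resource.memory-mb
NM_VCORES=8            # yarn.nodemanager.resource.cpu-vcores

CONTAINER_MB=$((NM_MEMORY_MB / NM_VCORES))   # one container per vcore
HEAP_MB=$((CONTAINER_MB * 8 / 10))           # ~80% of container for -Xmx

echo "per-container memory: ${CONTAINER_MB} MB"
echo "suggested -Xmx:       ${HEAP_MB} MB"
```

The resulting values would feed yarn.scheduler.minimum-allocation-mb, mapreduce.map.memory.mb, and mapreduce.map.java.opts below.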
Resource allocation (yarn-site.xml):

yarn.nodemanager.resource.memory-mb: total memory handed to the NodeManager (e.g., 8192 MB)
yarn.nodemanager.resource.cpu-vcores: number of vCores handed to the NodeManager (e.g., 8)
yarn.scheduler.minimum-allocation-mb / yarn.scheduler.maximum-allocation-mb: per-container allocation bounds

Block size and parallelism:

dfs.blocksize (e.g., 256 MB, i.e., 268435456 bytes)
mapreduce.job.maps, mapreduce.job.reduces

Task memory:

mapreduce.map.memory.mb, mapreduce.reduce.memory.mb: container sizes for map and reduce tasks
mapreduce.map.java.opts, mapreduce.reduce.java.opts: JVM heap sizes (usually somewhat below the container size, leaving headroom for off-heap overhead)

Compression and data locality:

mapreduce.map.output.compress=true
mapreduce.output.fileoutputformat.compress=true
mapreduce.job.locality.wait: raises the share of map tasks that run local to their data, cutting network overhead

3. Job Submission and Scheduling
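For scripted submission it is convenient to capture the application id from the client output and feed it to the yarn CLI. A sketch of the extraction step (the "Submitted application" log line format is an assumption and may vary across Hadoop versions):

```shell
#!/bin/sh
# Extract the YARN application id from the submission client's log so it can
# be passed to `yarn application -status`. In practice LOG_LINE would come
# from the `hadoop jar ...` output redirected to a file; the sample line
# below is illustrative.
LOG_LINE="INFO impl.YarnClientImpl: Submitted application application_1700000000000_0042"
APP_ID=$(printf '%s\n' "$LOG_LINE" | grep -o 'application_[0-9]*_[0-9]*')
echo "$APP_ID"
# then: yarn application -status "$APP_ID"
```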
Submit a job:

hadoop jar /path/to/your-job.jar com.example.YourJobClass input output

Override settings per job with generic options (the driver must use ToolRunner/GenericOptionsParser for -D to take effect):

hadoop jar /path/to/your-job.jar com.example.YourJobClass -D mapreduce.job.reduces=200 -D mapreduce.map.output.compress=true input output

Monitor running applications:

yarn application -list
yarn application -status <app_id>

4. Quick Performance-Tuning Example

The properties below belong to yarn-site.xml (scheduler and NodeManager resources), hdfs-site.xml (block size), and mapred-site.xml (job settings), respectively:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>16384</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>16</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>16384</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>268435456</value> <!-- 256MB -->
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>200</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
<property>
<name>mapreduce.job.locality.wait</name>
<value>5000</value>
</property>
hadoop jar your-job.jar com.example.YourJobClass \
-D mapreduce.job.reduces=200 \
-D mapreduce.map.output.compress=true \
input output
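The per-job overrides above can be bundled into a small submit wrapper so tuning flags live in one place. A sketch (the wrapper only assembles and prints the command; replace `echo` with `exec` to actually submit, and note the driver must use ToolRunner for the -D options to apply):

```shell
#!/bin/sh
# Hypothetical submit wrapper: assembles the job command with per-job -D
# overrides, then prints it for inspection instead of running it.
JAR=/path/to/your-job.jar
MAIN=com.example.YourJobClass
INPUT=${1:-input}
OUTPUT=${2:-output}

CMD="hadoop jar $JAR $MAIN \
 -D mapreduce.job.reduces=200 \
 -D mapreduce.map.output.compress=true \
 $INPUT $OUTPUT"
echo "$CMD"
```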