Practical approaches to scheduling Hadoop jobs on Ubuntu
1. Scheduling layers and when each applies
Scheduling happens at two layers: deciding when a job runs (an OS-level cron entry or a workflow scheduler) and deciding how running jobs share cluster resources (YARN's Capacity or Fair Scheduler). The rest of this article works through both.
2. Quick-start path: cron plus a wrapper script
The fastest way to get a recurring Hadoop job running on Ubuntu is a small wrapper script driven by cron. Save the following as /opt/scripts/run_wordcount.sh (the path the crontab entry below expects):
#!/usr/bin/env bash
set -euo pipefail
# cron runs with a minimal environment; make sure Hadoop is on PATH
# (the /opt/hadoop fallback is an assumption -- adjust to your install)
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
export PATH="$HADOOP_HOME/bin:$PATH"
# yesterday's partition in, today's dated directory out
INPUT=/data/input/$(date -d "yesterday" +%Y-%m-%d)
OUTPUT=/data/output/wordcount/$(date +%Y-%m-%d)
# remove stale output so the job does not fail on an existing directory
hadoop fs -rm -r -f "$OUTPUT"
yarn jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount "$INPUT" "$OUTPUT"
Make the script executable, then register the cron entry:
chmod +x /opt/scripts/run_wordcount.sh
# run every day at 02:00
(crontab -l 2>/dev/null; echo "0 2 * * * /opt/scripts/run_wordcount.sh >> /var/log/hadoop/wordcount.log 2>&1") | crontab -
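The crontab entry appends to /var/log/hadoop/wordcount.log, so that directory must exist and be writable by the cron user; a one-time setup sketch (the ownership choice is an assumption):
sudo mkdir -p /var/log/hadoop
sudo chown "$USER" /var/log/hadoop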
Beyond a single cron entry, the same script drops straight into a workflow scheduler. The type=command/command pair below follows the classic Azkaban .job format; adapt it if you run a different scheduler:
type=command
command=/opt/scripts/run_wordcount.sh
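The payoff of a workflow scheduler over cron is dependency handling. If the snippet above is saved as wordcount.job, a downstream step can wait for it; the job name and report script below are illustrative assumptions:
# report.job -- runs only after the wordcount job succeeds (names are illustrative)
type=command
dependencies=wordcount
command=/opt/scripts/build_report.sh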
3. YARN queue and resource-scheduler configuration
Cron decides when a job starts; YARN decides how concurrent jobs share the cluster once they are running. To add a dedicated hive queue under the default Capacity Scheduler, edit capacity-scheduler.xml. The queue has to be declared under root and sibling capacities must sum to 100; the default/hive layout and the 40/60 split below are one common choice:
<!-- capacity-scheduler.xml: the hive queue must be listed under root and
     sibling capacities must sum to 100; the default,hive layout and the
     40/60 split are assumptions -- adjust to your queue structure. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,hive</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>40</value>
</property>
<!-- hive queue gets 60% of cluster resources -->
<property>
  <name>yarn.scheduler.capacity.root.hive.capacity</name>
  <value>60</value>
</property>
<!-- a single user may use at most 1x the queue's configured capacity -->
<property>
  <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>
  <value>1</value>
</property>
<!-- the queue may elastically grow to at most 80% of the cluster -->
<property>
  <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
  <value>80</value>
</property>
<!-- RUNNING accepts new applications; STOPPED drains the queue -->
<property>
  <name>yarn.scheduler.capacity.root.hive.state</name>
  <value>RUNNING</value>
</property>
<!-- ACLs: who may submit applications to / administer the queue ("*" = everyone) -->
<property>
  <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>
  <value>*</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>
  <value>*</value>
</property>
After distributing the updated configuration to every node, refresh the queues (no ResourceManager restart is needed): yarn rmadmin -refreshQueues
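A minimal way to do the distribution from the ResourceManager host, assuming passwordless SSH and the placeholder hostnames hadoop102/hadoop103:
# push capacity-scheduler.xml to the other nodes (hostnames are placeholders)
for host in hadoop102 hadoop103; do
  scp "$HADOOP_HOME/etc/hadoop/capacity-scheduler.xml" "$host:$HADOOP_HOME/etc/hadoop/"
done
yarn rmadmin -refreshQueues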
Submit a job to a specific queue:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount -D mapreduce.job.queuename=hive /input /output
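If every MapReduce job submitted from a given client should land in the hive queue without passing -D each time, the default queue can be set in that client's mapred-site.xml; a minimal sketch, assuming a client-side override is acceptable in your setup:
<!-- mapred-site.xml on the submitting client: default queue for MR jobs -->
<property>
  <name>mapreduce.job.queuename</name>
  <value>hive</value>
</property>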
To use the Fair Scheduler instead of the default Capacity Scheduler, switch the scheduler class and point it at an allocation file in yarn-site.xml (changing the scheduler class does require a ResourceManager restart):
<!-- yarn-site.xml: use the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- queue definitions live in a separate allocation file -->
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/fair-scheduler.xml</value>
</property>
In fair-scheduler.xml, define the queues along with their weights, minimum/maximum resources, and preemption policy (the sketch above shows the general shape), then distribute the file and restart the ResourceManager or refresh the queues.
4. Monitoring and operations essentials