Hadoop Streaming on Ubuntu lets you write MapReduce programs in non-Java languages such as Python, Ruby, or PHP. The following is a step-by-step guide. First, make sure Python is installed:
sudo apt-get update
sudo apt-get install python3 python3-pip
As an example, let's write a simple WordCount program.
Mapper (mapper.py):
#!/usr/bin/env python3
import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
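For instance, running one sample line through the mapper's per-word logic (inlined here as a small helper for illustration; `map_line` is not part of the scripts above) emits one tab-separated pair per word:

```python
def map_line(line):
    # Same per-word logic as mapper.py, returned as a list of output lines
    return [f"{word}\t1" for word in line.strip().split()]

print("\n".join(map_line("hello world hello")))
# prints:
# hello\t1
# world\t1
# hello\t1   (each \t is a real tab character)
```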
Reducer (reducer.py):
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop sorts mapper output by key, so identical words arrive consecutively
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_count = count
        current_word = word

# Emit the count for the last word
if current_word is not None:
    print(f"{current_word}\t{current_count}")
Make sure both scripts are executable:
chmod +x mapper.py
chmod +x reducer.py
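Before submitting the job, it can help to sanity-check the logic locally. The following is a minimal, self-contained sketch of the map → sort → reduce flow that Hadoop Streaming performs; it re-implements the two scripts' logic inline rather than invoking the files above:

```python
from itertools import groupby

def map_words(lines):
    # mapper.py logic: emit (word, 1) for every word in every line
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    # Hadoop sorts mapper output by key before the reducer sees it;
    # sorted() emulates that shuffle/sort phase locally
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["hello world", "hello hadoop"]
    print(dict(reduce_counts(map_words(sample))))
    # -> {'hadoop': 1, 'hello': 2, 'world': 1}
```

Piping a local file through the real scripts (`cat file.txt | ./mapper.py | sort | ./reducer.py`) gives the same kind of end-to-end check.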
Upload your input data to HDFS:
hdfs dfs -mkdir -p /input
hdfs dfs -put /path/to/your/local/input/file.txt /input
Run the Hadoop Streaming job with the hadoop jar command:
hadoop jar /path/to/hadoop-streaming.jar \
-files mapper.py,reducer.py \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input /input/file.txt \
-output /output
-files: files to ship to every MapReduce task.
-mapper: the mapper command.
-reducer: the reducer command.
-input: HDFS path of the input data.
-output: HDFS path for the output (this directory must not already exist).
After the job finishes, view the results:
hdfs dfs -cat /output/part-r-00000
Make sure the path to hadoop-streaming.jar is correct (it is typically found under $HADOOP_HOME/share/hadoop/tools/lib/). With these steps, you can run MapReduce programs written in non-Java languages on Ubuntu using Hadoop Streaming.