Before integrating, make sure the system meets the basic requirements below to avoid compatibility problems.
Install Java quickly with the following commands:
sudo apt update
sudo apt install openjdk-11-jdk -y
java -version   # verify the installation (should print the Java version)
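Hadoop will also need the JDK's install path later (as JAVA_HOME). A quick way to find it, assuming the OpenJDK package installed above:
readlink -f $(which java) | sed 's|/bin/java||'   # typically /usr/lib/jvm/java-11-openjdk-amd64 on Ubuntu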
Download Hadoop from the Apache website (e.g., version 3.3.1) and extract it to the target directory:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar -xzf hadoop-3.3.1.tar.gz
sudo mv hadoop-3.3.1 /usr/local/hadoop
Edit the key files under the Hadoop configuration directory (/usr/local/hadoop/etc/hadoop). First, in core-site.xml, point the default file system at the local NameNode:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
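Because start-dfs.sh launches the daemons over ssh, the JAVA_HOME exported in an interactive shell may not be visible to them, so it is common to set it explicitly in hadoop-env.sh. A sketch, assuming the OpenJDK 11 path found earlier (adjust to your actual path):
# add to /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64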
Then, in hdfs-site.xml, set the replication factor and the NameNode/DataNode storage directories (dfs.namenode.name.dir and dfs.datanode.data.dir are the current names of the older dfs.name.dir/dfs.data.dir properties):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/datanode</value>
  </property>
</configuration>
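The storage directories referenced above do not exist yet; a sketch that creates them and hands /usr/local/hadoop to the current user (assuming the daemons will run as that user):
sudo mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop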
Format the NameNode (only needed before the first start; this wipes any existing HDFS data):
/usr/local/hadoop/bin/hdfs namenode -format
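start-dfs.sh starts the daemons over ssh, so passwordless ssh to localhost is normally required before the next step; a minimal setup sketch:
sudo apt install openssh-server -y            # skip if sshd is already running
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa      # skip if a key already exists
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                 # should log in without a password prompt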
Start HDFS:
/usr/local/hadoop/sbin/start-dfs.sh
Verify that the services are up (you should see NameNode and DataNode processes):
jps
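Beyond jps, HDFS health can be checked with the dfsadmin report or the NameNode web UI (port 9870 by default in Hadoop 3.x):
/usr/local/hadoop/bin/hdfs dfsadmin -report   # lists live DataNodes and capacity
# NameNode web UI: http://localhost:9870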
Download a Spark build compatible with the Hadoop version from the Apache website (e.g., spark-3.2.0-bin-hadoop3.2.tgz) and extract it to the target directory:
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xzf spark-3.2.0-bin-hadoop3.2.tgz
sudo mv spark-3.2.0-bin-hadoop3.2 /usr/local/spark
Edit ~/.bashrc and add environment variables for Spark and Hadoop:
export SPARK_HOME=/usr/local/spark
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin
Apply the changes:
source ~/.bashrc
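A quick sanity check that the environment variables and binaries resolve as expected:
echo $SPARK_HOME $HADOOP_HOME
spark-submit --version   # should report Spark 3.2.0
hadoop version           # should report Hadoop 3.3.1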
Edit the Spark configuration file (/usr/local/spark/conf/spark-defaults.conf; copy it from spark-defaults.conf.template if it does not exist yet) and add the HDFS-related settings. Comments must sit on their own lines in this file:
# Local mode, using all available cores
spark.master                 local[*]
# Default file system, pointing at HDFS
spark.hadoop.fs.defaultFS    hdfs://localhost:9000
# Executor memory (adjust to the machine)
spark.executor.memory        4g
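The same settings can also be passed per session instead of via spark-defaults.conf; a sketch using spark-shell flags with the values above:
spark-shell --master local[*] \
  --conf spark.hadoop.fs.defaultFS=hdfs://localhost:9000 \
  --conf spark.executor.memory=4g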
To run a Spark standalone cluster instead (i.e., not on YARN), start the master and a worker:
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-worker.sh spark://localhost:7077
Verify Spark is running via the master web UI at http://localhost:8080.
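To use this standalone cluster rather than local mode, point the shell (or spark-submit) at the master URL; a sketch assuming the default port:
/usr/local/spark/bin/spark-shell --master spark://localhost:7077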
Upload a local file (such as test.txt) to HDFS, creating the target directory first:
echo -e "Hello\nSpark\nHDFS Integration Test" > ~/test.txt
/usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop
/usr/local/hadoop/bin/hdfs dfs -put ~/test.txt /user/hadoop/
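Confirm the upload landed where expected:
/usr/local/hadoop/bin/hdfs dfs -ls /user/hadoop/
/usr/local/hadoop/bin/hdfs dfs -cat /user/hadoop/test.txt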
Start the Spark shell:
spark-shell
In the shell, run the following to read the HDFS file and count its lines:
val hdfsFile = sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt")
hdfsFile.count() // should return the number of lines in the file (3 here)
hdfsFile.collect().foreach(println) // print the file contents
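Writing back to HDFS works the same way from the shell; a minimal sketch that saves an uppercased copy (test_upper is a hypothetical output path and must not exist yet):
val upper = hdfsFile.map(_.toUpperCase)                               // transform each line
upper.saveAsTextFile("hdfs://localhost:9000/user/hadoop/test_upper")  // write the result back to HDFS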
Write a Scala application (e.g., HDFSTest.scala) that reads the HDFS file and counts word occurrences:
import org.apache.spark.{SparkConf, SparkContext}

object HDFSTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HDFS Integration Test")
    val sc = new SparkContext(conf)
    // Read the test file from HDFS and count word occurrences
    val file = sc.textFile("hdfs://localhost:9000/user/hadoop/test.txt")
    val wordCounts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.collect().foreach(println) // collect to the driver before printing
    sc.stop()
  }
}
Package the application with sbt (sbt must be installed beforehand):
mkdir -p HDFSTest/src/main/scala
vim HDFSTest/src/main/scala/HDFSTest.scala   # paste the code above
vim HDFSTest/simple.sbt                      # with the following contents:
name := "HDFSTest"
version := "1.0"
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
(cd HDFSTest && sbt package)   # build from inside the project directory
Submit the job with spark-submit:
/usr/local/spark/bin/spark-submit \
--class HDFSTest \
--master local[*] \
HDFSTest/target/scala-2.12/hdfstest_2.12-1.0.jar
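If the standalone cluster started earlier is running, the same jar can be submitted to it by swapping the master URL; a sketch assuming the default master port:
/usr/local/spark/bin/spark-submit \
  --class HDFSTest \
  --master spark://localhost:7077 \
  HDFSTest/target/scala-2.12/hdfstest_2.12-1.0.jar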
Troubleshooting: if Spark cannot reach HDFS, check that the fs.defaultFS address in core-site.xml is correct and that port 9000 is not occupied by another process, and make sure hdfs namenode -format has been run.
Also confirm that spark.hadoop.fs.defaultFS in spark-defaults.conf matches the HDFS configuration, and that the Hadoop services are actually running (jps should show NameNode and DataNode).
For permission errors, hdfs dfs -chmod -R 777 /user/hadoop opens up access temporarily (configure permissions properly in production).
With the steps above, the HDFS and Spark integration on Ubuntu is complete, and Spark can read, write, and process data stored in HDFS. Once integrated, Spark can take full advantage of HDFS's distributed storage to handle large-scale datasets.