Integrating Hadoop with Other Tools on Linux: General Workflow and Concrete Implementation
Integrating Hadoop with other tools (such as Spark, Hive, and Sqoop) in a Linux environment requires the following base configuration first:

1. JDK: set the JAVA_HOME environment variable (e.g. export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_221) and add it to PATH.
2. Passwordless SSH: generate a key pair with ssh-keygen -t rsa, append the public key (id_rsa.pub) to the authorized_keys file (cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys), and set that file's permissions to 600.
3. Hadoop itself: install Hadoop under a directory such as /opt/hadoop, edit the core configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml), format HDFS (hdfs namenode -format), and start the HDFS (start-dfs.sh) and YARN (start-yarn.sh) services, as sketched below.
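As a concrete illustration of this base setup, here is a minimal sketch assuming a small cluster whose NameNode host is named namenode and whose Hadoop installation lives under /opt/hadoop (both the paths and the fs.defaultFS address are assumptions; adjust them to your environment):

# Point the shell at the JDK and Hadoop binaries (paths assumed)
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_221
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Minimal core-site.xml: tell every component where HDFS lives
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
EOF

# Format the NameNode once, then bring up HDFS and YARN
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# Sanity check: HDFS should answer and YARN should list its nodes
hdfs dfs -ls /
yarn node -list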
Spark, an in-memory compute framework, integrates tightly with Hadoop's HDFS (storage) and YARN (resource management) to speed up data processing:

1. Point Spark at HDFS: in spark-defaults.conf, add the HDFS address (e.g. spark.hadoop.fs.defaultFS=hdfs://namenode:8020, replacing namenode with your actual NameNode address) and, if needed, the replication factor (e.g. spark.hadoop.dfs.replication=3).
2. Run on YARN: set spark.master=yarn in spark-defaults.conf, or pass --master yarn when submitting a job, e.g. spark-submit --master yarn --class com.example.WordCount --num-executors 10 --executor-memory 2g --executor-cores 2 my-spark-app.jar (see the sketch after this list).
3. Read and write HDFS data, for example from spark-shell: load a file (val data = sc.textFile("hdfs://namenode:8020/path/to/data")), run a word count (val wordCounts = data.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)), and write the result back to HDFS (wordCounts.saveAsTextFile("hdfs://namenode:8020/path/to/output")).
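A minimal sketch of the Spark-side wiring, assuming Spark is installed under /opt/spark and Hadoop's configuration lives in /opt/hadoop/etc/hadoop (both paths are assumptions; Spark on YARN only requires that HADOOP_CONF_DIR point at the directory containing core-site.xml and yarn-site.xml):

# spark-env.sh: let Spark find the Hadoop/YARN configuration (path assumed)
echo 'export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop' >> /opt/spark/conf/spark-env.sh

# spark-defaults.conf: default cluster manager and HDFS address
cat >> /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.master                   yarn
spark.hadoop.fs.defaultFS      hdfs://namenode:8020
spark.hadoop.dfs.replication   3
EOF

# Submit the word-count job from the text to the YARN cluster
spark-submit --master yarn --class com.example.WordCount \
  --num-executors 10 --executor-memory 2g --executor-cores 2 \
  my-spark-app.jar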
Hive is a data-warehouse tool that stores its data on HDFS and exposes a SQL query interface through HiveServer2:

1. Configure hive-site.xml: set the metastore connection (e.g. javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=metastore_db;create=true; the embedded Derby database is fine for development, while MySQL is recommended for production) and the warehouse directory (e.g. hive.metastore.warehouse.dir=/user/hive/warehouse, which must point to an HDFS path).
2. Align with Hadoop: make sure the fs.defaultFS Hive sees matches Hadoop's core-site.xml (e.g. hdfs://namenode:8020) so Hive can reach the data on HDFS.
3. Start HiveServer2 with hive --service hiveserver2 &; once it is up, connect with beeline or any JDBC client, e.g. beeline -u "jdbc:hive2://namenode:10000/default".
4. Work with data: create a table with CREATE TABLE employees (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';, load data from HDFS (LOAD DATA INPATH '/path/to/employees.csv' INTO TABLE employees;), and run a query (SELECT name, COUNT(*) FROM employees GROUP BY name;). A short end-to-end sketch follows this list.
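A minimal end-to-end sketch of the Hive steps above, assuming hive-site.xml is already configured as in step 1, HiveServer2 runs on the host namenode with the default port 10000, and the sample CSV has already been copied to an assumed HDFS path /tmp/employees.csv:

# Start HiveServer2 in the background
hive --service hiveserver2 &

# HiveQL script: create the table, load the CSV from HDFS, run the aggregation
cat > /tmp/employees.hql <<'EOF'
CREATE TABLE employees (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/tmp/employees.csv' INTO TABLE employees;
SELECT name, COUNT(*) FROM employees GROUP BY name;
EOF

# Run the script through beeline against HiveServer2
beeline -u "jdbc:hive2://namenode:10000/default" -f /tmp/employees.hql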
Sqoop transfers bulk data efficiently between Hadoop and relational databases such as MySQL and Oracle:

1. Configure sqoop-env.sh: set HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME, HIVE_HOME, and related environment variables.
2. Install the JDBC driver: copy the database driver (e.g. mysql-connector-java-8.0.28.jar) into Sqoop's lib directory.
3. Import: sqoop import --connect jdbc:mysql://localhost:3306/mydb --username root --password 123456 --table employees --target-dir /user/hive/warehouse/employees -m 1 copies MySQL's employees table into the HDFS directory /user/hive/warehouse/employees.
4. Export: sqoop export --connect jdbc:mysql://localhost:3306/mydb --username root --password 123456 --table employees_export --export-dir /user/hive/warehouse/employees_result --input-fields-terminated-by ',' pushes the data under /user/hive/warehouse/employees_result on HDFS into MySQL's employees_export table. A variant that lands the import directly in a Hive table is sketched after this list.
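Since imported data is usually queried through Hive anyway, here is a sketch using Sqoop's --hive-import option; the connection details reuse the assumptions from the commands above, and the Hive table name employees_from_mysql is a hypothetical choice to avoid clashing with the employees table created in the Hive section:

# Import MySQL's employees table directly into a Hive table; Sqoop stages the
# rows on HDFS and then issues the CREATE TABLE / LOAD statements itself
sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password 123456 \
  --table employees \
  --hive-import --hive-table employees_from_mysql \
  -m 1

# Verify the result from the Hive side
hive -e "SELECT COUNT(*) FROM employees_from_mysql;"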
Flume collects data in real time from sources such as log files or Kafka and writes it into HDFS or Hive:

1. Configure flume-env.sh: set JAVA_HOME.
2. Write a flume.conf file that defines the source (e.g. an exec source reading a log file), the channel (e.g. a memory channel buffering events), and the sink (e.g. an hdfs sink writing to HDFS). Example configuration:

agent.sources = logSource
agent.channels = memChannel
agent.sinks = hdfsSink
agent.sources.logSource.type = exec
agent.sources.logSource.command = tail -F /var/log/app.log
# Bind the source to the channel (required; note "channels" is plural for sources)
agent.sources.logSource.channels = memChannel
agent.channels.memChannel.type = memory
agent.channels.memChannel.capacity = 1000
agent.channels.memChannel.transactionCapacity = 100
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
# Resolve the %Y-%m-%d escapes with the agent's local time, since the exec
# source does not add a timestamp header to events
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.writeFormat = Text
agent.sinks.hdfsSink.channel = memChannel
3. Start the agent: run flume-ng agent --conf-file flume.conf --name agent -Dflume.root.logger=INFO,console to begin collecting log data and writing it into HDFS.

Once the tools are wired together, verify the integration with a few quick checks (e.g. hdfs dfs -ls / to list HDFS files, hive -e "SHOW TABLES;" to list Hive tables, and spark-shell to read HDFS data and run a simple computation); a short verification sketch follows. For tuning, adjust the HDFS replication factor (dfs.replication) and block size (dfs.blocksize) to match your data-access patterns, raise the YARN container limits (yarn.scheduler.maximum-allocation-mb, yarn.scheduler.maximum-allocation-vcores) to improve resource utilization, and tune Spark memory (spark.executor.memory, spark.driver.memory) and parallelism (spark.default.parallelism) to improve compute efficiency.
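A quick verification pass that bundles the checks above into one script; the HDFS paths and table name are assumptions carried over from the earlier examples:

# Flume's HDFS sink should be creating dated directories under /flume/logs
hdfs dfs -ls /flume/logs

# Hive should list the employees table created earlier
hive -e "SHOW TABLES;"

# Spark should be able to read the collected logs from HDFS and count the lines;
# spark-shell reads the small snippet from stdin and exits when it is done
echo 'println(spark.read.textFile("hdfs://namenode:8020/flume/logs/*").count())' \
  | spark-shell --master yarn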