欢迎来到尧图网

客户服务 关于我们

您的位置:首页 > 财经 > 创投人物 > SpringBoot操作spark处理hdfs文件

SpringBoot操作spark处理hdfs文件

2025/2/25 11:18:32 来源:https://blog.csdn.net/weixin_51713937/article/details/145002030  浏览:    关键词:SpringBoot操作spark处理hdfs文件

SpringBoot操作spark处理hdfs文件

  • 在这里插入图片描述

1、导入依赖

  • <!--        spark依赖--><dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.12</artifactId><version>3.2.2</version></dependency><dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql_2.12</artifactId><version>3.2.2</version></dependency><!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib --><dependency><groupId>org.apache.spark</groupId><artifactId>spark-mllib_2.12</artifactId><version>3.2.2</version></dependency>
    

2、配置spark信息

  • 建立一个配置文件,配置spark信息
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;//将文件交于spring管理
@Configuration
public class SparkConfig {//使用yml中的配置@Value("${spark.master}")private String sparkMaster;@Value("${spark.appName}")private String sparkAppName;@Value("${hdfs.user}")private String hdfsUser;@Value("${hdfs.path}")private String hdfsPath;@Beanpublic SparkConf sparkConf() {SparkConf conf = new SparkConf();conf.setMaster(sparkMaster);conf.setAppName(sparkAppName);// 添加HDFS配置conf.set("fs.defaultFS", hdfsPath);conf.set("spark.hadoop.hdfs.user",hdfsUser);return conf;}@Beanpublic SparkSession sparkSession() {return SparkSession.builder().config(sparkConf()).getOrCreate();}
}

3、controller和service

  • controller类

    • import org.springframework.beans.factory.annotation.Autowired;
      import org.springframework.web.bind.annotation.GetMapping;
      import org.springframework.web.bind.annotation.RequestMapping;
      import org.springframework.web.bind.annotation.RestController;
      import xyz.zzj.traffic_main_code.service.SparkService;@RestController
      @RequestMapping("/spark")
      public class SparkController {@Autowiredprivate SparkService sparkService;@GetMapping("/run")public String runSparkJob() {//读取Hadoop HDFS文件String filePath = "hdfs://192.168.44.128:9000/subwayData.csv";sparkService.executeHadoopSparkJob(filePath);return "Spark job executed successfully!";}
      }
      
  • 处理地铁数据的service

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import xyz.zzj.traffic_main_code.service.SparkReadHdfs;import java.io.IOException;
import java.net.URI;
import static org.apache.spark.sql.functions.*;@Service
public class SparkReadHdfsImpl implements SparkReadHdfs {private final SparkSession spark;@Value("${hdfs.user}")private String hdfsUser;@Value("${hdfs.path}")private String hdfsPath;@Autowiredpublic SparkReadHdfsImpl(SparkSession spark) {this.spark = spark;}/*** 读取HDFS上的CSV文件并上传到HDFS* @param filePath*/@Overridepublic void sparkSubway(String filePath) {try {// 设置Hadoop配置JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());Configuration hadoopConf = jsc.hadoopConfiguration();hadoopConf.set("fs.defaultFS", hdfsPath);hadoopConf.set("hadoop.user.name", hdfsUser);// 读取HDFS上的文件Dataset<Row> df = spark.read().option("header", "true") // 指定第一行是列名.option("inferSchema", "true") // 自动推断列的数据类型.csv(filePath);// 显示DataFrame的所有数据
//            df.show(Integer.MAX_VALUE, false);// 对DataFrame进行清洗和转换操作// 检查缺失值df.select("number", "people", "dateTime").na().drop().show();// 对数据进行类型转换Dataset<Row> df2 = df.select(col("number").cast(DataTypes.IntegerType),col("people").cast(DataTypes.IntegerType),to_date(col("dateTime"), "yyyy年MM月dd日").alias("dateTime"));// 去重Dataset<Row> df3 = df2.dropDuplicates();// 数据过滤,确保people列没有负数Dataset<Row> df4 = df3.filter(col("people").geq(0));
//            df4.show();// 数据聚合,按dateTime分组,统计每天的总客流量Dataset<Row> df6 = df4.groupBy("dateTime").agg(sum("people").alias("total_people"));
//            df6.show();sparkForSubway(df6,"/time_subwayData.csv");//数据聚合,获取每天人数最多的地铁numberDataset<Row> df7 = df4.groupBy("dateTime").agg(max("people").alias("max_people"));sparkForSubway(df7,"/everyday_max_subwayData.csv");//数据聚合,计算每天的客流强度:每天总people除以632840Dataset<Row> df8 = df4.groupBy("dateTime").agg(sum("people").divide(632.84).alias("strength"));sparkForSubway(df8,"/everyday_strength_subwayData.csv");} catch (Exception e) {e.printStackTrace();}}private static void sparkForSubway(Dataset<Row> df6, String hdfsPath) throws IOException {// 保存处理后的数据到HDFSdf6.coalesce(1).write().mode("overwrite").option("header", "true").csv("hdfs://192.168.44.128:9000/time_subwayData");// 创建Hadoop配置Configuration conf = new Configuration();// 获取FileSystem实例FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.44.128:9000"), conf);// 定义临时目录和目标文件路径Path tempDir = new Path("/time_subwayData");FileStatus[] files = fs.listStatus(tempDir);// 检查目标文件是否存在,如果存在则删除Path targetFile1 = new Path(hdfsPath);if (fs.exists(targetFile1)) {fs.delete(targetFile1, true); // true 表示递归删除}for (FileStatus file : files) {if (file.isFile() && file.getPath().getName().startsWith("part-")) {Path targetFile = new Path(hdfsPath);fs.rename(file.getPath(), targetFile);}}// 删除临时目录fs.delete(tempDir, true);}}

4、运行

  • 项目运行完后,打开浏览器
    • spark处理地铁数据
      • http://localhost:8686/spark/dispose
  • 观察spark和hdfs
    • http://192.168.44.128:8099/
    • http://192.168.44.128:9870/explorer.html#/
      • image-20250109095551610

版权声明:

本网仅为发布的内容提供存储空间,不对发表、转载的内容提供任何形式的保证。凡本网注明“来源:XXX网络”的作品,均转载自其它媒体,著作权归作者所有,商业转载请联系作者获得授权,非商业转载请注明出处。

我们尊重并感谢每一位作者,均已注明文章来源和作者。如因作品内容、版权或其它问题,请及时与我们联系,联系邮箱:809451989@qq.com,投稿邮箱:809451989@qq.com

热搜词