
Comparing Two Ways of Working with RDDs in Spark SQL: A Case Study

Tags: Spark

Preface

GitHub: https://github.com/guofei1219

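Background

Count the number of incoming applications (进件) per new channel.

Comparing the two ways Spark SQL works with RDDs

1. Inferring the schema using reflection. For details, see the description from the official documentation below.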

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.

case class blb_intpc_info(chnl_code:String,id_num:String)

2. Specifying the schema programmatically. For details, see the description from the official documentation below.

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

> 1. Create an RDD of Rows from the original RDD;
> 2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
> 3. Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.

val structTypes = StructType(Array(
  StructField("chnl_code", StringType, true),
  StructField("id_num", StringType, true)
))

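Summary of the comparison

1. The case class approach is easier to read: the schema is visible at a glance.
2. In Scala 2.10 a case class can have at most 22 parameters, so this approach cannot be used for tables with many fields (Scala 2.11 and later lift this limit for case classes).
3. The programmatic approach is better suited to day-to-day development.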
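
As a rough check that the two approaches describe the same table, the sketch below derives the schema from the case class by reflection and compares it with the hand-built StructType. It assumes Spark 1.6 or later is on the classpath, where Encoders.product is available; the object name SchemaComparison is invented for this example and is not part of the original code.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same case class as in the reflection approach.
case class blb_intpc_info(chnl_code: String, id_num: String)

// Hypothetical helper object, introduced only for illustration.
object SchemaComparison {
  // Schema inferred from the case class by reflection.
  val reflected: StructType = Encoders.product[blb_intpc_info].schema

  // Schema specified programmatically.
  val programmatic: StructType = StructType(Array(
    StructField("chnl_code", StringType, true),
    StructField("id_num", StringType, true)
  ))

  def main(args: Array[String]): Unit = {
    reflected.printTreeString()
    println(s"schemas equal: ${reflected == programmatic}")
  }
}
```

Neither schema needs a running SparkContext, so this comparison is cheap to run while deciding which approach fits a given table.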

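Implementation

Source data format

The first column is the channel code and the second column is the application ID. The code below splits each line on a tab character.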

306DC4246 411324199209142831
306DC423A 360124199011241838
306DC423D 440802198010290019
306DC4226 612328197403120016
306DC4201 452629199104050312
306DC4201 350212198505025514
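
To make the goal concrete before bringing in Spark, here is a minimal plain-Scala sketch of the same per-channel count on the six sample rows above. It assumes the two columns are tab-separated, as the Spark code in this post does; the names SampleCount, sampleLines and countByChannel are made up for illustration, and the code uses local collections rather than an RDD.

```scala
// Hypothetical helper object, introduced only for illustration.
object SampleCount {
  // The six sample rows shown above, assumed to be tab-separated.
  val sampleLines: Seq[String] = Seq(
    "306DC4246\t411324199209142831",
    "306DC423A\t360124199011241838",
    "306DC423D\t440802198010290019",
    "306DC4226\t612328197403120016",
    "306DC4201\t452629199104050312",
    "306DC4201\t350212198505025514"
  )

  // Count rows per channel code, largest count first (ties ordered by channel code).
  def countByChannel(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .map(_.split("\t"))
      .collect { case Array(chnl, _) => chnl } // keep only well-formed two-column rows
      .groupBy(identity)
      .map { case (chnl, rows) => (chnl, rows.size) }
      .toSeq
      .sortBy { case (chnl, n) => (-n, chnl) }

  def main(args: Array[String]): Unit =
    countByChannel(sampleLines).foreach { case (chnl, n) => println(s"$chnl\t$n") }
}
```

On these six rows, channel 306DC4201 appears twice and every other channel once. The Spark SQL query later in the post computes the same thing over the full file.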

Reflection approach

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Count the number of applications per channel.
  * Created by Michael on 2016/11/29.
  */
object Custmer_Statistics_CaseClass {

  /**
    * Case class describing the table metadata.
    * @param chnl_code channel code
    * @param id_num    application ID
    */
  case class blb_intpc_info(chnl_code: String, id_num: String)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Custmer_Statistics_CaseClass").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Enables the implicit RDD-to-DataFrame conversion (toDF) and the $"col" syntax
    import sqlContext.implicits._

    // Read the local file and map each line to the case class
    val blb_intpc_infoDF = sc.textFile("C:/work/ideabench/SparkSQL/data/channel/blb_intpc_info_10000_2.txt")
      .map(_.split("\\t"))
      .map(d => blb_intpc_info(d(0), d(1)))
      .toDF()

    // Register the DataFrame as a temporary table
    blb_intpc_infoDF.registerTempTable("blb_intpc_info")

    // Count applications per channel, sorted by count in descending order
    sqlContext.sql(
      "select chnl_code, count(*) as intpc_sum " +
        "from blb_intpc_info " +
        "group by chnl_code")
      .sort($"intpc_sum".desc)
      .show()
  }

}

Output

+---------+---------+
|chnl_code|intpc_sum|
+---------+---------+
|306DC421E|      631|
|306DC4201|      603|
|306DC422B|      472|
|306DC4221|      326|
|306DC425E|      280|
|306DC4237|      277|
|306DC4210|      238|
|306DC4246|      236|
|306DC4229|      223|
|306DC4257|      202|
|306DC420E|      197|
|306DC4215|      183|
|306DC421F|      176|
|306DC425A|      156|
|306DC4251|      140|
|306DC4202|      131|
|306DC424D|      125|
|306DC4226|      122|
|306DC422A|      112|
|306DC422D|      108|

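Programmatic approach

Query the Hive metastore database to obtain the column information for the specified Hive table.
Note: readers unfamiliar with the structure of the Hive metastore tables can search for a few posts on the topic or see the reference articles at the end of this post.
Without further explanation, here is the code.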

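public static String getHiveMetaData(String hiveTableName) {
    // getConn() is a helper defined elsewhere in the same utility class
    Connection conn = getConn();
    // '#' starts a comment in MySQL, which this query is written for
    String sql = "SELECT\n" +
            "  #TBLS.`TBL_NAME`,\n" +
            "  #table name\n" +
            "  COLUMNS_V2.`COLUMN_NAME`\n" +
            "  #column name\n" +
            "  #COLUMNS_V2.`TYPE_NAME` #column type\n" +
            "FROM\n" +
            "  TBLS #table metadata\n" +
            "  LEFT JOIN SDS #storage descriptors\n" +
            "    ON TBLS.SD_ID = SDS.SD_ID\n" +
            "  LEFT JOIN CDS\n" +
            "    ON SDS.CD_ID = CDS.CD_ID\n" +
            "  LEFT JOIN COLUMNS_V2 #column information\n" +
            "    ON CDS.CD_ID = COLUMNS_V2.CD_ID\n" +
            // Use the method argument instead of a hard-coded table name
            "WHERE TBLS.`TBL_NAME` = ?\n" +
            // Return the columns in table order so they line up with the Row fields
            "ORDER BY COLUMNS_V2.`INTEGER_IDX`";
    StringBuilder result = new StringBuilder();
    try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
        pstmt.setString(1, hiveTableName);
        try (ResultSet rs = pstmt.executeQuery()) {
            int col = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                for (int i = 1; i <= col; i++) {
                    // Column names are joined with tabs; the caller splits on "\t"
                    result.append(rs.getString(i)).append("\t");
                }
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return result.toString();
}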

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import utils.DataUtils

/**
  * Count the number of applications per channel.
  * Created by Michael on 2016/11/29.
  */
object Custmer_Statistics_StructType {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Custmer_Statistics_StructType").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Needed for the $"col" syntax used in the sort below
    import sqlContext.implicits._

    // Read the local file and convert each line into a Row
    val blb_intpc_infoRow = sc.textFile("C:/work/ideabench/SparkSQL/data/channel/blb_intpc_info_10000_2.txt")
      .map(_.split("\\t"))
      .map(d => Row(d(0), d(1)))

    // Column names of the Hive table, read from the metastore
    val schemaString = DataUtils.getHiveMetaData("blb_intpc_info")
    val schema = StructType(schemaString.split("\\t")
      .map(fieldName => StructField(fieldName, StringType, true)))

    // Apply the schema to the RDD of Rows
    val blb_intpc_infoDF = sqlContext.createDataFrame(blb_intpc_infoRow, schema)

    // Register the DataFrame as a temporary table
    blb_intpc_infoDF.registerTempTable("blb_intpc_info")

    // Count applications per channel, sorted by count in descending order
    sqlContext.sql(
      "select chnl_code, count(*) as intpc_sum " +
        "from blb_intpc_info " +
        "group by chnl_code")
      .sort($"intpc_sum".desc)
      .show()
  }

}
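
One detail worth spelling out is how the string returned by getHiveMetaData becomes a list of field names. The Java helper appends a tab after every column name, so the result ends with a trailing tab. String.split discards trailing empty strings, so no empty field name ends up in the schema. The small plain-Scala sketch below illustrates this; the value of schemaString is a hard-coded stand-in (an assumption, not real metastore output) and the object name FieldNames is invented for the example.

```scala
// Hypothetical helper object, introduced only for illustration.
object FieldNames {
  // Stand-in for the value returned by DataUtils.getHiveMetaData (assumed, not real output).
  val schemaString: String = "chnl_code\tid_num\t"

  // split drops trailing empty strings, so the trailing tab adds no extra field.
  val fieldNames: Array[String] = schemaString.split("\\t")

  def main(args: Array[String]): Unit =
    println(fieldNames.mkString(", "))
}
```

The number and order of these names have to match the fields of each Row (here d(0) and d(1)), since createDataFrame pairs them by position.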

Author: MichaelFly
Link: https://www.jianshu.com/p/794ba85707b0
