
Spark (Scala) Structured Streaming aggregation and self-join

Stack Overflow user
Asked on 2018-09-12 17:04:28
Answers: 2 · Views: 2.5K · Followers: 0 · Votes: 4

I am trying to perform an aggregation followed by a self-join on a Structured Streaming DataFrame. Suppose the df looks like this:

sourceDf.show(false)
+-----+-------+
|owner|fruits |
+-----+-------+
|Brian|apple  |
|Brian|pear   |
|Brian|melon  |
|Brian|avocado|
|Bob  |avocado|
|Bob  |apple  |
+-----+-------+

On a static DataFrame this is straightforward:

val aggDf = sourceDf.groupBy($"owner").agg(collect_list(col("fruits")) as "fruitsA")
sourceDf.join(aggDf, Seq("owner")).show(false)
+-----+-------+-----------------------------+
|owner|fruits |fruitsA                      |
+-----+-------+-----------------------------+
|Brian|apple  |[apple, pear, melon, avocado]|
|Brian|pear   |[apple, pear, melon, avocado]|
|Brian|melon  |[apple, pear, melon, avocado]|
|Brian|avocado|[apple, pear, melon, avocado]|
|Bob  |avocado|[avocado, apple]             |
|Bob  |apple  |[avocado, apple]             |
+-----+-------+-----------------------------+

Unfortunately, I cannot figure out how to do this on a streaming DataFrame. So I tried the following complete code, using Kafka as both Source and Sink:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructType}


object Test {

  val spark: SparkSession = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val brokers = "kafka:9092"

  val inputTopic = "test.kafka.sink.input"
  val aggTopic = "test.kafka.sink.agg"
  val outputTopicSelf = "test.kafka.sink.output.self"
  val outputTopicTwo = "test.kafka.sink.output.two"

  val payloadSchema: StructType = new StructType()
    .add("owner", StringType)
    .add("fruits", StringType)

  val payloadSchemaA: StructType = new StructType()
    .add("owner", StringType)
    .add("fruitsA", StringType)

  var joinedDfSchema: StructType = _

  val sourceDf: DataFrame = Seq(
    ("Brian", "apple"),
    ("Brian", "pear"),
    ("Brian", "melon"),
    ("Brian", "avocado"),
    ("Bob", "avocado"),
    ("Bob", "apple")
  )
    .toDF("owner", "fruits")

  val additionalData: DataFrame = Seq(("Bob", "grapes")).toDF("owner", "fruits")

  def saveDfToKafka(df: DataFrame): Unit = {
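    // serialize every column into a single JSON string in the "value" column, which is what the Kafka sink expects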
    df
      .select(to_json(struct(df.columns.map(column): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("topic", inputTopic)
      .save()
  }

  // save data to kafka (batch)
  saveDfToKafka(sourceDf)

  // kafka source
  val farmDF: DataFrame = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("startingOffsets", "earliest")
    .option("subscribe", inputTopic)
    .load()
    .byteArrayToString("value")
    .withColumn("value", from_json($"value", payloadSchema))
    .expand("value")

  farmDF.printSchema()

  implicit class DFHelper(df: DataFrame) {
    def expand(column: String): DataFrame = {
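      // keep all other columns and replace the struct column with its nested fields ("$column.*")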
      val wantedColumns = df.columns.filter(_ != column) :+ s"$column.*"
      df.select(wantedColumns.map(col): _*)
    }

    def byteArrayToString(column: String): DataFrame = {
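      // cast the given binary column (here Kafka's "value") to a string, keeping all other columns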
      val selectedCols = df.columns.filter(_ != column) :+ s"CAST($column AS STRING)"
      df.selectExpr(selectedCols: _*)
    }
  }

  def testSelfAggJoinFail(): Unit = {
    // aggregated df
    val myFarmDF = farmDF
      .groupBy($"owner")
      .agg(collect_list(col("fruits")) as "fruitsA")

    // joined df
    val joinedDF = farmDF
      .join(myFarmDF.as("myFarmDF"), Seq("owner"))
      .select("owner", "fruits", "myFarmDF.fruitsA")

    joinedDfSchema = joinedDF.schema

    // stream sink
    joinedDF
      .select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
      .writeStream
      .outputMode("append")
      .option("kafka.bootstrap.servers", brokers)
      .option("checkpointLocation", "/data/kafka/checkpointSelf")
      .option("topic", outputTopicSelf)
      .format("kafka")
      .start()

    // let's give time to process the stream
    Thread.sleep(10000)
  }

  def testSelfAggJoin(): Unit = {
    // aggregated df
    val myFarmDF = farmDF
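      // "timestamp" is the Kafka record timestamp that was kept alongside the expanded payload columns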
      .withWatermark("timestamp", "30 seconds")
      .groupBy(
        window($"timestamp", "30 seconds", "15 seconds"),
        $"owner"
      )
      .agg(collect_list(col("fruits")) as "fruitsA")
      .select("owner", "fruitsA", "window")

    // joined df
    val joinedDF = farmDF
        .as("farmDF")
      .withWatermark("timestamp", "30 seconds")
      .join(
        myFarmDF.as("myFarmDF"),
        expr(
          """
            |farmDF.owner = myFarmDF.owner AND
            |farmDF.timestamp >= myFarmDF.window.start AND
            |farmDF.timestamp <= myFarmDF.window.end
          """.stripMargin))
      .select("farmDF.owner", "farmDF.fruits", "myFarmDF.fruitsA")

    joinedDfSchema = joinedDF.schema

    // stream sink
    joinedDF
      .select(to_json(struct(joinedDF.columns.map(column): _*)).alias("value"))
      .writeStream
      .outputMode("append")
      .option("kafka.bootstrap.servers", brokers)
      .option("checkpointLocation", "/data/kafka/checkpointSelf")
      .option("topic", outputTopicSelf)
      .format("kafka")
      .start()

    // let's give time to process the stream
    Thread.sleep(10000)
  }

  def testTwoDfAggJoin(): Unit = {
    // aggregated df
    val myFarmDF = farmDF
      .withWatermark("timestamp", "30 seconds")
      .groupBy(
        $"owner"
      )
      .agg(collect_list(col("fruits")) as "fruitsA")
      .select("owner", "fruitsA")

    // save the aggregated df to kafka
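    // update mode: only the groups changed since the last trigger are re-emitted
    // (append would need the watermarked event-time column in the groupBy so groups can be finalized)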
    myFarmDF
      .select(to_json(struct(myFarmDF.columns.map(column):_*)).alias("value"))
      .writeStream
      .outputMode("update")
      .option("kafka.bootstrap.servers", brokers)
      .option("checkpointLocation", "/data/kafka/checkpointAgg")
      .option("topic", aggTopic)
      .format("kafka")
      .start()

    // let's give time to process the stream
    Thread.sleep(10000)

    // read the aggregated df from kafka as a stream
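    // note: payloadSchemaA declares fruitsA as StringType, so it comes back as a raw JSON string, not an array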
    val aggDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("startingOffsets", "earliest")
      .option("subscribe", aggTopic)
      .load()
      .byteArrayToString("value")
      .withColumn("value", from_json($"value", payloadSchemaA))
      .expand("value")
      .withWatermark("timestamp", "30 seconds")

    // joined df
    val joinedDF = farmDF
      .as("farmDF")
      .join(
        aggDF.as("myFarmDF"),
        expr(
          """
            |farmDF.owner = myFarmDF.owner AND
            |farmDF.timestamp >= myFarmDF.timestamp - interval 1 hour AND
            |farmDF.timestamp <= myFarmDF.timestamp + interval 1 hour
          """.stripMargin))
      .select("farmDF.owner", "myFarmDF.fruitsA", "farmDF.fruits")

    joinedDfSchema = joinedDF.schema

    // stream sink
    joinedDF
      .select(to_json(struct(joinedDF.columns.map(column):_*)).alias("value"))
      .writeStream
      .outputMode("append")
      .option("kafka.bootstrap.servers", brokers)
      .option("checkpointLocation", "/data/kafka/checkpointTwo")
      .option("topic", outputTopicTwo)
      .format("kafka")
      .start()

    // let's give time to process the stream
    Thread.sleep(10000)
  }

  def data(topic: String): DataFrame = {
    // let's read back the output topic using kafka batch
    spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .load()
      .byteArrayToString("value")
      .withColumn("value", from_json($"value", joinedDfSchema))
      .expand("value")
  }
}

Now, if I run the first test on the streaming DataFrame:

scala> Test.testSelfAggJoinFail
org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Project [structstojson(named_struct(owner, owner#59, fruits, fruits#60, fruitsA, fruitsA#78), Some(Etc/UTC)) AS value#96]
+- Project [owner#59, fruits#60, fruitsA#78]
   +- Project [owner#59, key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, fruits#60, fruitsA#78]
      +- Join Inner, (owner#59 = owner#82)
         :- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#59, value#51.fruits AS fruits#60]
         :  +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
         :     +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
         :        +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]
         +- SubqueryAlias myFarmDF
            +- Aggregate [owner#82], [owner#82, collect_list(fruits#83, 0, 0) AS fruitsA#78]
               +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, value#51.owner AS owner#82, value#51.fruits AS fruits#83]
                  +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, jsontostructs(StructField(owner,StringType,true), StructField(fruits,StringType,true), value#43, Some(Etc/UTC), true) AS value#51]
                     +- Project [key#29, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35, cast(value#30 as string) AS value#43]
                        +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@3269e790, kafka, Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092), [key#29, value#30, topic#31, partition#32, offset#33L, timestamp#34, timestampType#35], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@42eeb996,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> test.kafka.sink.input, kafka.bootstrap.servers -> kafka:9092),None), kafka, [key#22, value#23, topic#24, partition#25, offset#26L, timestamp#27, timestampType#28]

  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
  at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:110)
  at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:235)
  at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:299)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:296)
  at Test$.testSelfAggJoinFail(<console>:123)
  ... 51 elided

It fails with Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark, because I am not using a watermark.

Now, if I run the second test,

Test.testSelfAggJoin

I get these warnings:

2018-09-12 16:07:33 WARN  StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN  StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end
2018-09-12 16:07:33 WARN  StreamingJoinHelper:66 - Failed to extract state value watermark from condition (window#70-T30000ms.start - timestamp#139-T30000ms) due to window#70-T30000ms.start
2018-09-12 16:07:33 WARN  StreamingJoinHelper:66 - Failed to extract state value watermark from condition (timestamp#139-T30000ms - window#70-T30000ms.end) due to window#70-T30000ms.end

I can then check the result with

Test.data(Test.outputTopicSelf).show(false)
2018-09-12 16:08:01 WARN  NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 2 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
2018-09-12 16:08:01 WARN  NetworkClient:882 - [Consumer clientId=consumer-5, groupId=spark-kafka-relation-02f5512f-cc3c-40ad-938f-e3dfdca95f8c-driver-0] Error while fetching metadata with correlation id 6 : {test.kafka.sink
.output.self=LEADER_NOT_AVAILABLE}
+---+-----+---------+------+---------+-------------+-----+------+-------+
|key|topic|partition|offset|timestamp|timestampType|owner|fruits|fruitsA|
+---+-----+---------+------+---------+-------------+-----+------+-------+
+---+-----+---------+------+---------+-------------+-----+------+-------+

which returns an empty DataFrame (probably because of those warnings?). I could not find a working solution based on a self-join.

Finally, I tried moving the aggregation to Kafka and reading it back as a second streaming DataFrame, as in:

scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic                     |partition|offset|timestamp              |timestampType|owner|fruitsA                           |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0        |0     |2018-09-12 16:57:04.376|0            |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0        |1     |2018-09-12 16:57:04.376|0            |Bob  |["apple","avocado"]               |apple  |
|null|test.kafka.sink.output.two|0        |2     |2018-09-12 16:57:04.38 |0            |Brian|["avocado","apple","pear","melon"]|apple  |
|null|test.kafka.sink.output.two|0        |3     |2018-09-12 16:57:04.38 |0            |Bob  |["apple","avocado"]               |avocado|
|null|test.kafka.sink.output.two|0        |4     |2018-09-12 16:57:04.381|0            |Brian|["avocado","apple","pear","melon"]|pear   |
|null|test.kafka.sink.output.two|0        |5     |2018-09-12 16:57:04.382|0            |Brian|["avocado","apple","pear","melon"]|melon  |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+

This works (although not very efficiently, I would say), but if I add additional data to the source topic:

scala> Test.saveDfToKafka(Test.additionalData)
scala> Test.data(Test.outputTopicTwo).show(false)
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|key |topic                     |partition|offset|timestamp              |timestampType|owner|fruitsA                           |fruits |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+
|null|test.kafka.sink.output.two|0        |0     |2018-09-12 16:57:04.376|0            |Brian|["avocado","apple","pear","melon"]|avocado|
|null|test.kafka.sink.output.two|0        |1     |2018-09-12 16:57:04.376|0            |Bob  |["apple","avocado"]               |apple  |
|null|test.kafka.sink.output.two|0        |2     |2018-09-12 16:57:04.38 |0            |Brian|["avocado","apple","pear","melon"]|apple  |
|null|test.kafka.sink.output.two|0        |3     |2018-09-12 16:57:04.38 |0            |Bob  |["apple","avocado"]               |avocado|
|null|test.kafka.sink.output.two|0        |4     |2018-09-12 16:57:04.381|0            |Brian|["avocado","apple","pear","melon"]|pear   |
|null|test.kafka.sink.output.two|0        |5     |2018-09-12 16:57:04.382|0            |Brian|["avocado","apple","pear","melon"]|melon  |
|null|test.kafka.sink.output.two|0        |6     |2018-09-12 16:59:37.125|0            |Bob  |["apple","avocado"]               |grapes |
|null|test.kafka.sink.output.two|0        |7     |2018-09-12 16:59:40.001|0            |Bob  |["apple","avocado","grapes"]      |apple  |
|null|test.kafka.sink.output.two|0        |8     |2018-09-12 16:59:40.002|0            |Bob  |["apple","avocado","grapes"]      |avocado|
|null|test.kafka.sink.output.two|0        |9     |2018-09-12 16:59:40.002|0            |Bob  |["apple","avocado","grapes"]      |grapes |
+----+--------------------------+---------+------+-----------------------+-------------+-----+----------------------------------+-------+

I get extra rows, probably because I had to use .outputMode("update") when writing out the aggregated Df.

  • Is there a way to perform this aggregation without sending the aggregated data back to Kafka on a separate topic?
  • If not, is it possible to modify testTwoDfAggJoin to use .outputMode("append")?

2 Answers

Stack Overflow user

Answered on 2019-06-27 13:02:25

In Spark 2.3, it is not possible to join two streaming DataFrames when the query contains an aggregation before the join.

From the Spark documentation:

Additional details on supported joins:

    Joins can be cascaded, that is, you can do df1.join(df2, ...).join(df3, ...).join(df4, ....).

    As of Spark 2.3, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.

    As of Spark 2.3, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.

        Cannot use streaming aggregations before joins.

        Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
Votes: 3

Stack Overflow user

Answered on 2018-11-13 11:53:43

I ran into a similar error message. The outputMode of the aggregation matters here; I solved the problem by adding df.writeStream.outputMode(OutputMode.Update()) (or df.writeStream.outputMode(OutputMode.Complete())).
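
A minimal sketch of that fix, assuming the farmDF and imports defined in the question (a console sink is used here purely for illustration):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode

// The aggregation alone, without the join: in Complete (or Update) output mode the
// sink receives the whole (or changed) result table on every trigger, so the query
// can start without the watermark that append mode would require.
val aggQuery = farmDF
  .groupBy($"owner")
  .agg(collect_list(col("fruits")) as "fruitsA")
  .writeStream
  .outputMode(OutputMode.Complete())   // OutputMode.Update() also works here
  .format("console")
  .option("truncate", "false")
  .start()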

Reference:

Output modes. There are several output modes:

  • Append mode (default) - Only the new rows added to the result table since the last trigger are output to the sink. This is supported only for queries where rows added to the result table never change, so this mode guarantees that each row is output exactly once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. support append mode.
  • Complete mode - The whole result table is output to the sink after every trigger. This is supported for aggregation queries.
  • Update mode - (available since Spark 2.1.1) Only the rows of the result table that were updated since the last trigger are output to the sink. More information will be added in future releases.

http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-3/

Votes: 0
Original content from Stack Overflow.
Original link: https://stackoverflow.com/questions/52300247
