我在连接spark结构化流数据帧和批数据帧时遇到了一个问题,在我的场景中,我有一个S3流,它需要与历史数据进行左反连接,它返回历史中不存在的记录(计算出新的记录),并将这些记录作为新的追加写入历史(按列分区磁盘数据分区而不是内存)。
当我刷新已分区的历史数据框架时,我的历史数据框架不会更新。
下面是两个代码片段,一个可以工作,另一个不能工作。
工作代码和非工作代码之间唯一的区别是partition_by子句。
工作代码:-(历史记录被刷新)
import spark.implicits._
val inputSchema = StructType(
Array(
StructField("spark_id", StringType),
StructField("account_id", StringType),
StructField("run_dt", StringType),
StructField("trxn_ref_id", StringType),
StructField("trxn_dt", StringType),
StructField("trxn_amt", StringType)
)
)
val historySchema = StructType(
Array(
StructField("spark_id", StringType),
StructField("account_id", StringType),
StructField("run_dt", StringType),
StructField("trxn_ref_id", StringType),
StructField("trxn_dt", StringType),
StructField("trxn_amt", StringType)
)
)
val source = spark.readStream
.schema(inputSchema)
.option("header", "false")
.csv("src/main/resources/Input/")
val history = spark.read
.schema(inputSchema)
.option("header", "true")
.csv("src/main/resources/history/")
.withColumnRenamed("spark_id", "spark_id_2")
.withColumnRenamed("account_id", "account_id_2")
.withColumnRenamed("run_dt", "run_dt_2")
.withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
.withColumnRenamed("trxn_dt", "trxn_dt_2")
.withColumnRenamed("trxn_amt", "trxn_amt_2")
val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")
val recordsNotPresentInHist = source
.join(
history,
source.col("account_id") === history.col("account_id_2") &&
source.col("run_dt") === history.col("run_dt_2") &&
source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
source.col("trxn_dt") === history.col("trxn_dt_2") &&
source.col("trxn_amt") === history.col("trxn_amt_2"),
"leftanti"
)
recordsNotPresentInHist.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write
.mode(SaveMode.Append)
//.partitionBy("spark_id", "account_id", "run_dt")
.csv("src/main/resources/history/")
val lkpChacheFileDf1 = spark.read
.schema(inputSchema)
.parquet("src/main/resources/history")
val lkpChacheFileDf = lkpChacheFileDf1
lkpChacheFileDf.unpersist(true)
val histLkpPersist = lkpChacheFileDf.persist()
histLkpPersist.createOrReplaceTempView("hist")
}
.start()
println("This is the kafka dataset:")
source
.withColumn("Input", lit("Input-source"))
.writeStream
.format("console")
.outputMode("append")
.start()
recordsNotPresentInHist
.withColumn("reject", lit("recordsNotPresentInHist"))
.writeStream
.format("console")
.outputMode("append")
.start()
spark.streams.awaitAnyTermination()不工作:-(历史记录不会被刷新)
import spark.implicits._
val inputSchema = StructType(
Array(
StructField("spark_id", StringType),
StructField("account_id", StringType),
StructField("run_dt", StringType),
StructField("trxn_ref_id", StringType),
StructField("trxn_dt", StringType),
StructField("trxn_amt", StringType)
)
)
val historySchema = StructType(
Array(
StructField("spark_id", StringType),
StructField("account_id", StringType),
StructField("run_dt", StringType),
StructField("trxn_ref_id", StringType),
StructField("trxn_dt", StringType),
StructField("trxn_amt", StringType)
)
)
val source = spark.readStream
.schema(inputSchema)
.option("header", "false")
.csv("src/main/resources/Input/")
val history = spark.read
.schema(inputSchema)
.option("header", "true")
.csv("src/main/resources/history/")
.withColumnRenamed("spark_id", "spark_id_2")
.withColumnRenamed("account_id", "account_id_2")
.withColumnRenamed("run_dt", "run_dt_2")
.withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
.withColumnRenamed("trxn_dt", "trxn_dt_2")
.withColumnRenamed("trxn_amt", "trxn_amt_2")
val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")
val recordsNotPresentInHist = source
.join(
history,
source.col("account_id") === history.col("account_id_2") &&
source.col("run_dt") === history.col("run_dt_2") &&
source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
source.col("trxn_dt") === history.col("trxn_dt_2") &&
source.col("trxn_amt") === history.col("trxn_amt_2"),
"leftanti"
)
recordsNotPresentInHist.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.write
.mode(SaveMode.Append)
.partitionBy("spark_id", "account_id","run_dt")
.csv("src/main/resources/history/")
val lkpChacheFileDf1 = spark.read
.schema(inputSchema)
.parquet("src/main/resources/history")
val lkpChacheFileDf = lkpChacheFileDf1
lkpChacheFileDf.unpersist(true)
val histLkpPersist = lkpChacheFileDf.persist()
histLkpPersist.createOrReplaceTempView("hist")
}
.start()
println("This is the kafka dataset:")
source
.withColumn("Input", lit("Input-source"))
.writeStream
.format("console")
.outputMode("append")
.start()
recordsNotPresentInHist
.withColumn("reject", lit("recordsNotPresentInHist"))
.writeStream
.format("console")
.outputMode("append")
.start()
spark.streams.awaitAnyTermination()感谢Sri
发布于 2021-06-02 23:38:20
我通过使用union by name函数而不是从磁盘读取刷新的数据解决了这个问题。
第1步:-读取历史S3
第2步:-阅读卡夫卡并查看历史
步骤3:-将处理后的数据写入磁盘,并使用联合按名称spark函数附加到步骤1中创建的数据帧。
步骤1代码(读取历史数据帧):-
val acctHistDF = sparkSession.read
.schema(schema)
.parquet(S3path)
val acctHistDFPersisted = acctHistDF.persist()
acctHistDFPersisted.createOrReplaceTempView("acctHist")步骤2(使用流数据刷新历史数据帧):-
val history = sparkSession.table("acctHist")
history.unionByName(stream)
history.createOrReplaceTempView("acctHist")感谢Sri
https://stackoverflow.com/questions/66911985
复制相似问题