I am using Java-Spark.
I have the following records from Kafka in an RDD (as strings):
{"code":"123", "date":"14/07/2018",....}
{"code":"124", "date":"15/07/2018",....}
{"code":"123", "date":"15/07/2018",....}
{"code":"125", "date":"14/07/2018",....}
I read them into a Dataset as follows:
Dataset<Row> df = sparkSession.read().json(jsonSet);
Dataset<Row> dfSelect = df.select(cols); // where cols is a Column[]
I want to write the JSON records to different Hive tables and different partitions by mapping them to separate Datasets, that is:
{"code":"123", "date":"14/07/2018",....} Write to HDFS dir -> /../table123/partition=14_07_2018
{"code":"124", "date":"15/07/2018",....} Write to HDFS dir -> /../table124/partition=15_07_2018
{"code":"123", "date":"15/07/2018",....} Write to HDFS dir -> /../table123/partition=15_07_2018
{"code":"125", "date":"14/07/2018",....} Write to HDFS dir -> /../table125/partition=14_07_2018
How can I map the JSON by code and date, and then write it like this:
dfSelectByTableAndDate123.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate124.write().format("parquet").mode("append").save(pathByTableAndDate);
dfSelectByTableAndDate125.write().format("parquet").mode("append").save(pathByTableAndDate);
Thanks
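To make the mapping asked about above concrete, here is a minimal sketch of one possible approach, not a definitive implementation. It assumes that `code` and `date` are among the selected columns and that the number of distinct (code, date) pairs is small enough to loop over on the driver. `TargetPaths`, `targetDir`, `groupByTarget`, and `baseDir` are hypothetical names introduced for illustration. `baseDir` stands in for the elided `/../` prefix. The Spark part is left as a comment because it needs Spark on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spark): derives the output directory
// for one record from its "code" and "date" fields.
final class TargetPaths {

    // baseDir is a placeholder for the elided "/../" prefix in the question.
    static String targetDir(String baseDir, String code, String date) {
        // "14/07/2018" -> "14_07_2018", so the date fits in one directory name.
        String partition = date.replace('/', '_');
        return baseDir + "/table" + code + "/partition=" + partition;
    }

    // Groups (code, date) pairs by target directory, keeping first-seen order.
    static Map<String, List<String[]>> groupByTarget(String baseDir,
                                                     List<String[]> codeDatePairs) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] pair : codeDatePairs) {
            String dir = targetDir(baseDir, pair[0], pair[1]);
            groups.computeIfAbsent(dir, k -> new ArrayList<>()).add(pair);
        }
        return groups;
    }

    // Sketch of the Spark side (assumes a static import of
    // org.apache.spark.sql.functions.col and the dfSelect from the question):
    //
    //   for (Row r : dfSelect.select("code", "date").distinct().collectAsList()) {
    //       String code = r.getString(0);
    //       String date = r.getString(1);
    //       dfSelect.filter(col("code").equalTo(code).and(col("date").equalTo(date)))
    //               .write().format("parquet").mode("append")
    //               .save(targetDir(baseDir, code, date));
    //   }
}
```

This loop runs one Spark job per distinct (code, date) pair, so it is only reasonable when there are few such pairs. `DataFrameWriter.partitionBy` is the built-in alternative, but it names directories after the column (`column=value`), so the layout would differ from the `partition=...` naming requested here.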
https://stackoverflow.com/questions/51380948