My code looks like this:
val data1 = data.withColumn("local_date_time", toLocalDateUdf('timestamp))
data1
  .withColumn("year", year(col("local_date_time")))
  .withColumn("month", month(col("local_date_time")))
  .withColumn("day", dayofmonth(col("local_date_time")))
  .withColumn("hour", hour(col("local_date_time")))
  .drop("local_date_time")
  .write
  .mode("append")
  .partitionBy("year", "month", "day", "hour")
  .format("json")
  .save("s3a://path/")

This creates nested folders in S3 like year=2020 / month=5 / day=10 (year is the column name, 2020 is the column value). I want nested folders like 2020 / 5 / 10 instead. When I use the partitionBy method, Spark adds the column name to each directory name.
Here is the relevant source code from Spark:
/**
* Partitions the output by the given columns on the file system. If specified, the output is
* laid out on the file system similar to Hive's partitioning scheme. As an example, when we
* partition a dataset by year and then month, the directory layout would look like:
* <ul>
* <li>year=2016/month=01/</li>
* <li>year=2016/month=02/</li>
* </ul>
*/
@scala.annotation.varargs
def partitionBy(colNames: String*): DataFrameWriter[T] = {
this.partitioningColumns = Option(colNames)
this
}

How can I remove the column names from the directory layout?
Posted on 2020-05-19 18:04:24
.partitionBy("year", "month", "day", "hour")
The call above saves the output with partitions in the partition=value format. This is not a bug; it is the standard (Hive-style) layout for partitioned data. If you need a different layout, you can iterate over each partition and save it manually. Otherwise, see:
https://stackoverflow.com/questions/61813691
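The manual approach can be sketched roughly like this (a sketch only; the helper name `writeWithoutColumnNames` and the base path are illustrative, and this launches one write job per distinct partition tuple, so it only makes sense when the number of partitions is modest):

```scala
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import org.apache.spark.sql.functions.col

// Write each partition of `data1` under basePath/<year>/<month>/<day>/<hour>,
// i.e. values only, without the "column=" prefixes that partitionBy adds.
def writeWithoutColumnNames(data1: DataFrame, basePath: String): Unit = {
  val partCols = Seq("year", "month", "day", "hour")

  // Collect the distinct partition tuples to the driver.
  val partitions: Array[Row] = data1.select(partCols.map(col): _*).distinct().collect()

  partitions.foreach { row =>
    val values = partCols.map(c => row.getAs[Any](c))
    // e.g. "s3a://path/2020/5/10/3"
    val path = (basePath +: values.map(_.toString)).mkString("/")
    // Predicate selecting exactly this partition's rows.
    val predicate = partCols.zip(values).map { case (c, v) => col(c) === v }.reduce(_ && _)

    data1.filter(predicate)
      .drop(partCols: _*) // drop partition columns from the payload, as partitionBy would
      .write
      .mode(SaveMode.Append)
      .format("json")
      .save(path)
  }
}
```

Note that Spark's partition discovery relies on the column=value convention, so data written this way cannot be read back with automatic partition-column inference; you would have to add the columns back (e.g. from the path) when reading.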