I have the code snippet below, and I'm a bit confused about the execution order of "orderBy" and "partitionBy".
MY_DATA_FRAME.orderBy(ORDER_BY_FIELD)
  .coalesce(NUM_OF_PARTITIONS)
  .write.format("parquet")
  .option("compression", "gzip")
  .partitionBy(PARTITION_BY_FIELD)
  .option("path", LOCATION)
  .save(FILE_NAME)
After partitionBy splits the data and writes it out, will each output file still be sorted by ORDER_BY_FIELD?
Thanks.
Posted on 2020-06-10 14:07:57
Looking at the Spark physical plan, no additional sort appears to be performed when the partitioned files are saved after the `order by`. Therefore, I think the row ordering specified in `order by` should be preserved.
spark.sql(
"""
|CREATE TABLE IF NOT EXISTS data_source_tab1 (col1 INT, p1 STRING, p2 STRING)
| USING PARQUET PARTITIONED BY (p1, p2)
""".stripMargin).show(false)
val table = spark.sql("select p2, col1 from values ('bob', 1), ('sam', 2), ('bob', 1) T(p2,col1)")
table.createOrReplaceTempView("table")
spark.sql(
"""
|INSERT INTO data_source_tab1 PARTITION (p1 = 'part1', p2)
| SELECT p2, col1 FROM table order by col1
""".stripMargin).explain(true)

The physical plan:
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand file:/.../spark-warehouse/data_source_tab1, Map(p1 -> part1), false, [p1#14, p2#13], Parquet, Map(path -> file:/.../spark-warehouse/data_source_tab1), Append, CatalogTable(
Database: default
Table: data_source_tab1
Created Time: Wed Jun 10 11:25:12 IST 2020
Last Access: Thu Jan 01 05:29:59 IST 1970
Created By: Spark 2.4.5
Type: MANAGED
Provider: PARQUET
Location: file:/.../spark-warehouse/data_source_tab1
Partition Provider: Catalog
Partition Columns: [`p1`, `p2`]
Schema: root
-- col1: integer (nullable = true)
-- p1: string (nullable = true)
-- p2: string (nullable = true)
), org.apache.spark.sql.execution.datasources.CatalogFileIndex@bbb7b43b, [col1, p1, p2]
+- *(1) Project [cast(p2#1 as int) AS col1#12, part1 AS p1#14, cast(col1#2 as string) AS p2#13]
+- *(1) Sort [col1#2 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(col1#2 ASC NULLS FIRST, 2)
         +- LocalTableScan [p2#1, col1#2]

https://stackoverflow.com/questions/62292756
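The intuition can be sketched outside Spark with a hypothetical toy model (plain Python, not Spark's actual writer): if a task's rows are already sorted and `partitionBy`-style splitting just scans them in order, routing each row to the file for its partition value without re-sorting, then every per-partition file inherits the original sort order. The function and data names below are invented for illustration.

```python
from collections import defaultdict

def partitioned_write(rows, partition_key):
    """Toy model of a partitionBy write: split already-sorted rows into
    one 'file' (list) per partition value, scanning rows in their
    existing order -- a stable split with no re-sorting."""
    files = defaultdict(list)
    for row in rows:
        files[row[partition_key]].append(row)
    return dict(files)

# Rows sorted by "col1", mimicking the state after orderBy(ORDER_BY_FIELD)
rows = sorted(
    [{"p2": "bob", "col1": 3}, {"p2": "sam", "col1": 1},
     {"p2": "bob", "col1": 2}, {"p2": "sam", "col1": 4}],
    key=lambda r: r["col1"],
)

files = partitioned_write(rows, "p2")
# Each per-partition "file" keeps the col1 ordering of the sorted input
for part, contents in files.items():
    values = [r["col1"] for r in contents]
    assert values == sorted(values)
```

Note this only models ordering within a single task's output; across multiple files for the same partition value, order between files is governed by the range partitioning (`Exchange rangepartitioning`) seen in the plan above.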