我正在读一篇关于星火联接中数据偏斜的有趣文章。有一个例子,在数据集和调用的连接中都重命名了join列。作者声称这是可行的,但我不明白为什么它会工作,与前面的示例相比,连接是在不重新命名的情况下执行的。这篇文章是加入火种的艺术。
该条的相关代码如下:
// The following row avoids the broadcasting, the dimension_table2 is very small
spark.conf.set("spark.sql.autoBroadcastJoinThreshold",-1)
// I'm using caching to simplify the DAG
dimension_table2.cache
dimension_table2.count
// One way to use the same partitioner is to partition on a column with the same name,
// let's rename the columns that we want to join
fact_table = fact_table.withColumnRenamed("dimension_2_key", "repartition_id")
dimension_table2 = dimension_table2.withColumnRenamed("id", "repartition_id")
fact_table = fact_table.repartition(400, fact_table.col("repartition_id"))
fact_table = fact_table.join(dimension_table2.repartition(400, dimension_table2.col("repartition_id")),
fact_table.col("repartition_id") === dimension_table2.col("repartition_id"), "left")
fact_table.count发布于 2020-06-11 16:00:47
我上面提到的那篇文章是不正确的。我没有看到与重命名的列有任何不同。
https://stackoverflow.com/questions/60780466
复制相似问题