Q: PySpark rank based on the previous/current row

Stack Overflow user

Asked 2020-02-23 11:21:23

Answers: 1 | Views: 82 | Followers: 0 | Votes: 1

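Below are my source and expected output DataFrames.

I need to apply the following logic to compute the final rank value: if previous row (hdr) == current row (hdr) and previous row (dtl) == current row (dtl), assign the previous row's rank; otherwise assign the previous row's rank + 1.

I could not get any further after applying dense_rank. Could you share your thoughts? Because of the potential performance overhead, I am trying to avoid a window without a partitionBy column.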
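The rule above can be sketched in plain Python to make the expected result concrete. This is only an illustration on the sample rows, not a Spark solution; `rank_by_previous_row` is a hypothetical helper name, and it assumes the rows are already sorted by (hdr, dtl).

```python
# Hypothetical helper illustrating the rule: keep the previous rank when
# (hdr, dtl) is unchanged from the previous row, otherwise previous rank + 1.
def rank_by_previous_row(rows):
    ranks = []
    prev = None   # no previous row before the first one
    rank = 0
    for row in rows:
        if row != prev:      # hdr or dtl changed compared with the previous row
            rank += 1
        ranks.append(rank)
        prev = row
    return ranks

sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000),
          (200, 1000), (300, 1000), (300, 2000)]

# Assumption: rows are processed in (hdr, dtl) order.
print(rank_by_previous_row(sorted(sample)))  # [1, 1, 2, 3, 3, 4, 5]
```

Because the rows are sorted, this is equivalent to a dense rank over the (hdr, dtl) pairs of the whole dataset.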

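from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()
sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000), (200, 1000), (300, 1000), (300, 2000)]
test = spark.createDataFrame(sample, ['hdr', 'dtl'])
spec = Window.partitionBy('hdr').orderBy('hdr', 'dtl')
test.withColumn('dense', func.dense_rank().over(spec)).show()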
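A plain-Python emulation may help show why this attempt stalls: with partitionBy('hdr'), the dense rank restarts at 1 inside every hdr group instead of continuing across the dataset. The helper name `dense_rank_within_hdr` is hypothetical, and the sketch only mimics the window's semantics on the sample rows.

```python
from itertools import groupby

# Hypothetical helper: mimics dense_rank() over a window partitioned by hdr
# and ordered by dtl, using plain Python on a small list of tuples.
def dense_rank_within_hdr(rows):
    out = []
    for hdr, group in groupby(sorted(rows), key=lambda r: r[0]):
        dtl_rank = {}  # rank numbering starts over for each hdr partition
        for _, dtl in group:
            # A dtl value not seen yet in this partition gets the next rank.
            dtl_rank.setdefault(dtl, len(dtl_rank) + 1)
            out.append((hdr, dtl, dtl_rank[dtl]))
    return out

sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000),
          (200, 1000), (300, 1000), (300, 2000)]

partitioned = dense_rank_within_hdr(sample)
print([rank for _, _, rank in partitioned])  # [1, 1, 2, 1, 1, 1, 2]
```

The desired sequence is 1, 1, 2, 3, 3, 4, 5, so the rank has to carry over from one hdr group to the next.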

pyspark

1 Answer

Stack Overflow user

Answered 2020-02-23 14:43:50

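I don't think this ranking is possible without a window function. In your case the rank has to be computed across the whole dataset, so a window without partitionBy cannot be avoided. We can, however, reduce the amount of data that ends up in that single partition with the code below.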

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import dense_rank

spark = SparkSession.builder.getOrCreate()
sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000), (200, 1000), (300, 1000), (300, 2000)]
test = spark.createDataFrame(sample, ['hdr', 'dtl'])

# Selecting only the distinct (hdr, dtl) pairs eliminates a large amount of data.
dist_hdr_dtl = test.select("hdr", "dtl").distinct()

# With the data size reduced, a window without partitionBy is acceptable here.
spec = Window.orderBy('hdr', 'dtl')
dist_hdr_dtl = dist_hdr_dtl.withColumn('final_rank', dense_rank().over(spec))

# Join back to the original data to attach the ranks.
# Note: if the distinct dataset is not very large, a broadcast join can improve performance.
test.join(dist_hdr_dtl, ["hdr", "dtl"], "inner").orderBy('hdr', 'dtl').show()

+---+----+----------+
|hdr| dtl|final_rank|
+---+----+----------+
|100|1000|         1|
|100|1000|         1|
|100|2000|         2|
|200|1000|         3|
|200|1000|         3|
|300|1000|         4|
|300|2000|         5|
+---+----+----------+
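The same three steps (take distinct pairs, rank them globally, join back) can be mirrored in plain Python to check the logic against the table above. This is a sketch on in-memory tuples, not Spark code; `rank_via_distinct` is a hypothetical name, and the dictionary lookup stands in for the join.

```python
# Hypothetical helper mirroring the answer's steps without Spark.
def rank_via_distinct(rows):
    # Step 1: distinct (hdr, dtl) pairs, ordered by hdr then dtl.
    distinct_pairs = sorted(set(rows))
    # Step 2: global dense rank over the distinct pairs (1-based).
    final_rank = {pair: i for i, pair in enumerate(distinct_pairs, start=1)}
    # Step 3: "join" the rank back onto every original row.
    return [(hdr, dtl, final_rank[(hdr, dtl)]) for hdr, dtl in sorted(rows)]

sample = [(100, 1000), (100, 1000), (100, 2000), (200, 1000),
          (200, 1000), (300, 1000), (300, 2000)]

ranked = rank_via_distinct(sample)
for row in ranked:
    print(row)
```

Only the distinct pairs go through the global ordering step, which is the reason the answer expects the unpartitioned window to be cheaper on the reduced dataset.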

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/60358910
