+-------+--------------------+-------+
|  brand|       category_code|  count|
+-------+--------------------+-------+
|samsung|electronics.smart...|1782386|
|  apple|electronics.smart...|1649525|
| xiaomi|electronics.smart...| 924383|
| huawei|electronics.smart...| 477946|
|   oppo|electronics.smart...| 242022|
|samsung|electronics.video.tv| 183988|
|  apple|electronics.audio...| 165277|
|   acer|  computers.notebook| 154599|
|  casio|  electronics.clocks| 141403|

After running groupBy and count, I want to select, for each category_code, the value in the brand column that corresponds to the maximum of the count column. For example, for the electronics.smartphone group in the category_code column, I want the string samsung from the brand column, because it has the highest value in the count column...
Posted on 2021-10-27 01:06:21
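First use groupBy to find the maximum count for each category_code, then join the result back to the original DataFrame to retrieve the brand that corresponds to that maximum:
from pyspark.sql import functions as F
df1 = df.groupBy("category_code").agg(F.max("count").alias("count"))
# the join keeps only rows whose count equals the maximum for their category_code
df2 = df.join(df1, ["count", "category_code"]).drop("count")
This produces df2 as follows: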
category_code         brand
-----------------------------
electronics.smart...  samsung
electronics.video.tv  samsung
electronics.audio     apple
computers.notebook    acer
electronics.clocks    casio

https://stackoverflow.com/questions/69726661
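
The two steps in the answer (maximum count per category_code, then a join back on both keys) can be checked without a Spark session. Below is a minimal plain-Python sketch of the same logic, not the PySpark code itself. It assumes the nine rows shown in the question and treats the truncated category_code strings as literal values; the helper name top_brands is made up for illustration.

```python
# Rows copied from the question's output. The "..." suffixes are the
# truncated strings exactly as displayed, not the real category codes.
rows = [
    ("samsung", "electronics.smart...", 1782386),
    ("apple", "electronics.smart...", 1649525),
    ("xiaomi", "electronics.smart...", 924383),
    ("huawei", "electronics.smart...", 477946),
    ("oppo", "electronics.smart...", 242022),
    ("samsung", "electronics.video.tv", 183988),
    ("apple", "electronics.audio...", 165277),
    ("acer", "computers.notebook", 154599),
    ("casio", "electronics.clocks", 141403),
]


def top_brands(rows):
    """Return (category_code, brand) pairs whose count equals the group maximum."""
    # Step 1: mirrors groupBy("category_code").agg(max("count"))
    max_count = {}
    for _brand, category, count in rows:
        if count > max_count.get(category, float("-inf")):
            max_count[category] = count
    # Step 2: mirrors the inner join on ["count", "category_code"];
    # every row that ties for the maximum is kept.
    return [
        (category, brand)
        for brand, category, count in rows
        if count == max_count[category]
    ]


result = top_brands(rows)
for category, brand in result:
    print(category, brand)
```

If two brands share the maximum count within a category, both rows come back, which is also what the join in the answer does; a further tie-breaking step would be needed to get exactly one brand per category.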