I have a posts DataFrame and an authors DataFrame.
I need to compute the total favCount of the post authors for each day.
posts_columns = ["postId", "authorId", "date"]
posts = [("1", "1", "2020-10-10"),
         ("2", "2", "2020-10-10"),
         ("3", "2", "2020-10-10"),
         ("4", "2", "2020-10-11"),
         ("5", "3", "2020-10-11"),
         ("6", "3", "2020-10-11")]
authors_columns = ["authorId", "favCount"]
authors = [
    ("1", "5"),
    ("2", "3"),
    ("3", "12")]
After an (inner) join of the posts and authors DataFrames on (posts.authorId = authors.authorId), I get this:
+----------+--------+---------------+
| date|authorId| favCount|
+----------+--------+---------------+
|2020-10-10| 1| 5|
|2020-10-10| 2| 3|
|2020-10-10| 2| 3|
|2020-10-11| 2| 3|
|2020-10-11| 3| 12|
|2020-10-11| 3| 12|
+----------+--------+---------------+
Now I want to sum the authors' favCount per day. The final result should look like this:
+----------+-------------+
| date|sum(favCount)|
+----------+-------------+
|2020-10-10| 8|
|2020-10-11| 15|
+----------+-------------+
On Oct 10 I have two authors (1 and 2) with a total favCount of 8 (5 + 3).
On Oct 11 I have two authors (2 and 3) with a total favCount of 15 (3 + 12).
P.S.: I don't want to count duplicated authors' favCount; each author's favCount should be counted only once per day.
P.S.(2): I'm using PySpark with DataFrames, but I don't mind answers in Pandas or even plain SQL.
Posted on 2021-09-16 03:29:16
Treat df1 as the posts DataFrame and df2 as the authors DataFrame:
result = df1.merge(df2, how='inner').drop_duplicates(subset=['date', 'authorId'])
final = result.groupby('date')['favCount'].sum()
Posted on 2021-09-16 05:41:32
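The merge/dedupe/groupby approach above can be sketched end to end with the sample data from the question. One caveat: favCount is stored as strings in the sample tuples, so it is cast to int before summing (otherwise pandas would concatenate the strings); the variable names here just mirror the question's data.

```python
import pandas as pd

# Build the sample DataFrames from the question
posts = pd.DataFrame(
    [("1", "1", "2020-10-10"), ("2", "2", "2020-10-10"),
     ("3", "2", "2020-10-10"), ("4", "2", "2020-10-11"),
     ("5", "3", "2020-10-11"), ("6", "3", "2020-10-11")],
    columns=["postId", "authorId", "date"])
authors = pd.DataFrame([("1", "5"), ("2", "3"), ("3", "12")],
                       columns=["authorId", "favCount"])

# Inner join on the shared authorId column,
# then keep each author at most once per day
joined = posts.merge(authors, on="authorId", how="inner")
deduped = joined.drop_duplicates(subset=["date", "authorId"])

# favCount is a string in the sample data; cast before summing
result = (deduped.assign(favCount=deduped["favCount"].astype(int))
                 .groupby("date")["favCount"].sum())
print(result)  # 2020-10-10 -> 8, 2020-10-11 -> 15
```

The drop_duplicates on ['date', 'authorId'] is what guarantees each author contributes at most once per day, matching the P.S. in the question.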
If you want to stay in Spark, you can try this (the dropDuplicates step ensures each author is counted only once per day, and the final groupBy is on date alone so you get one row per day).
Scala code:
df1.join(df2, Seq("authorId"), "inner").dropDuplicates("date", "authorId").groupBy("date").sum("favCount")
or Python:
df1.join(df2, ["authorId"], "inner").dropDuplicates(["date", "authorId"]).groupBy("date").sum("favCount")
https://stackoverflow.com/questions/69201630