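My data has four columns: chart type, song name, the song's position in the chart, and the date on which the song held that position. How can I find the total number of days each song spent at #1 on the chart? I want my result to look like: chart_type, song, days_at_number_one
First, I filter on chart position and keep only #1. What should I do next? Should I reduceByKey on the song, then reduce by chart type, and then count the records to get the total days at #1 for each song within each chart type?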
('top200', '501', '1', '2021-03-26T00:00:00.000+02:00')
('top200', '501', '1', '2021-03-27T00:00:00.000+02:00')
('top200', '501', '1', '2021-03-28T00:00:00.000+02:00')
('viral50', 'Gowtu', '1', '2017-03-17T00:00:00.000+02:00')
('viral50', 'Gowtu', '1', '2017-03-18T00:00:00.000+02:00')
('viral50', 'Gowtu', '1', '2017-03-19T00:00:00.000+02:00')
('top200', 'Lonely (with benny blanco)', '1', '2020-11-09T00:00:00.000+02:00')
('top200', 'Lonely (with benny blanco)', '1', '2020-11-10T00:00:00.000+02:00')
('top200', 'Lonely (with benny blanco)', '1', '2020-11-11T00:00:00.000+02:00')
Thanks
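Posted on 2022-07-17 15:24:52
If you want to do this with an RDD and need a count grouped by the first two elements (chart type and song), you can do the following.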
data_ls = [
('top200', '501', '1', '2021-03-26T00:00:00.000+02:00'),
('top200', '501', '1', '2021-03-27T00:00:00.000+02:00'),
('top200', '501', '1', '2021-03-28T00:00:00.000+02:00'),
('viral50', 'Gowtu', '1', '2017-03-17T00:00:00.000+02:00'),
('viral50', 'Gowtu', '1', '2017-03-18T00:00:00.000+02:00'),
('viral50', 'Gowtu', '1', '2017-03-19T00:00:00.000+02:00'),
('top200', 'Lonely (with benny blanco)', '1', '2020-11-09T00:00:00.000+02:00'),
('top200', 'Lonely (with benny blanco)', '1', '2020-11-10T00:00:00.000+02:00'),
('top200', 'Lonely (with benny blanco)', '1', '2020-11-11T00:00:00.000+02:00')
]
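from operator import add

# `spark` is assumed to be an existing SparkSession
data_rdd = spark.sparkContext.parallelize(data_ls)

# key each row by (chart_type, song), then sum a count of 1 per row
data_rdd. \
    map(lambda gk: ((gk[0], gk[1]), 1)). \
    reduceByKey(add). \
    collect()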
# [(('top200', '501'), 3),
# (('top200', 'Lonely (with benny blanco)'), 3),
# (('viral50', 'Gowtu'), 3)]
https://stackoverflow.com/questions/73012861
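The answer's sample data is already restricted to #1 rows, and its output is keyed by a (chart_type, song) tuple rather than the flat chart_type, song, days shape the question asks for. The sketch below shows one way the missing filter and flatten steps could look. It runs in plain Python, with `Counter` standing in for `reduceByKey(add)`, so each step can be checked without a cluster. The helper names `is_number_one`, `to_key_count`, and `flatten` are hypothetical, the two non-#1 rows are invented for illustration, and the count only equals days if each song has at most one row per chart per date.

```python
from collections import Counter

# Rows in the question's layout: (chart_type, song, position, date).
# The rows marked "hypothetical" are not from the original data; they are
# added only so the filter step has something to drop.
rows = [
    ('top200', '501', '1', '2021-03-26T00:00:00.000+02:00'),
    ('top200', '501', '1', '2021-03-27T00:00:00.000+02:00'),
    ('top200', '501', '2', '2021-03-29T00:00:00.000+02:00'),     # hypothetical
    ('viral50', 'Gowtu', '1', '2017-03-17T00:00:00.000+02:00'),
    ('viral50', 'Gowtu', '3', '2017-03-20T00:00:00.000+02:00'),  # hypothetical
]


def is_number_one(row):
    """Keep only rows at position '1' (positions are strings in this data)."""
    return row[2] == '1'


def to_key_count(row):
    """Key a row by (chart_type, song) and attach a count of 1."""
    return ((row[0], row[1]), 1)


def flatten(pair):
    """Turn ((chart_type, song), days) into (chart_type, song, days)."""
    (chart_type, song), days = pair
    return (chart_type, song, days)


# Plain-Python stand-in for reduceByKey(add): sum the counts per key.
counts = Counter()
for key, n in map(to_key_count, filter(is_number_one, rows)):
    counts[key] += n

# Sorted only to make the printed order deterministic.
result = sorted(flatten(item) for item in counts.items())
print(result)
```

Assuming the same helpers, the RDD version would chain them as data_rdd.filter(is_number_one).map(to_key_count).reduceByKey(add).map(flatten).collect(). The order of the collected results is not guaranteed, so sort them if a stable order matters.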