下面是我的RDD对象的前三个元素:
[('E7750A37CAB07D0DFF0AF7E3573AC141',
0.03333333333333333,
0.44,
1.0,
0.0,
0.0,
3.5),
('778C92B26AE78A9EBDF96B49C67E4007',
0.03333333333333333,
0.71,
1.0,
0.0,
1.0,
4.0),
('BE317B986700F63C43438482792C8654',
0.03333333333333333,
0.48,
1.0,
0.0,
0.0,
4.0)]我希望通过使用string元素(如'BE317B986700F63C43438482792C8654')进行分组,并添加其余的元素。我对火种很陌生。
发布于 2022-03-31 09:49:29
我们可以接受你的意见
input=[('E7750A37CAB07D0DFF0AF7E3573AC141',0.03333333333333333,0.44,1.0,0.0,0.0,3.5),('778C92B26AE78A9EBDF96B49C67E4007',0.03333333333333333,0.71,1.0,0.0,1.0,4.0),('BE317B986700F63C43438482792C8654',0.03333333333333333,0.48,1.0,0.0,0.0,4.0)]首先,可以使用reduceByKey()函数根据group键添加元素。
但是要使用它,我们必须创建一个PairRDD,它只是一个元组的RDD,其中第一个元素始终是关键元素(尽管您可以使用keyBy函数在您的RDD上更改它)。
首先,阅读输入:
input = sc.parallelize(input) #creating an RDD在我们的输入中,第一个元素是关键。现在,我们希望将每个数字与其相关联的键放在输入中。我们想要这样的东西:
('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.03333333333333333),('E7750A37CAB07D0DFF0AF7E3573AC141',0.44),('E7750A37CAB07D0DFF0AF7E3573AC141',1.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',3.5),('778C92B26AE78A9EBDF96B49C67E4007',0.0),('778C92B26AE78A9EBDF96B49C67E4007',0.03333333333333333),('778C92B26AE78A9EBDF96B49C67E4007',0.71),('778C92B26AE78A9EBDF96B49C67E4007',1.0),('778C92B26AE78A9EBDF96B49C67E4007',0.0),(‘778C92B26A98A9EB96B49C67E4007’,1.0),(‘778C92B26A9DF96B49C67E4007’,4.0),(‘BE317B986700F63C4344800F3448279C8654’,0.0),(‘BE317B9800F34482792C8654’,0.03333333333333333),(‘BE317B986700F63C434482792C8654’,0.48),(‘BE317B9800F3438482792C8654’,BE1.0),(‘BE317B9800F344848279C8654’,BE1.0),(‘BE317B9800F674C634482792C8654,0.48),(’BE317B9800F34484892C8654‘,0.48),(’BE317B9800F38484892C8654‘,BE317B9800F384848279C8654’,BE1.0),(‘BE317B9800F48482792C8654’,0.48),(‘BE317B9800F3438482792C8654,’BE317B9800F3438482792C8654‘,BE1.0),(’BE317B986700F676C3854,0.48),(‘BE317B9800F38482792C8654,0.48),(’BE3176800F38484892C8654‘,BE1),(’BE3179800F344862740.0),(‘BE3179800F34484892C8654’,BE1.0),(‘BE317B9800F38482792C8654,0.48),(’BE317B9800F38482792C8654‘,BE317179800F344848279854’,BE1.0),(‘
为此,我们可以使用lambda函数对每个元素(例如,RDD中的('E7750A37CAB07D0DFF0AF7E3573AC141',0.03333333333333333,0.44,1.0,0.0,0.0,3.5)) )进行迭代,并且在每个元素下面使用一个列表理解来迭代整数元素,例如,
(lambda x: [(x[0],y) for y in x])在列表理解中,我们不想用它本身创建x的元组。所以,如果其他的话,就用它来删除。
lambda x: [(x[0],y) if y != x[0] else (x[0],0.000) for y in x]现在,我们可以将其写为:
input2 = input.map(lambda x: [(x[0],y) if y != x[0] else (x[0],0.000) for y in x])
input2.collect()[('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.03333333333333333),('E7750A37CAB07D0DFF0AF7E3573AC141',0.44),('E7750A37CAB07D0DFF0AF7E3573AC141',1.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',0.0),('E7750A37CAB07D0DFF0AF7E3573AC141',3.5),('778C92B26AE78A9EBDF96B49C67E4007',0.0),('778C92B26AE78A9EBDF96B49C67E4007',0.03333333333333333),('778C92B26AE78A9EBDF96B49C67E4007',0.71),('778C92B26AE78A9EBDF96B49C67E4007',1.0),('778C92B26AE78A9EBDF96B49C67E4007',0.0),(‘778C92B26A98A9EB96B49C67E4007’,1.0),(‘778C92B26A9DF96B49C67E4007’,4.0),(‘BE317B986700F63C434484892C8654’,0.0),(‘BE317B9800F34482792C8654’,0.03333333333333333),(‘BE317B986700F63C434482792C8654’,0.48),(‘BE317B9800F3438482792C8654’,BE1.0),(‘BE317B9800F344848279C8654’,BE1.0),(‘BE317B9800F34482792C8654,0.48),(’BE317B9800F38482792C8654‘,0.48),(’BE317B9800F34484892C8654‘,BE317B9800F3848482792C8654,BE1.0),(’BE317B9800F34482792C8654,0.48),(‘BE317B9800F38482792C8654,0.48),(’BE317179800F3438482792C8654‘,BE1.0),(’BE317B980000F34482792C8654,0.48),(‘BE317B9800F38482792C8654,0.48),(’BE3176800F3838482792C8654,BE1.0),(‘BE3179800F38648C3854,0.48),(’BE3179800F38484892C8654,0.48),(‘BE3179800F3848482792C8654,BE1.0),(’BE317B9800F34484892C8654,BE1.0),(‘BE317B9800F34482792C8654,0.48
在上面的输出中,我们得到了列表列表,因此我们需要将其扁平化为一个列表。
input3 = input2.flatMap(lambda x: x)
input3.collect()我们可以将所有这些放在一行中,如下所示:
input2 = input.flatMap(lambda x: [(x[0],y) if y != x[0] else (x[0],0.000) for y in x])最后,使用reduceByKey:
from operator import add
finalOutput = input2.reduceByKey(add)
finalOutput.collect()(‘778C92B26AE78A9EB96B49C67E4007,6.743333333333333),(’BE317B986700F63C4343848279C8654,5.513333333333334),(‘E7750A37CAB07D0DFF0AF7E3573AC141,4.973333333333334)
希望我的回答对你有帮助!
https://stackoverflow.com/questions/69306771
复制相似问题