I'm a beginner with PySpark, so I can't work this out. I have an RDD that looks like this:
results = [('alice', 'charlie'), ('charlie', 'alice'), ('charlie', 'doris'), ('doris', 'charlie')]
result = sc.parallelize(results)
result.collect()
[('charlie', 'doris'),
('charlie', 'alice'),
('doris', 'charlie'),
('alice', 'charlie')]
I want to sort the key and value within each row, so the output looks like this:
[('charlie', 'doris'),
('alice', 'charlie'),
('charlie', 'doris'),
('alice', 'charlie')]
Posted on 2019-08-18 21:05:35
You need to convert the list of tuples into a list of lists, because tuples are an immutable data type in Python. After that, you can sort each nested list (I've shown how to sort it according to your requirement), and then convert the nested lists back into a list of tuples.
level1 = [('charlie', 'doris'),
          ('charlie', 'alice'),
          ('doris', 'charlie'),
          ('alice', 'charlie')]
# Convert each tuple to a list (tuples are immutable, lists are not)
level2 = list(map(list, level1))  # now a list of lists (nested lists)
# Sort the elements of each nested list
output = [sorted(x) for x in level2]
# Convert the list of lists back to a list of tuples
nested_lst_of_tuples = [tuple(l) for l in output]
print(nested_lst_of_tuples)
The code above gives the output:
[('alice', 'charlie'), ('alice', 'charlie'), ('charlie', 'doris'), ('alice', 'charlie')]
https://stackoverflow.com/questions/57544346
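Note that the answer above sorts a plain Python list on the driver. Since the data lives in an RDD, the same per-row sort can also be applied distributively with `map`. Here is a minimal sketch of that idea; the plain-Python core is runnable as-is, and the commented lines assume a live `SparkContext` named `sc` (as in the question):

```python
def sort_pair(pair):
    # sorted() accepts a tuple and returns a list, so wrap it back in tuple()
    return tuple(sorted(pair))

results = [('alice', 'charlie'), ('charlie', 'alice'),
           ('charlie', 'doris'), ('doris', 'charlie')]

# Plain-Python equivalent of the RDD transformation:
sorted_pairs = [sort_pair(p) for p in results]
print(sorted_pairs)
# [('alice', 'charlie'), ('alice', 'charlie'), ('charlie', 'doris'), ('charlie', 'doris')]

# With a SparkContext `sc`, the distributed version would be:
# rdd = sc.parallelize(results)
# rdd.map(sort_pair).collect()
```

This avoids the tuple-to-list round trip entirely, because `sorted()` already produces a fresh list without mutating the original tuple.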