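I have a pandas DataFrame like this:

loc_1                           loc_2
[mumbai, gujarat, sri lanka]    [chennai, UP]
[Goa, telangana]                [Kashmir, Goa, Rajkot]
NaN                             [Bihar, Orissa]

I want to create a new column that combines the two columns above. I did search for similar questions, but I ran into the following problem.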
When I do this:
data['locations'] = data['loc_1'] + data['loc_2']
Output
--------
loc_1                           loc_2                     locations
[mumbai, gujarat, sri lanka]    [chennai, UP]             [mumbai, gujarat, sri lanka, chennai, UP]
[Goa, telangana]                [Kashmir, Goa, Rajkot]    [Goa, telangana, Kashmir, Goa, Rajkot]
NaN                             [Bihar, Orissa]           NaN

Problem
As you can see above, the result contains duplicate values as well as a NaN. How can I avoid both?
Note
The original dataset contains values in list, str, and NaN form.
Dataset:
import numpy as np
import pandas as pd

loc = pd.DataFrame({
    'loc_1': [['mumbai', 'gujarat', 'sri lanka'], ['Goa', 'telangana'], np.nan],
    'loc_2': [['chennai', 'UP'], ['kashmir', 'goa', 'rajkot'], ['bihar', 'orissa']],
    'loc_3': ['Chennai', 'Bangalore', 'Vizag']
})

Answered 2022-01-07 07:05:54
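First, replace the NaN values (which are floats) with empty lists, then concatenate the two columns:
data['locations'] = data['loc_1'].apply(lambda x: [] if isinstance(x, float) else x) + data['loc_2']
Then pass each list through dict.fromkeys, which removes duplicates while keeping the original order:
data['locations'] = data['locations'].apply(lambda x: list(dict.fromkeys(x)))
If the order does not matter, you can use a set instead:
data['locations'] = data['locations'].apply(lambda x: list(set(x)))

Answered 2022-01-07 07:19:34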
If you call loc.fillna("", inplace=True) first, additions that involve missing values no longer produce NaN.
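One thing worth checking before relying on this: in plain Python, "" + ['a'] raises a TypeError, so an empty-string fill is a poor fit for cells that are later added to lists. Below is a minimal sketch of a variant for list-valued columns. The helpers to_list and combine_unique are hypothetical names, not part of this thread or of pandas, and including loc_3 is only meant to show how a bare str cell could be handled.

```python
import numpy as np
import pandas as pd

loc = pd.DataFrame({
    'loc_1': [['mumbai', 'gujarat', 'sri lanka'], ['Goa', 'telangana'], np.nan],
    'loc_2': [['chennai', 'UP'], ['kashmir', 'goa', 'rajkot'], ['bihar', 'orissa']],
    'loc_3': ['Chennai', 'Bangalore', 'Vizag'],
})


def to_list(value):
    """Hypothetical helper: normalize one cell to a list."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        return [value]  # a bare string becomes a one-element list
    return []           # anything else (NaN, None, ...) becomes an empty list


def combine_unique(cells):
    """Hypothetical helper: concatenate cells, dropping repeats in first-seen order."""
    merged = [item for cell in cells for item in to_list(cell)]
    return list(dict.fromkeys(merged))


# Build the combined column row by row; the comparison is case-sensitive,
# so 'chennai' and 'Chennai' are kept as separate entries.
loc['locations'] = pd.Series(
    [combine_unique(cells) for cells in zip(loc['loc_1'], loc['loc_2'], loc['loc_3'])],
    index=loc.index,
)
print(loc['locations'].tolist())
```

Normalizing every cell first means the concatenation step never sees a float or a str, which is what causes the NaN results in the question.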
To remove duplicates from a column that contains lists, use:
loc['locations'] = loc['locations'].apply(lambda locs: list(set(locs)))

Answered 2022-01-07 07:35:36
A list comprehension can be quite fast. Let's try:
data['location'] = [list(set([] if isinstance(x, float) else x).union(set(y))) for x, y in zip(data['loc_1'], data['loc_2'])]
                          loc_1                     loc_2      loc_3  \
0  [mumbai, gujarat, sri lanka]  [chennai, UP, sri lanka]    Chennai
1              [Goa, telangana]    [kashmir, goa, rajkot]  Bangalore
2                           NaN           [bihar, orissa]      Vizag

                                    location
0  [chennai, UP, gujarat, mumbai, sri lanka]
1     [telangana, rajkot, goa, kashmir, Goa]
2                            [bihar, orissa]

https://stackoverflow.com/questions/70617623
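Note that row 1 of the output above keeps both goa and Goa, because set membership is case-sensitive. If such spellings should count as one location, a possible sketch follows. The helper name unique_ignore_case is hypothetical, and keeping the first-seen spelling and order is an assumption about the desired result, not something stated in the question.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'loc_1': [['mumbai', 'gujarat', 'sri lanka'], ['Goa', 'telangana'], np.nan],
    'loc_2': [['chennai', 'UP', 'sri lanka'], ['kashmir', 'goa', 'rajkot'], ['bihar', 'orissa']],
})


def unique_ignore_case(items):
    """Hypothetical helper: drop case-insensitive repeats, keeping the first spelling."""
    seen = {}
    for item in items:
        # setdefault stores a value only the first time a key appears
        seen.setdefault(item.casefold(), item)
    return list(seen.values())


# Treat NaN (a float) in loc_1 as an empty list, then concatenate and de-duplicate
data['location'] = [
    unique_ignore_case(([] if isinstance(x, float) else x) + y)
    for x, y in zip(data['loc_1'], data['loc_2'])
]
print(data['location'].tolist())
```

Because a dict keeps insertion order, this variant also preserves the original ordering, unlike the set-based versions.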