I have two dataframes.
The first df consists of records, each with a list of terms that classify it, plus an 'Industry' column that I am trying to populate.

import pandas as pd

df1 = pd.DataFrame([['a', ['Elementary', 'Pre Schools', 'High Schools'], None], ['b', ['Museums'], None], ['c', ['Junior High Schools', 'Military - Misc.', 'Roads', 'Sewers and Water Mains'], None]], columns=['Record Name', 'Terms', 'Industry'])

df1 looks like:
Record Name Terms Industry
0 a [Elementary, Pre Schools, High Schools] None
1 b [Museums] None
2 c [Junior High Schools, Military - Misc., Roads, Sewers and Water Mains] None

The second df provides the list of terms that corresponds to each 'Industry'.

df2 = pd.DataFrame([['Arts', ['Libraries', 'Museums', 'Auditoriums', 'Entertainment']], ['Education', ['Elementary', 'Pre Schools', 'College', 'University', 'Junior High Schools', 'High Schools']]])

df2 looks like:
Industry Terms
0 Arts [Libraries, Museums, Auditoriums, Entertainment]
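1 Education [Elementary, Pre Schools, College, University, Junior High Schools, High Schools]

df2 contains 10 different industries, with roughly 100 terms spread across them. Not every term in df1 is used in df2 to classify an industry.
How can I assign the 'Industry' value in df1 based on any 'Terms' list elements that df1 and df2 have in common? Here is the desired output: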
Record Name Terms Industry
0 a [Elementary, Pre Schools, High Schools] Education
1 b [Museums] Arts
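2 c [Junior High Schools, Military - Misc., Roads, Sewers and Water Mains] Education

I have tried building one list per industry holding its corresponding terms, so that the assignment can be done with .isin():

df1.loc[df1['Terms'].isin(Education), 'Industry'] = 'Education'

But this only assigns the 'Industry' to records for which all 'Terms' appear in the 'Education' list. Also, since there are 10 different industries, defining a list and writing a separate assignment for each one would be tedious. I think evaluating the 'Terms' that the two dataframes have in common would be a cleaner approach.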
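Posted on 2020-10-28 23:19:04
I find it hard to work with lists inside dataframes, so I first explode the lists, then merge the dataframes, and finally put the lists back together.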
#This just helps me keep the columns straight
df2.columns = ['Industry', 'Terms']
# Drop Industry in df1 and explode the lists in df1 & df2
df1 = df1.drop('Industry', axis=1).explode('Terms')
df2 = df2.explode('Terms')
# Merge the terms back into df1
df1 = df1.merge(df2, how='left', on='Terms')
# Bring the lists back together
df1 = df1.groupby('Record Name').agg(lambda x: x.dropna().unique().tolist())

The result is:
Terms Industry
Record Name
a [Elementary, Pre Schools, High Schools] [Education]
b [Museums] [Arts]
c [Junior High Schools, Military - Misc., Roads,... [Education]

https://stackoverflow.com/questions/64575370
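Note that the result above leaves 'Industry' as a one-element list and moves 'Record Name' into the index, whereas the desired output in the question has a plain string per record. Below is a minimal sketch of one way to get that shape. It assumes each term belongs to exactly one industry and that a record should take the industry of its first matching term; neither assumption is stated in the question. The names `term_to_industry` and `first_industry` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Rebuild the two example frames from the question so the sketch is self-contained.
df1 = pd.DataFrame(
    [
        ['a', ['Elementary', 'Pre Schools', 'High Schools'], None],
        ['b', ['Museums'], None],
        ['c', ['Junior High Schools', 'Military - Misc.', 'Roads',
               'Sewers and Water Mains'], None],
    ],
    columns=['Record Name', 'Terms', 'Industry'],
)
df2 = pd.DataFrame(
    [
        ['Arts', ['Libraries', 'Museums', 'Auditoriums', 'Entertainment']],
        ['Education', ['Elementary', 'Pre Schools', 'College', 'University',
                       'Junior High Schools', 'High Schools']],
    ],
    columns=['Industry', 'Terms'],
)

# Flat lookup: term -> industry.
# Assumption: a term is listed under only one industry; if it were listed
# under several, the last one seen would win here.
term_to_industry = {
    term: industry
    for industry, terms in zip(df2['Industry'], df2['Terms'])
    for term in terms
}


def first_industry(terms):
    """Return the industry of the first term that has a mapping, else None."""
    for term in terms:
        if term in term_to_industry:
            return term_to_industry[term]
    return None


# Fill 'Industry' in place; 'Terms' and 'Record Name' stay as they were.
df1['Industry'] = df1['Terms'].apply(first_industry)
print(df1)
```

If a record could legitimately match several industries, the explode/merge approach in the answer is the safer starting point, since it keeps every match in the list instead of silently picking one.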