I have this list, which I convert to a DataFrame.
labels = ['Airport',
'Amusement',
'Bridge',
'Campus',
'Casino',
'Commercial',
'Concert',
'Convention',
'Education',
'Entertainment',
'Government',
'Hospital',
'Hotel',
'Library',
'Mall',
'Manufacturing',
'Museum',
'Residential',
'Retail',
'School',
'University',
'Theater',
'Tunnel',
'Warehouse']
labels = pd.DataFrame(labels, columns=['lookup'])
labels

I have this DataFrame.
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
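'Name':['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School']})

How can I clean the entries in df based on matches in labels? My 'labels' is perfectly clean, and my 'df' is very messy. I would like df to end up looking like this.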
df = pd.DataFrame({'Year':[2020, 2020, 2019, 2019, 2019],
'Name':['Airport', 'University', 'Library', 'Theater', 'School']})
df

Posted on 2021-06-25 02:40:02
You can use Series.str.extract (here, df['Name'].str.extract) together with NaN replacement:
labels = ['Airport', 'Amusement', 'Bridge', 'Campus', 'Casino', 'Commercial', 'Concert', 'Convention',
'Education', 'Entertainment', 'Government', 'Hospital', 'Hotel', 'Library', 'Mall', 'Manufacturing',
'Museum', 'Residential', 'Retail', 'School', 'University', 'Theater', 'Tunnel', 'Warehouse']
import pandas as pd
df = pd.DataFrame({
'Year': [2020, 2020, 2019, 2019, 2019, 1954],
'Name': ['Dulles_Airport', 'Syracuse_University', 'Reagan_Library', 'AMC Theater', 'Reagan High School', 'Shake, Rattle and Roll']
})
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")

The resulting DataFrame will be
   Year                    Name       Match
0  2020          Dulles_Airport     Airport
1  2020     Syracuse_University  University
2  2019          Reagan_Library     Library
3  2019             AMC Theater     Theater
4  2019      Reagan High School      School
5  1954  Shake, Rattle and Roll         NaN

If you want to keep the unmatched cells, do the following:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
df.loc[df['Match'].isnull(), 'Match'] = df['Name'][df['Match'].isnull()]

The resulting DataFrame will be
   Year                    Name                   Match
0  2020          Dulles_Airport                 Airport
1  2020     Syracuse_University              University
2  2019          Reagan_Library                 Library
3  2019             AMC Theater                 Theater
4  2019      Reagan High School                  School
5  1954  Shake, Rattle and Roll  Shake, Rattle and Roll

If you want to drop the unmatched rows, do the following:
df['Match'] = df['Name'].str.extract(f"({'|'.join(labels)})")
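# Restrict dropna to 'Match' so NaN values in other columns do not drop rows
df = df.dropna(subset=['Match'])

The resulting DataFrame will be
   Year                 Name       Match
0  2020       Dulles_Airport     Airport
1  2020  Syracuse_University  University
2  2019       Reagan_Library     Library
3  2019          AMC Theater     Theater
4  2019   Reagan High School      School

Posted on 2021-06-25 02:27:50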
This isn't the purest pandas answer, but you could write a function that checks a string against the list of labels and apply it to the Name column, i.e.
def clean_labels(name):
    labels = ['Airport','Amusement','Bridge','Campus',
              'Casino','Commercial','Concert','Convention',
              'Education','Entertainment','Government','Hospital',
              'Hotel','Library','Mall','Manufacturing','Museum',
              'Residential','Retail','School','University', 'Theater',
              'Tunnel','Warehouse']
    # Return the first label that occurs as a substring of name
    for item in labels:
        if item in name:
            return item

>>> df.Name.apply(clean_labels)
0       Airport
1    University
2       Library
3       Theater
4        School

I'm assuming there are no typos to deal with when comparing the strings here. For anything that doesn't match, the function returns None.
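
If the no-typos assumption is relaxed a little, for instance when names differ from the labels only in letter case, the regex approach can be adapted. The following is only a rough sketch, not code from either answer: the helper name extract_label is hypothetical, the lowercase 'syracuse_university' entry is an invented test case, and re.escape is added on the assumption that a label might one day contain regex metacharacters.

```python
import re
import pandas as pd

labels = ['Airport', 'Amusement', 'Bridge', 'Campus', 'Casino', 'Commercial', 'Concert', 'Convention',
          'Education', 'Entertainment', 'Government', 'Hospital', 'Hotel', 'Library', 'Mall', 'Manufacturing',
          'Museum', 'Residential', 'Retail', 'School', 'University', 'Theater', 'Tunnel', 'Warehouse']

df = pd.DataFrame({
    'Year': [2020, 2020, 2019, 2019, 2019, 1954],
    # 'syracuse_university' is deliberately lowercase (illustrative assumption)
    'Name': ['Dulles_Airport', 'syracuse_university', 'Reagan_Library',
             'AMC Theater', 'Reagan High School', 'Shake, Rattle and Roll'],
})

def extract_label(names, labels):
    """Hypothetical helper: canonical label found in each name, or NaN."""
    # Map each lowercased label back to its canonical spelling
    canonical = {label.lower(): label for label in labels}
    # Escape every label so it is matched literally, then join as alternatives
    pattern = '(' + '|'.join(re.escape(label) for label in labels) + ')'
    # expand=False returns a Series because the pattern has a single group
    found = names.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    return found.str.lower().map(canonical)

# Fall back to the original name where no label was found
df['Match'] = extract_label(df['Name'], labels).fillna(df['Name'])
print(df)
```

Using fillna(df['Name']) keeps the unmatched rows; leaving that call out gives NaN for them instead, which can then be dropped with dropna(subset=['Match']).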
https://stackoverflow.com/questions/68120852