I have 4 columns: BusinessID, NAME, BusinessID_y, and NAME_y. I want to match NAME against NAME_y with a 90% similarity score, and drop any row that scores below 90%.
Sample input:
df
BusinessID NAME BusinessID_y NAME_y
1013120869 MANOJ WANKHADE 1013404164 SLIMI
1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL
1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR
I'm not familiar with Python and don't know how to do this. Also, I have 500k records, so any other fast fuzzy-matching method would be great.
Posted on 2021-12-10 09:33:24
>>> import pandas as pd
>>> import rapidfuzz
>>> df['matching_ratio'] = df.apply(lambda x:rapidfuzz.fuzz.ratio(x.NAME, x.NAME_y), axis=1).to_list()
>>> df
BusinessID NAME BusinessID_y NAME_y matching_ratio
0 1013120869 MANOJ WANKHADE 1013404164 SLIMI 10.526316
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
2 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
3 1013120869 MANOJ WANKHADE 1013376009 PRATHMESH AGRAWAL 25.806452
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677
>>> df[df.matching_ratio > 26]  # change 26 to 90 to meet your requirement
BusinessID NAME BusinessID_y NAME_y matching_ratio
1 1013120869 MANOJ WANKHADE 1013831688 AMOL SHAHAKAR 44.444444
4 1013120869 MANOJ WANKHADE 1013478922 AMBRISH PANDRIKAR 38.709677
https://stackoverflow.com/questions/70302216
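For the 500k-row case, a row-wise `df.apply(..., axis=1)` can be slow. One possible alternative, offered as a sketch rather than a benchmarked solution, is to iterate over the two columns directly and pass `score_cutoff` to `rapidfuzz.fuzz.ratio`, which returns 0 for pairs that fall below the cutoff. The helper names `score` and `similarity_mask` are hypothetical, the `difflib` fallback is an assumption added so the snippet still runs without rapidfuzz installed, and missing values (NaN) are not handled.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz

    def score(a, b, cutoff):
        # With score_cutoff, rapidfuzz returns 0 when the score is below the cutoff.
        return fuzz.ratio(a, b, score_cutoff=cutoff)
except ImportError:
    def score(a, b, cutoff):
        # Standard-library fallback (assumption); scores may differ from rapidfuzz.
        return 100 * SequenceMatcher(None, a, b).ratio()


def similarity_mask(names, names_y, threshold=90):
    """Return one boolean per pair: True where the score is at least threshold."""
    return [score(a, b, threshold) >= threshold for a, b in zip(names, names_y)]


# Illustrative data only; the second and third pairs are made up for the demo.
names = ["MANOJ WANKHADE", "MANOJ WANKHADE", "MANOJ WANKHADE"]
names_y = ["SLIMI", "MANOJ WANKHADE", "MANOJ WANKHADEE"]
mask = similarity_mask(names, names_y)
print(mask)
```

With the DataFrame from the question, the same mask could be applied as `df = df[similarity_mask(df["NAME"], df["NAME_y"])]`, which keeps only the rows at or above the threshold.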