我想比较一下这个数据格式df1
Product Price
0 Waterproof Liner 40
1 Phone Tripod 50
2 Waterproof Pants 0
3 baby Kids play Mat 985
4 Hiking BACKPACKS 34
5 security Camera 160使用df2,如下所示:
Product Id
0 Home Security IP Camera 508760
1 Hiking Backpacks – Spring Products 287950
2 Waterproof Eyebrow Liner 678897
3 Waterproof Pants – Winter Product 987340
4 Baby Kids Water Play Mat – Summer Product 111500我想比较一下 df1中的产品列和Productdf2。为了找到产品的好id。如果有相似性< 80,它会将'Remove'放在ID字段NB:中,df1和df2中的产品列的文本不是100%匹配的,有人能帮我解决这个问题,或者如何使用模糊蜡笔来获得好的id?
这里是我的代码
import pandas as pd
from fuzzywuzzy import process
data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}
data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
'Id': [508760,287950,678897,987340,111500],}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
.tolist(), columns=['Product1',"match_comp", "Id"])我得到的是:
Product1 match_comp Id
0 Waterproof Eyebrow Liner 86 2
1 Waterproof Eyebrow Liner 50 2
2 Waterproof Pants – Winter Product 90 3
3 Baby Kids Water Play Mat – Summer Product 86 4
4 Hiking Backpacks – Spring Products 90 1
5 Home Security IP Camera 86 0预期的内容:
Product Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760发布于 2021-07-27 11:28:52
您可以创建一个包装器函数:
def extract(s):
name,score,_ = process.extractOne(s, df2["Product2"], score_cutoff=0)
if score < 80:
return 'Remove'
return df2.set_index('Product2').loc[name, 'Id']
df1['ID'] = df1["Product1"].apply(extract)产出:
Product1 Price ID
0 Waterproof Liner 40 678897
1 Phone Tripod 50 Remove
2 Waterproof Pants 0 987340
3 baby Kids play Mat 985 111500
4 Hiking BACKPACKS 34 287950
5 security Camera 160 508760注意:输出与您预期的不完全相同,您必须解释为什么要删除行4/5。
https://stackoverflow.com/questions/68543759
复制相似问题