
Pandas Series with mismatched values and NaN values

Stack Overflow user
Asked 2021-03-30 21:34:15
2 answers, 66 views, 0 followers, 2 votes

I have these two dictionaries:

import pandas as pd

dico = {
    'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha'],
    'Age': ['20', '18', '62', '73', '21', '20'],
    'Studies': ['Economics', 'Maths', 'Psychology', 'Medical', 'Cinema', 'CS'],
}
dico2 = {'Surname': ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4']}

dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)

I want to match the Surname column against the Name column and then attach it to dico, to get the following output:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Nan       73     Medical
4    Zadig  Nan       21      Cinema
5    Sacha  Nan       20          CS

and finally drop the rows where Surname is Nan:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology

Here is what I tried:

from fuzzywuzzy import fuzz

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if  ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx]) # obtain surname

dico['Surname'] = pd.Series(map_list) # add column
dico = dico[["Name", "Surname", "Age", "Studies"]] # reorder columns

#if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)

But when I print(dico), the output looks like this:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Patrice4  73     Medical
4    Zadig  Patrice4  21      Cinema
5    Sacha  Patrice4  20          CS

I don't understand why the rows after Patrice get a mismatched surname, when I expected "Nan" there.


2 Answers

Stack Overflow user

Answered 2021-03-30 21:54:48

You can do the following. Define the function:

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    # Candidate strings to match against
    s = df_2[key2].tolist()
    # For each value in key1, take the top `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    # Keep only matches scoring at least `threshold`, joined into one string
    df_1['Surname'] = m.apply(lambda x: ', '.join(i[0] for i in x if i[1] >= threshold))
    return df_1

Then run:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = fuzzy_merge(dico, dico2, 'Name', 'Surname',threshold=90, limit=2)

This returns:

      Name Age     Studies   Surname
0   Arthur  20   Economics   Arthur1
1    Henri  18       Maths    Henri2
2  Lisiane  62  Psychology  Lisiane3
3  Patrice  73     Medical  Patrice4
4    Zadig  21      Cinema          
5    Sacha  20          CS       
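
Note that the Surname column returned above holds empty strings, not NaN, for the unmatched names, so dropna alone would not remove those rows. A minimal sketch of one way to finish the job, assuming the empty string is the only "no match" marker. The small frame below is a hand-typed stand-in for the output above, not a fresh run of fuzzy_merge:

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the frame shown above (unmatched rows hold '')
df = pd.DataFrame({
    'Name': ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha'],
    'Surname': ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4', '', ''],
})

# Turn empty strings into real missing values, then drop those rows
cleaned = df.assign(Surname=df['Surname'].replace('', np.nan)).dropna(subset=['Surname'])
print(cleaned)
```

Filtering directly with df[df['Surname'] != ''] would likely work just as well; going through NaN simply keeps the result compatible with dropna, which is what the question asked for.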
Votes: 2

Stack Overflow user

Answered 2021-03-30 22:17:40

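Let's try pd.MultiIndex.from_product to create the combinations, then use zip and fuzz.ratio to assign scores, and some filtering to build our dictionary. Then we can use Series.map and DataFrame.dropna: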

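from fuzzywuzzy import fuzz

# All (Name, Surname) pairs
comb = pd.MultiIndex.from_product((dico['Name'], dico2['Surname']))
# Similarity score for each pair
scores = comb.map(lambda x: fuzz.ratio(*x))  # or fuzz.partial_ratio(*x)
# Keep pairs above the threshold as a Name -> Surname dict
d = dict(a for a, b in zip(comb, scores) if b > 90)  # change threshold
# Names without a match map to NaN and are dropped
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])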

print(out)

      Name Age     Studies   SurName
0   Arthur  20   Economics   Arthur1
1    Henri  18       Maths    Henri2
2  Lisiane  62  Psychology  Lisiane3
3  Patrice  73     Medical  Patrice4
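
For reference, the loop in the question fills every row because it always keeps the highest-scoring surname, however low that score is, and its final drop compares against the string "NaN", which never occurs in the column. A minimal, dependency-free sketch of the thresholded version follows. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio, so scores may differ from fuzzywuzzy's. The helper name best_match and the 0.9 cutoff are illustrative assumptions, not part of the answers above:

```python
from difflib import SequenceMatcher

names = ['Arthur', 'Henri', 'Lisiane', 'Patrice', 'Zadig', 'Sacha']
surnames = ['Arthur1', 'Henri2', 'Lisiane3', 'Patrice4']

def best_match(name, candidates, threshold=0.9):
    """Return the most similar candidate, or None if it scores below threshold."""
    # Stand-in scorer: SequenceMatcher.ratio() is in [0, 1], unlike fuzz.ratio's 0-100
    scored = [(SequenceMatcher(None, name, c).ratio(), c) for c in candidates]
    score, candidate = max(scored)
    # Without this threshold check, every name would receive some surname
    return candidate if score >= threshold else None

matches = {name: best_match(name, surnames) for name in names}
print(matches)
```

Mapping such a dictionary onto the Name column with Series.map should leave the unmatched names as missing values, which dropna(subset=[...]) can then remove.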
Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/66871976
