首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >快速模糊匹配合并

快速模糊匹配合并
EN

Stack Overflow用户
提问于 2020-10-15 04:15:08
回答 1查看 189关注 0票数 2

对此非常陌生,如果有以下建议,我将不胜感激:

我有一个数据集'Projects‘,显示具有项目ID的机构列表:

代码语言:javascript
复制
project_id  institution_name
0           somali national university
1           aarhus university
2           bath spa
3           aa school of architecture
4           actionaid uk

我想将其与以下数据集“大学”及其国家代码进行模糊匹配合并:

代码语言:javascript
复制
institution_name                      country_code
a tan kapuja buddhista foiskola             HU
aa school of architecture                   UK
bath spa university                         UK
aalto-yliopisto                             FI
aarhus universitet                          DK

然后找回这个:

代码语言:javascript
复制
project_id  institution_name           Match    organisation               country_code
0           somali national university []       NaN                        NaN
1           aarhus university          [(91)]   aarhus universitet         DK
2           bath spa                   [(90)]   bath spa university        UK
3           aa school of architecture  [(100)]  aa school of architecture  UK
4           actionaid uk               []       NaN                        NaN

使用rapidfuzz:

代码语言:javascript
复制
import pandas as pd

import numpy as np

from rapidfuzz import process, utils as fuzz_utils

def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    #   baseFrame: the left table to join
    #   compareFrame: the right table to join
    #   baseKey: key column of the left table
    #   compareKey: key column of the right table
    #   threshold: how close the matches should be to return a match, based on Levenshtein distance
    #   limit: the amount of matches that will get returned, these are sorted high to low
    #   return: dataframe with boths keys and matches
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}

    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
      fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1

    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisation'] = m2

    return baseFrame.merge(compareFrame, on=baseKey, how=how)

Merged = fuzzy_merge(Projects, Universities, 'institution_name', 'institution_name')

Merged

我得到了这个(在match列中有一些额外的文本,但现在不会深入讨论)。这几乎就是我想要的,但是国家代码只有在100%匹配时才会匹配:

代码语言:javascript
复制
project_id  institution_name           Match    organisation               country_code
0           somali national university []       NaN                        NaN
1           aarhus university          [(91)]   aarhus universitet         NaN
2           bath spa                   [(90)]   bath spa university        NaN
3           aa school of architecture  [(100)]  aa school of architecture  UK
4           actionaid uk               []       NaN                        NaN

我认为这是我如何将我的basekey与compareframe进行比较以创建我的合并数据集的问题。不过,我不知道如何在“组织”上返回它--尝试插入它会导致不同的错误。

EN

回答 1

Stack Overflow用户

发布于 2020-10-21 04:41:04

不要紧,我想明白了--我没有考虑到空单元格!用NaN替换它们效果很好。

代码语言:javascript
复制
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}

    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
      fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1

    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisations'] = m2.replace("",np.nan)

    return baseFrame.merge(compareFrame, left_on='organisations', right_on=compareKey, how=how)
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/64360880

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档