首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >熊猫模糊匹配

熊猫模糊匹配
EN

Stack Overflow用户
提问于 2021-08-15 13:56:47
回答 1查看 316关注 0票数 0

我想检查我的dataframe中的一列地址与另一个dataframe中的一列地址的准确性,看看它们是否匹配以及它们是否匹配。然而,似乎需要很长时间才能通过地址和执行计算。在我的主dataframe中有15000+地址,我的参考dataframe中有大约50个地址。它跑了5分钟,仍然没有完成。

我的代码是:

代码语言:javascript
复制
import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():

    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.toList()[0]

    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )

这是最佳的方式循环通过一个列在一个数据和模糊匹配它与另一个?我希望分数是一个比率,因为稍后我将输出一个excel文件,其中包含正确的值和背景颜色,以指示哪些值是错误的,哪些值发生了更改。

我不相信fuzzywuzzy有一种方法可以让你把指数、价值和配给分成一个元组--只是值和匹配比。

EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-08-15 22:10:06

希望下面的代码(与虚拟数据的链接)有助于显示什么是可能的。我尝试使用街道地址来模拟类似的情况,因此与您的数据集比较更容易;显然,它没有那么大。

您可以从注释中的链接中提取csv文本,并运行它,并查看在更大的示例中可以使用什么。

对于参考框架中的五个地址和其他帧中的100个联系人,其执行时间如下:

代码语言:javascript
复制
CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms

下面的代码应该比.iteritems()更快,等等。

代码语言:javascript
复制
# %%time

import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')

#  create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff = 60)
    if o != None:
        return o[1]
    
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}

names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring for a bit of interest.  
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
                                               (None, x['ref_address'], x['matched_address']).ratio(),axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically 
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)

# remember: correct_address and ref_address are copies 
# so just pick one to compare to matched_address

           correct_address              ref_address         matched_address  \
0  86 Nw 66th Street #8673  86 Nw 66th Street #8673     86 Nw 66th St #8673   
1   2737 Pistorio Rd #9230   2737 Pistorio Rd #9230  2737 Pistorio Rd #9230   
2       6649 N Blue Gum St       6649 N Blue Gum St      6649 N Blue Gum St   
3       59 n Groesbeck Hwy       59 n Groesbeck Hwy      59 N Groesbeck Hwy   
4         1 Central Avenue         1 Central Avenue           1 Central Ave   

   fuzzywuzzy_score  difflib_score  
0                90       0.904762  
1               100       1.000000  
2               100       1.000000  
3               100       0.944444  
4                90       0.896552  
票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/68792190

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档