文章/答案/技术大牛

发布

问熊猫模糊匹配
EN

Stack Overflow用户

提问于 2021-08-15 13:56:47

回答 1查看 316关注 0票数 0

我想检查我的dataframe中的一列地址与另一个dataframe中的一列地址的准确性，看看它们是否匹配以及它们是否匹配。然而，似乎需要很长时间才能通过地址和执行计算。在我的主dataframe中有15000+地址，我的参考dataframe中有大约50个地址。它跑了5分钟，仍然没有完成。

我的代码是：

import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():

    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.toList()[0]

    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )

这是最佳的方式循环通过一个列在一个数据和模糊匹配它与另一个？我希望分数是一个比率，因为稍后我将输出一个excel文件，其中包含正确的值和背景颜色，以指示哪些值是错误的，哪些值发生了更改。

我不相信fuzzywuzzy有一种方法可以让你把指数、价值和配给分成一个元组--只是值和匹配比。

python

pandas

dataframe

iteration

fuzzywuzzy

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-08-15 22:10:06

希望下面的代码(与虚拟数据的链接)有助于显示什么是可能的。我尝试使用街道地址来模拟类似的情况，因此与您的数据集比较更容易；显然，它没有那么大。

您可以从注释中的链接中提取csv文本，并运行它，并查看在更大的示例中可以使用什么。

对于参考框架中的五个地址和其他帧中的100个联系人，其执行时间如下：

CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms

下面的代码应该比.iteritems()更快，等等。

码

# %%time

import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')

#  create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff = 60)
    if o != None:
        return o[1]
    
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}

names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring for a bit of interest.  
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher\
                                               (None, x['ref_address'], x['matched_address']).ratio(),axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically 
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',\
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)

# remember: correct_address and ref_address are copies 
# so just pick one to compare to matched_address

           correct_address              ref_address         matched_address  \
0  86 Nw 66th Street #8673  86 Nw 66th Street #8673     86 Nw 66th St #8673   
1   2737 Pistorio Rd #9230   2737 Pistorio Rd #9230  2737 Pistorio Rd #9230   
2       6649 N Blue Gum St       6649 N Blue Gum St      6649 N Blue Gum St   
3       59 n Groesbeck Hwy       59 n Groesbeck Hwy      59 N Groesbeck Hwy   
4         1 Central Avenue         1 Central Avenue           1 Central Ave   

   fuzzywuzzy_score  difflib_score  
0                90       0.904762  
1               100       1.000000  
2               100       1.000000  
3               100       0.944444  
4                90       0.896552

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/68792190

复制

相似问题

问熊猫模糊匹配
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问熊猫模糊匹配EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问熊猫模糊匹配
EN