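I'm banging my head against the wall with this one. Rapidfuzz gives a different string similarity score depending on whether I run it inside a pandas DataFrame or on its own. Why do the Address Similarity 2 column and the last line give different results?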
from rapidfuzz import process, utils, fuzz
import pandas as pd
import numpy as np
address_a = 'high new technology development zones huainan city anhui province china anhui anhui any city'
address_b = 'industrial park of funan city'
test_anui_data = {'Processed Client Name': ['anhui jinhan clothing co ltd'], 'Processed Aruvio Name': ['anhui jinhan clothing co ltd'], 'Processed Client Address': [address_a], 'Processed Aruvio Address': [address_b], 'Name Similarity': [89.2857142857142], 'Address Similarity': [np.nan]}
# Create DataFrame
test_anui = pd.DataFrame(test_anui_data)
test_anui
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui['Processed Client Address']), str(test_anui['Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))
Posted on 2021-07-29 06:49:37
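The error comes from passing the whole column when you apply fuzz. Calling str() on a column converts the entire Series to its printed representation (index label, possibly truncated value, and the Name/dtype footer), so the strings being compared are not the addresses themselves. If you instead apply fuzz to a single row, as follows, you get the same result: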
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))
Alternatively, using .loc:
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))
The output in the DataFrame is:
Processed Client Name Processed Aruvio Name \
0 anhui jinhan clothing co ltd anhui jinhan clothing co ltd
Processed Client Address \
0 high new technology development zones huainan ...
Processed Aruvio Address Name Similarity Address Similarity \
0 industrial park of funan city 89.285714 NaN
Address Similarity 2
0 28.099174
The value of fuzz.token_sort_ratio(address_a, address_b) is 28.099173553719012.
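To make the cause concrete, here is a minimal sketch of what str() returns for a whole column compared with a single cell. The one-row DataFrame is a cut-down stand-in for the one in the question, and the variable names are only illustrative.

```python
import pandas as pd

address_a = ('high new technology development zones huainan city '
             'anhui province china anhui anhui any city')

# Cut-down stand-in for the DataFrame in the question (one column, one row).
df = pd.DataFrame({'Processed Client Address': [address_a]})

# str() on a Series gives its printed representation, not the cell contents:
# it carries the index label and a "Name: ..., dtype: ..." footer.
whole_column_text = str(df['Processed Client Address'])
print(repr(whole_column_text))

# Selecting one cell returns the actual address string.
single_value = df.at[0, 'Processed Client Address']
print(repr(single_value))
```

Any scorer fed the first string is comparing display text rather than the address, which is why the score drifts away from the standalone call.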
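In other words, you need to specify which row to take the strings from. I assume your DataFrame has several rows, which means you have to do the following for each row: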
# Iterate over the index labels and write each score into that row only
for i in test_anui.index:
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[i, 'Processed Client Address']),
                                                                     str(test_anui.loc[i, 'Processed Aruvio Address']))
https://stackoverflow.com/questions/68570948