
Is there a way to modify this code to reduce the runtime?
Stack Overflow user
Asked on 2021-07-22 10:56:35
2 answers · 650 views · 0 followers · Score 1

I'm looking to modify this code to reduce the runtime of the fuzzywuzzy matching. It currently takes about an hour on an 800-row dataset, and when I ran it on a 4.5K-row dataset it kept going for almost six hours with no result, so I had to stop the kernel.

I need this code to work on at least 20K rows of data. Can anyone suggest modifications to get results faster? Here is the code -

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz,process

df = pd.read_csv(r'path')
df.head()

data = df['Body']
print(data)

clean = []
threshold = 80 
for row in data:
  # score each sentence against each other
  # [('string', score),..]
  scores = process.extract(row, data, scorer=fuzz.token_set_ratio)
  # basic idea is if there is a close second match we want to evaluate 
  # and keep the longer of the two
  if scores[1][1] > threshold:
     clean.append(max([x[0] for x in scores[:2]],key=len))
  else:
     clean.append(scores[0][0])

# remove dupes
clean = set(clean)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean, columns=['Body'])

clean_data.to_csv(r'path') 

This is what my data looks like -

TgnHdoRf8P6gTEAkB3lQWEE/edit?usp=sharing

So if you look at rows 14 and 15, and rows 19 and 20, they are partial duplicates of each other. I want the code to identify such sentences and drop the shorter one.

Update -

I made a small change to the rapidfuzz solution given by @DarrylG, and the code now looks like this -

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

df = pd.read_excel(r'path')

data = df['Body']
print(data)

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)


clean_rapid = []
threshold = 80 

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process: lower-case, strip non-alphanumeric (generator)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        if len(scores) > 1 and scores[1][1] > threshold:
            m = max(scores[:2], key = lambda k:len(k[0]))                # Of up to two matches above threshold, takes longest
            clean_rapid.append(m[0])                                    # Saving the match index
        else:
            clean_rapid.append(query)

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe(r'path')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

# remove dupes
clean_rapid = set(clean_rapid)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean_rapid, columns=['Body'])

#exporting the cleaned data
clean_data.to_excel(r'path')

The problem now is that in the output file all the full stops and other punctuation have been removed. How can I keep them?


2 Answers

Stack Overflow user

Accepted answer

Answered on 2021-07-29 11:24:21

This addresses the second part of your question. processed_data holds the preprocessed strings, so your query is already preprocessed. By default this preprocessing is done inside process.extract; DarrylG moved it in front of the loop so the strings are not preprocessed multiple times. If you want to compare the strings without preprocessing, you can iterate over the original data directly. Change:

series = (rapid_utils.default_process(d) for d in data)
processed_data = pd.Series(series)   

for query in processed_data:

to:
for query in data:

If you want the original matching behavior, but with the unprocessed strings in the results, you can use the index stored in the match result to look up the unprocessed string.

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold,
            limit=2)
        m = max(scores[:2], key = lambda k:len(k[0]))
        clean_rapid.append(data[m[2]])

There are a couple of further performance improvements:

  1. You can make sure the current query is never matched against itself by replacing it with None in processed_data, and then use process.extractOne to find the best match above the threshold. This is at least as fast as process.extract and may be significantly faster.
  2. Each element of processed_data is compared with every element of processed_data. That means you always perform both comparisons data[n] <-> data[m] and data[m] <-> data[n], even though they are guaranteed to give the same result. Performing each comparison only once saves around 50% of the runtime.
def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for idx, query in enumerate(processed_data):
        # None is skipped by process.extract/extractOne, so it will never be part of the results
        processed_data[idx] = None
        match = process_rapid.extractOne(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold)
        # compare the length using the original strings
        # alternatively len(match[0]) > len(query)
        # if you do want to compare the length of the processed version
        if match and len(data[match[2]]) > len(data[idx]):
            clean_rapid.append(data[match[2]])
        else:
            clean_rapid.append(data[idx])
Score 1

Stack Overflow user

Answered on 2021-07-23 04:55:14

This approach is based on RapidFuzz, following an answer to "Vectorizing or Speeding up Fuzzywuzzy String Matching on a Pandas Column".

Results

  • OP's fuzzywuzzy approach: 2565.7 seconds
  • RapidFuzz approach: 649.5 seconds

So: roughly a 4x improvement

RapidFuzz implementation

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process: lower-case, strip non-alphanumeric (generator)
    processed_data = pd.Series(series)   

    clean_rapid = []
    threshold = 80 
    n = 0
    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        
        m = max(scores[:2], key = lambda k:len(k[0]))                # Of up to two matches above threshold, takes longest
        clean_rapid.append(m[-1])                                    # Saving the match index
        
    clean_rapid = set(clean_rapid)                                   # remove duplicate indexes

    return data[clean_rapid]                                         # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe('Duplicates1.xlsx')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

Version of the posted code used for comparison

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import openpyxl
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_fuzzy_wuzzy(data):
    clean = []
    threshold = 80 
   
    for idx, query in enumerate(data):
        # score each sentence against each other
        # [('string', score),..]
        scores = process.extract(query, data, scorer=fuzz.token_set_ratio)
        # basic idea is if there is a close second match we want to evaluate 
        # and keep the longer of the two
        if len(scores) > 1 and scores[1][1] > threshold:    # If second one is close
            m = max(scores[:2], key=lambda k:len(k[0]))
            clean.append(m[-1])
        else:
            clean.append(idx)

    # remove duplicates
    clean = set(clean)
    return data[clean]                                        # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
# Get DataFrame for sheet from Excel
df = excel_sheet_to_dataframe('Duplicates1.xlsx')  

# Will Process data in 'body' column of DataFrame
data = df['Body'].dropna()                                    # Dropping None rows (few None rows at end after Excel import)

# Process Data (Pandas Series)
result_fuzzy_wuzzy = process_fuzzy_wuzzy(data)
print(f'Elapsed time {time.time() - t0}')
Score 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68483600
