Fuzzy matching in long sentences

Stack Overflow user
Asked 2021-06-01 23:16:05
Answers: 2, Views: 94, Followers: 0, Votes: 1

Suppose I have the following DataFrame:

ID       CompanyName         JobDescription
1        Green Grass LLC     "In the centre of Green Grass area..."
2        Johnny Inc.         "Johnny is currently looking for data analist that..."
3        Liamloy             "LiamLoy Corp. is established in New York..."
4        KaasKan             "In the forest we are walking..."

My main goal is to remove the CompanyName from each JobDescription. The desired output would be:

ID       CompanyName         JobDescription
1        Green Grass LLC     "In the centre of area..."
2        Johnny Inc.         "is currently looking for data analist that..."
3        Liamloy             "is established in New York..."
4        KaasKan             "In the forest we are walking"

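I tried to word-tokenize the JobDescription (split the sentences into words) and apply fuzzy matching to detect and remove the matches. However, this was not very successful. For example, when the third JobDescription is tokenized, "Liamloy" is compared against "LiamLoy" and "Corp." separately. Perhaps this approach is not ideal; at this point I don't know. I'm wondering whether any of you would be willing to share your view and tell me how I can successfully remove the CompanyName from each JobDescription.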
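For readers who want to picture the tokenize-and-fuzzy-match approach described above, here is a minimal sketch. It is not the asker's actual code: the function name `remove_fuzzy_name`, the sliding-window strategy, and the 0.85 threshold are all assumptions made for illustration, and it uses only the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def remove_fuzzy_name(name, text, threshold=0.85):
    """Drop the run of tokens in `text` that best matches `name`, if it is close enough.

    Hypothetical helper: the window width and the threshold are illustrative choices.
    """
    tokens = text.split()
    width = len(name.split())  # compare windows with as many words as the name
    best_score, best_start = 0.0, None
    for start in range(len(tokens) - width + 1):
        window = " ".join(tokens[start:start + width])
        score = SequenceMatcher(None, name.lower(), window.lower()).ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_start is None or best_score < threshold:
        return text  # nothing similar enough, leave the text untouched
    return " ".join(tokens[:best_start] + tokens[best_start + width:])


print(remove_fuzzy_name("Liamloy", "LiamLoy Corp. is established in New York..."))
print(remove_fuzzy_name("KaasKan", "In the forest we are walking..."))
```

The first call drops "LiamLoy" but leaves "Corp." behind, and names that carry a legal suffix the description omits (such as "Green Grass LLC") tend to score below the threshold, which is essentially the difficulty the question describes.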


2 Answers

Stack Overflow user

Answered 2021-06-07 21:39:42

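If you don't expect the words in the company name to be swapped around, I'd suggest using the built-in Python library difflib to find the parts the two strings have in common and replace them with a mask.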

import difflib


def find_matching_spans(a, b, min_match=3, max_mismatch=1):
    """ Find the spans in the string b that are similar to the string a"""
    prev_match = 0
    match = 0
    mismatch = 0
    i = 0
    span_start = 0
    prev_start = 0
    span_end = 0
    spans = []
    common = []
    
    def add_span():
        if prev_match > min_match:
            if spans and spans[-1][-1] >= prev_start - 2:
                spans[-1][-1] = span_end
            else:
                spans.append([prev_start, span_end])
    
    for item in difflib.ndiff(a.lower(), b.lower()):
        if item[0] == ' ' and item[2] != ' ':
            match += 1
            mismatch = 0
            if match == 1:
                span_start = i
                common = []
            common.append(item[2])
        elif item[0] == '+' or item[2] == ' ':
            if match > min_match:
                add_span()
                prev_start = span_start
                prev_match = match
                span_end = i
            match = 0
            mismatch += 1
            if mismatch > max_mismatch:
                add_span()
                prev_match = 0
        elif item[0] == '-':
            pass
        if item[0] in {' ', '+'}:
            i += 1
    return spans


def replace_spans(text, spans, replacement):
    spans = [[0, 0]] + spans + [[len(text), len(text)]]
    parts = []
    for i in range(1, len(spans)):
        parts.append(text[spans[i-1][1]:spans[i][0]])
        if i < len(spans) - 1:
            parts.append(replacement)  # use the caller's mask instead of a hard-coded 'XXX'
    return ''.join(parts)


def replace_name(a, b, replacement='XXX'):
    b_prev = None
    while b_prev != b:
        spans = find_matching_spans(a, b)
        b_prev = b
        b = replace_spans(b, spans, replacement)
    return b

It works like this:

print(replace_name("Green Grass LLC", "In the centre of Green Grass area..."))
print(replace_name("Johnny Inc.", "Johnny is currently looking for data analist that..."))
print(replace_name("Liamloy", "LiamLoy Corp. is established in New York..."))
print(replace_name("KaasKan", "In the forest we are walking..."))

and produces the output

In the centre of XXX area...
XXX is currently looking for data analist that...
XXX Corp. is established in New York...
In the forest we are walking...
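
For comparison, the same masking idea can be sketched at word level rather than character level. This is not the algorithm above: it is a simplified variant that assumes whitespace tokenization and exact (case-insensitive) word matches, so it will not tolerate a typo inside a word or punctuation attached to the name. The function name `mask_name_tokens` is made up for this example; `SequenceMatcher.find_longest_match` is part of the standard difflib module.

```python
from difflib import SequenceMatcher


def mask_name_tokens(name, text, replacement="XXX"):
    """Mask the longest run of words that `text` shares with `name`.

    Hypothetical helper: words are compared case-insensitively but otherwise exactly.
    """
    name_tokens = name.lower().split()
    text_tokens = text.split()
    lowered = [token.lower() for token in text_tokens]
    matcher = SequenceMatcher(None, name_tokens, lowered, autojunk=False)
    match = matcher.find_longest_match(0, len(name_tokens), 0, len(lowered))
    if match.size == 0:
        return text  # no shared word at all
    masked = text_tokens[:match.b] + [replacement] + text_tokens[match.b + match.size:]
    return " ".join(masked)


print(mask_name_tokens("Green Grass LLC", "In the centre of Green Grass area..."))
print(mask_name_tokens("Liamloy", "LiamLoy Corp. is established in New York..."))
```

Working on whole words avoids accidentally swallowing the first letter of the next word (for example the "i" shared by "Johnny Inc." and "Johnny is"), at the cost of losing the character-level tolerance that the ndiff-based version provides.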
Votes: 1

Stack Overflow user

Answered 2021-06-10 13:52:30

Why not use a regex?

import re


def replace_company_name(company_name, text):
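    # re.escape keeps characters such as "." in the name from acting as regex metacharacters
    sanitized_text = re.sub(re.escape(company_name), '', text)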
    return sanitized_text

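Given the Liamloy example, it sounds like you also need to account for company-name postfixes such as Corp.

One way to handle this is to use a constant set of common company-name postfixes. Note also that I've used the ignore-case flag: in the LiamLoy row the company name is Liamloy, while the job description has LiamLoy. The capitalisation of the postfix may vary as well (INC, Inc, inc, etc.).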

COMPANY_NAME_POSTFIXES = '|'.join(['INC', 'CORP', 'LLC', 'LTD'])


def replace_company_name(company_name, text):

    # 1. Strip any postfix from the company name, e.g. "Green Grass LLC." -> "Green Grass "
    company_name_postfix_regex = rf'({COMPANY_NAME_POSTFIXES})?\.?'
    sanitized_company_name = re.sub(company_name_postfix_regex, '', company_name, flags=re.IGNORECASE)
    # 2. Replace any instance of the sanitized company name, optionally followed by a space and a company name postfix
    search_string = rf'{sanitized_company_name}\s?{company_name_postfix_regex}'
    sanitized_text = re.sub(search_string, '', text, flags=re.IGNORECASE)
    return sanitized_text

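The approach above has a side effect: it also removes the words where they appear in the text without being used as the company name. For example, with the company name Green Grass LLC, "In the centre of the Green Grass area there is a lot of green grass being looked after" becomes "In the centre of the area there is a lot of being looked after".

If this side effect is not acceptable, you'll need to match the job description against the capitalised form of the company name only, or compute and pass in an array of company names.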
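
One possible reading of the "capitalised form" suggestion is sketched below. It is an illustration rather than the answer's own code: the function name `remove_capitalised_name`, the rule "drop a match only if every matched word starts with an upper-case letter", and the use of word boundaries are all assumptions. The postfix list is the same illustrative one used earlier.

```python
import re

SUFFIXES = ("INC", "CORP", "LLC", "LTD")  # illustrative list, extend as needed
SUFFIX_RE = r"(?:%s)\b\.?" % "|".join(SUFFIXES)


def remove_capitalised_name(company_name, text):
    """Remove the company name only where it is written like a proper noun."""
    # Strip a trailing legal postfix from the name, e.g. "Green Grass LLC" -> "Green Grass".
    core = re.sub(r"\s+" + SUFFIX_RE + r"\s*$", "", company_name, flags=re.IGNORECASE).strip()
    # Match the core name, an optional postfix, and any whitespace that follows.
    pattern = re.compile(
        r"\b" + re.escape(core) + r"\b(?:\s+" + SUFFIX_RE + r")?\s*",
        flags=re.IGNORECASE,
    )

    def keep_or_drop(match):
        words = match.group(0).split()
        # Drop the match only if every matched word starts with an upper-case letter.
        return "" if all(word[0].isupper() for word in words) else match.group(0)

    return pattern.sub(keep_or_drop, text)


print(remove_capitalised_name(
    "Green Grass LLC",
    "In the centre of the Green Grass area there is a lot of green grass being looked after",
))
```

With this heuristic the lower-case "green grass" survives while "Green Grass" is removed. It cannot tell a company name from ordinary words that happen to be capitalised in the same way, so it reduces the side effect rather than eliminating it.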

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/67791573
