Suppose I have the following DataFrame:
ID CompanyName JobDescription
1 Green Grass LLC "In the centre of Green Grass area..."
2 Johnny Inc. "Johnny is currently looking for data analist that..."
3 Liamloy "LiamLoy Corp. is established in New York..."
4 KaasKan "In the forest we are walking..."

My main goal is to remove the CompanyName from each JobDescription. The desired output would be:
ID CompanyName JobDescription
1 Green Grass LLC "In the centre of area..."
2 Johnny Inc. "is currently looking for data analist that..."
3 Liamloy "is established in New York..."
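4 KaasKan "In the forest we are walking"

I tried word-tokenizing the JobDescription (splitting the sentences into words) and applying fuzzy matching to detect and remove the matches. However, this was not very successful. For example, when the third JobDescription is tokenized, "Liamloy" is compared against "LiamLoy" and "Corp." separately. Perhaps this approach is not ideal; at this point I don't know. I would like to know whether any of you are willing to share your view and tell me how to reliably remove the CompanyName from each JobDescription.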
Posted on 2021-06-07 21:39:42
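If you do not expect the words of the company name to appear in a different order, I suggest using Python's built-in difflib library to find the parts the two strings have in common and replace them with a mask.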
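To show the core idea in isolation before the full function below, here is a minimal sketch that masks only the single longest run of characters shared by the two strings. It uses `difflib.SequenceMatcher.find_longest_match` rather than `ndiff`; the helper name `mask_longest_common` and the `min_len` threshold are assumptions made for illustration and are not part of the answer.

```python
import difflib


def mask_longest_common(name, text, mask="XXX", min_len=4):
    """Mask the longest run of characters that `text` shares with `name`.

    Hypothetical helper: compares case-insensitively and assumes that
    lower() does not change the string length (true for ASCII text).
    """
    matcher = difflib.SequenceMatcher(None, name.lower(), text.lower(), autojunk=False)
    m = matcher.find_longest_match(0, len(name), 0, len(text))
    matched = text[m.b:m.b + m.size]
    # Trim surrounding whitespace so the mask does not swallow word separators.
    start = m.b + (len(matched) - len(matched.lstrip()))
    end = m.b + len(matched.rstrip())
    if end - start < min_len:
        return text  # nothing long enough to be treated as the company name
    return text[:start] + mask + text[end:]


print(mask_longest_common("Liamloy", "LiamLoy Corp. is established in New York..."))
print(mask_longest_common("Green Grass LLC", "In the centre of Green Grass area..."))
print(mask_longest_common("KaasKan", "In the forest we are walking..."))
```

This naive version over-matches in some cases: for "Johnny Inc." against "Johnny is currently...", the longest shared run is "Johnny i", so the "i" of "is" would be masked as well. The function below avoids this by treating spaces as breaks between matching runs.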
import difflib


def find_matching_spans(a, b, min_match=3, max_mismatch=1):
    """Find the spans in the string b that are similar to the string a."""
    prev_match = 0
    match = 0
    mismatch = 0
    i = 0  # current position in b
    span_start = 0
    prev_start = 0
    span_end = 0
    spans = []
    common = []

    def add_span():
        # Record the pending run, merging it into the previous span
        # when the two are at most two characters apart.
        if prev_match > min_match:
            if spans and spans[-1][-1] >= prev_start - 2:
                spans[-1][-1] = span_end
            else:
                spans.append([prev_start, span_end])

    # ndiff over two strings compares them character by character.
    for item in difflib.ndiff(a.lower(), b.lower()):
        if item[0] == ' ' and item[2] != ' ':
            # Character present in both strings (spaces do not count as matches).
            match += 1
            mismatch = 0
            if match == 1:
                span_start = i
                common = []
            common.append(item[2])
        elif item[0] == '+' or item[2] == ' ':
            # Character only in b, or a space: the current run ends here.
            if match > min_match:
                add_span()
                prev_start = span_start
                prev_match = match
                span_end = i
            match = 0
            mismatch += 1
            if mismatch > max_mismatch:
                add_span()
                prev_match = 0
        elif item[0] == '-':
            # Character only in a: nothing to do.
            pass
        if item[0] in {' ', '+'}:
            i += 1
    return spans


def replace_spans(text, spans, replacement):
    # Sentinel spans at both ends make the slicing below uniform.
    spans = [[0, 0]] + spans + [[len(text), len(text)]]
    parts = []
    for i in range(1, len(spans)):
        parts.append(text[spans[i-1][1]:spans[i][0]])
        if i < len(spans) - 1:
            parts.append(replacement)
    return ''.join(parts)


def replace_name(a, b, replacement='XXX'):
    # Repeat until no further spans are found.
    b_prev = None
    while b_prev != b:
        spans = find_matching_spans(a, b)
        b_prev = b
        b = replace_spans(b, spans, replacement)
    return b

It works as follows:
print(replace_name("Green Grass LLC", "In the centre of Green Grass area..."))
print(replace_name("Johnny Inc.", "Johnny is currently looking for data analist that..."))
print(replace_name("Liamloy", "LiamLoy Corp. is established in New York..."))
print(replace_name("KaasKan", "In the forest we are walking..."))

and produces the output
In the centre of XXX area...
XXX is currently looking for data analist that...
XXX Corp. is established in New York...
In the forest we are walking...

Posted on 2021-06-10 13:52:30
Why not use a regex?
import re


def replace_company_name(company_name, text):
    sanitized_text = re.sub(company_name, '', text)
    return sanitized_text

Because of the Liamloy example, it sounds like you also need to account for company-name postfixes such as Corp.
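One way to handle this is to keep a constant set of common company-name postfixes. Note also that I use the ignore-case flag: in the LiamLoy row the company name is Liamloy, while the job description has LiamLoy. The postfix may be capitalised differently as well (INC, Inc, inc, and so on).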
COMPANY_NAME_POSTFIXES = '|'.join(['INC', 'CORP', 'LLC', 'LTD'])


def replace_company_name(company_name, text):
    # 1. Remove any postfix from the company name, e.g. "Green Grass LLC." -> "Green Grass "
    company_name_postfix_regex = rf'({COMPANY_NAME_POSTFIXES})?\.?'
    sanitized_company_name = re.sub(company_name_postfix_regex, '', company_name, flags=re.IGNORECASE)
    # 2. Remove every instance of the sanitized company name, optionally followed by a space and a postfix
    search_string = rf'{sanitized_company_name}\s?{company_name_postfix_regex}'
    sanitized_text = re.sub(search_string, '', text, flags=re.IGNORECASE)
    return sanitized_text

The approach above has a side effect: it also removes the words where they are not being used as the company name. For example, for Green Grass LLC, "In the centre of Green Grass area there is a lot of green grass that is taken care of" -> "In the centre of area there is a lot of that is taken care of".
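If this side effect is not wanted, you will need either to strip only the capitalised form of the company name from the job description, or to compute an array of company-name variants and pass it in.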
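The second option can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `name_variants` and `remove_names` and the `SUFFIXES` tuple are hypothetical, matching is deliberately case-sensitive so that lowercase "green grass" survives, and each name is passed through `re.escape` so that characters such as "." are matched literally.

```python
import re

# Assumed, non-exhaustive list of suffixes, for illustration only.
SUFFIXES = ("Inc", "Corp", "LLC", "Ltd")


def name_variants(company_name):
    """Return the name as given plus the name with a trailing suffix dropped."""
    suffix_re = r"\s+(?:%s)\.?$" % "|".join(SUFFIXES)
    base = re.sub(suffix_re, "", company_name, flags=re.IGNORECASE)
    return [company_name, base]


def remove_names(text, names):
    """Remove every listed name variant from text (case-sensitive)."""
    # Try longer variants first so "Green Grass LLC" wins over "Green Grass".
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(n) for n in ordered)
    # The lookarounds stop a variant from matching inside a longer word;
    # the optional \s also removes one whitespace character after the name.
    pattern = rf"(?<!\w)(?:{alternatives})(?!\w)\s?"
    return re.sub(pattern, "", text)


text = "In the centre of Green Grass area there is a lot of green grass"
print(remove_names(text, name_variants("Green Grass LLC")))
# Case-sensitive matching means spelling variants must be listed explicitly.
print(remove_names("LiamLoy Corp. is established in New York...",
                   ["Liamloy", "LiamLoy", "LiamLoy Corp."]))
```

The trade-off is that the caller has to supply every spelling that can occur in the descriptions (for example both "Liamloy" and "LiamLoy"), since nothing is matched fuzzily here.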
https://stackoverflow.com/questions/67791573