I have a string and a set of rules/mappings for what to replace and what not to replace.
For example:
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Other criteria:
What is the cleanest way to solve this in Python 3.x?
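Posted on 2020-10-18 16:26:59
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution calls match_fun whenever a match is found, and match_fun performs the replacement if and only if no "no-replace phrase" overlaps the current match. Let me know if you need more clarification or if you think something can be improved.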
import re

replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # Likewise; maps each key of replace_dict to its no-replace phrases
text = ...  # The text on input.

def match_fun(match: re.Match) -> str:
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        # Find every occurrence of the no-replace phrase in the text being processed
        for no_replace_match in re.finditer(r'\b' + re.escape(no_replace) + r'\b', text):
            # Keep the word unchanged if the phrase overlaps the current match in any way
            if no_replace_match.start() < match.end() and no_replace_match.end() > match.start():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + re.escape(replace) + r'\b')
    text = pattern.sub(match_fun, text)
Posted on 2020-10-15 12:42:04
Based on 解模's answer.
Update
Sorry, I missed the fact that only whole words should be replaced. I have updated my code and also generalized it into a function.
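import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A "whole word" is the token enclosed by quotes, punctuation or spaces
    delim = "[\"'.,:; ]"
    rx = delim + "(" + re.escape(replace_token) + ")" + delim
    # Start from the full sentence so that every replacement is kept
    out_sentence = sentence
    context_size = len(dont_replace)
    for m in re.finditer(rx, sentence):
        # Context of the size of the no-replace rule around the start of the match
        start = max(m.start(0) - context_size, 0)
        context = sentence[start:m.start(0) + context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = m.group().replace(replace_token, replace_with)
        # Then replace the word in the context found, so any special token like "" or . gets taken over and the context does not change
        replace_val = context.replace(m.group(), to_replace)
        # Finally replace the context found with the replacing context
        out_sentence = out_sentence.replace(context, replace_val)
    return out_sentence
finditer() with a regular expression is used to find all occurrences of the string and their positions (a regex is needed because we have to check whether the string is a whole word or embedded in a longer word). You may need to adjust rx to your own definition of a "whole word". Next, the context around each occurrence is taken, sized by the no_replace rule, and checked for the no_replace string. If the context does not contain it, the word is replaced with replace() inside the matched substring only, then that substring is replaced inside the context, and finally the context is replaced in the whole text. This way the string being replaced is almost unique, so no odd behavior should occur.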
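The same decision (replace a whole word unless a no-replace phrase is involved) can also be sketched in a single regex pass. This is a different technique from the code above, not a rewrite of it: the protected phrases and the replaceable words go into one alternation, and the callback returns protected phrases unchanged. The function name replace_with_exceptions is hypothetical, and `\b` is assumed to be an acceptable definition of a "whole word".

```python
import re

def replace_with_exceptions(text, replace_dictionary, no_replace_set):
    """Replace whole words from replace_dictionary, leaving protected phrases alone."""
    if not replace_dictionary:
        return text
    # Longest first within each group, protected phrases before replaceable words:
    # alternatives are tried left to right, so a protected phrase that starts at
    # the same position as a replaceable word is matched first.
    protected = sorted(no_replace_set, key=len, reverse=True)
    replaceable = sorted(replace_dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in protected + replaceable)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")

    def substitute(match):
        found = match.group()
        # A protected phrase is consumed by the match and returned unchanged.
        if found in no_replace_set:
            return found
        return replace_dictionary[found]

    return pattern.sub(substitute, text)


replace_dictionary = {"sentence": "processed_sentence"}
no_replace_set = {"example sentence"}
sen1 = "This is an example sentence that needs to be processed into a new sentence."
print(replace_with_exceptions(sen1, replace_dictionary, no_replace_set))
# This is an example sentence that needs to be processed into a new processed_sentence.
```

Because matches are consumed left to right, a protected phrase only shields words at or after its own start. A replaceable word that begins before a protected phrase and overlaps it would still be replaced, so this sketch fits rules like the ones in the question rather than arbitrary overlaps.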
With your examples (sen1 and sen2 being the two sentences from the question), replace_whole gives:
>>> replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
>>> replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
'This is an example sentence that needs to be processed into a new processed_sentence.'
https://stackoverflow.com/questions/64371185