Substring replacement based on replace and no-replace rules

Stack Overflow user
Asked on 2020-10-15 12:02:19
2 answers, 118 views, 0 followers, score 3

I have a string and rules/mappings for what to replace and what not to replace.

For example:

"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."

Replacement rules:

replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}

Result:

"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

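Other criteria:

  1. Replace only when the case matches exactly.
  2. Replace whole words only; punctuation should be ignored when matching but preserved after replacement.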
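
The two criteria can be illustrated with a short sketch. It assumes that "whole word" means delimited by the regex word boundary `\b`, which the question does not state explicitly, and the helper name `whole_word_pattern` is hypothetical.

```python
import re

def whole_word_pattern(word: str) -> re.Pattern:
    # Hypothetical helper: case-sensitive, whole-word matcher.
    # \b matches between a word character and a non-word character,
    # so adjacent punctuation is never part of the match.
    return re.compile(r"\b" + re.escape(word) + r"\b")

pattern = whole_word_pattern("sentence")

# Criterion 1: the case must match, so "Sentence" is left alone.
case_result = pattern.sub("processed_sentence", "Sentence one, sentence two.")

# Criterion 2: whole words only; the quotes and the period survive.
word_result = pattern.sub("processed_sentence", "A 'sentence' is not a sentencepiece.")

print(case_result)
print(word_result)
```

This covers only the matching rules; it does not yet handle the no-replace phrases.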

What would be the cleanest way to solve this in Python 3.x?


2 Answers

Stack Overflow user

Accepted answer

Posted on 2020-10-18 16:26:59

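After some research, this is what I consider the best and cleanest solution to my problem. The solution calls match_fun whenever a match is found, and match_fun performs the replacement if and only if no "no-replace phrase" overlaps the current match. Let me know if you need more clarification or think something can be improved.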

import re

replace_dict = ...     # Assumed to exist already: maps each word to its replacement.
no_replace_dict = ...  # Assumed to exist already: maps each word to its no-replace phrases.
text = ...             # The input text.


def match_fun(match: re.Match) -> str:
    """Return the replacement unless a no-replace phrase overlaps the match."""
    str_match: str = match.group()

    # No no-replace phrases are registered for this word: always replace.
    if str_match not in no_replace_dict:
        return replace_dict[str_match]

    for no_replace in no_replace_dict[str_match]:
        # match.string is the text being searched by the enclosing sub() call.
        no_replace_rx = r'\b' + re.escape(no_replace) + r'\b'
        for no_replace_match in re.finditer(no_replace_rx, match.string):
            # Two spans overlap when each one starts before the other ends.
            # This also covers a phrase that extends past both ends of the match.
            if no_replace_match.start() < match.end() and no_replace_match.end() > match.start():
                return str_match  # Keep the original word.

    return replace_dict[str_match]


for replace in replace_dict:
    # re.escape keeps regex metacharacters in the key from being interpreted.
    pattern = re.compile(r'\b' + re.escape(replace) + r'\b')
    text = pattern.sub(match_fun, text)
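
As a self-contained sketch of the same overlap idea, the snippet below applies it to the two sentences from the question. The wrapper name `replace_with_exclusions` is hypothetical, and the shape of `no_replace_dict` (each word mapped to a list of protected phrases) is an assumption inferred from how the code above indexes it.

```python
import re

def replace_with_exclusions(text, replace_dict, no_replace_dict):
    """Hypothetical wrapper: replace whole words unless a protected phrase overlaps."""

    def match_fun(match):
        word = match.group()
        for phrase in no_replace_dict.get(word, ()):
            phrase_rx = r"\b" + re.escape(phrase) + r"\b"
            for protected in re.finditer(phrase_rx, match.string):
                # Spans overlap when each one starts before the other ends.
                if protected.start() < match.end() and protected.end() > match.start():
                    return word  # Leave the word untouched.
        return replace_dict[word]

    for word in replace_dict:
        text = re.sub(r"\b" + re.escape(word) + r"\b", match_fun, text)
    return text

replace_dict = {"sentence": "processed_sentence"}
no_replace_dict = {"sentence": ["example sentence"]}  # assumed shape

sen1 = "This is an example sentence that needs to be processed into a new sentence."
sen2 = ("This is a second example sentence that shows how 'sentence' "
        "in 'sentencepiece' should not be replaced.")

out1 = replace_with_exclusions(sen1, replace_dict, no_replace_dict)
out2 = replace_with_exclusions(sen2, replace_dict, no_replace_dict)
print(out1)
print(out2)
```

Note that the words are processed one after another, so with several entries in `replace_dict` a later pattern could match text produced by an earlier replacement.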
Score: 0

Stack Overflow user

Posted on 2020-10-15 12:42:04

Based on 解模's answer

Update

Sorry, I missed the fact that only whole words should be replaced. I have updated my code and generalized it into a function.

import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    # A "whole word" is the token surrounded by quotes, punctuation, spaces,
    # or the start/end of the string. Adjust the character class as needed.
    rx = rf"(?:^|[\"'.,:; ])({re.escape(replace_token)})(?:$|[\"'.,:; ])"
    # Start from the input so it is returned unchanged when nothing is replaced.
    out_sentence = sentence
    context_size = len(dont_replace)
    for m in re.finditer(rx, sentence):
        found = m.group()
        # Context: the match plus len(dont_replace) characters on each side.
        context = sentence[max(0, m.start() - context_size):m.end() + context_size]
        if dont_replace in context:
            continue
        # First replace the word only in the substring found
        to_replace = found.replace(replace_token, replace_with)
        # Then replace the word in the context found, so any special token like "" or . gets taken over and the context does not change
        replace_val = context.replace(found, to_replace)
        # Finally replace the context found with the replacing context,
        # accumulating across matches instead of restarting from the input
        out_sentence = out_sentence.replace(context, replace_val)

    return out_sentence

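Using finditer(), a regular expression finds all occurrences of the string together with their positions (a regex is needed because we have to check whether the string is a whole word or embedded in some other word). You may need to adjust rx to your definition of a "whole word". Next, take the context around each occurrence, sized by the no_replace rule. Then check whether the context contains the no_replace string. If it does not, use replace() to replace the word within the match only, then replace the match within its context, and finally replace the context within the whole text. This way the replacement is almost unique, and no strange behavior should occur.

Using your examples, this results in: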

replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."

replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
Score: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/64371185