Q: Python regex replaces a string it should not match

Stack Overflow user

Asked 2017-04-22 15:05:20

Answers: 1, Views: 2.1K, Followers: 0, Votes: 1

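Update: this problem was caused by a bug in the regex module, which the developer fixed in commit be893e9.

If you run into a similar problem, update the regex module.

You need version 2017.04.23 or later.

See here for more information.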
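
To check whether an installation already includes the fix, a small sketch like the following may help. It assumes Python 3.8+ (for `importlib.metadata`) and that the installed distribution is named `regex`; the helper names `parse_version`, `has_fix` and `installed_regex_version` are made up for illustration.

```python
import re
from importlib import metadata

# First release that contains the fix, according to the update above.
FIXED_IN = (2017, 4, 23)


def parse_version(version):
    """Turn a dotted version string such as '2017.04.23' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def has_fix(version):
    """Return True if the given version string is 2017.04.23 or later."""
    return parse_version(version) >= FIXED_IN


def installed_regex_version():
    """Return the installed version of the 'regex' distribution, or None."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_regex_version()
    if version is None:
        print("regex is not installed")
    elif has_fix(version):
        print("regex %s already contains the fix" % version)
    else:
        print("regex %s is affected, please upgrade" % version)
```

The comparison is done on integer tuples, so a zero-padded component such as `04` compares the same as `4`.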

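Background: I am using a collection of regular expressions from a third-party text-to-speech engine (Text2Speech) to normalize input text before it is spoken.

For debugging purposes, I wrote the script below to see what effect the regex collection actually has on the input text.

My problem is that it replaces text that the regular expression does not match at all.

I have 3 files: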

regex_preview.py

#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()
    with codecs.open(input, "r+", "utf16") as g:
        content = g.read().replace(r'\\\\\"','"')

        # apply all regular expressions to content
        for line in reg_exen:
            line=line.strip()

            # skip comments
            if line == "" or line[0] == "#":
                pass
            else:
                # strip the enclosing quotes and split each line into pattern and substitute
                pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
                substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')

                print("\n'%s' ==> '%s'" % (pattern, substitute))

                print(content.strip())
                content = re.sub(pattern, substitute, content)
                print(content.strip())

english.lex - UTF-16 encoded

# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."

# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O

Text2Speach Regex Test.txt - UTF-16 encoded

“Erm….yes. Thank you for that.”
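
Each rule line in english.lex has the form `"pattern" "substitute"`, with `\"` standing for a literal quote. As an alternative to the two nested `re.sub` calls in the script, a single anchored expression could split a line into its two parts. This is only a sketch based on the three rule lines shown here. The name `parse_rule` is made up, and the real lexicon format may have cases it does not cover.

```python
import re

# Two double-quoted fields separated by one space. Inside a field, any
# backslash-escaped character (including \") is allowed.
RULE_LINE = re.compile(r'^"((?:[^"\\]|\\.)*)" "((?:[^"\\]|\\.)*)"$')


def parse_rule(line):
    """Return (pattern, substitute) for a rule line, or None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = RULE_LINE.match(line)
    if match is None:
        raise ValueError("not a rule line: %r" % line)
    # Undo the lexicon's quote escaping; other backslash sequences stay as-is
    # because they belong to the regex or to the replacement template.
    pattern, substitute = (part.replace('\\"', '"') for part in match.groups())
    return pattern, substitute


if __name__ == "__main__":
    print(parse_rule('"(…|—)" "..."'))
```

Raising on a malformed line, rather than silently producing a wrong pattern, makes it easier to rule out the parsing step when a substitution looks suspicious.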

Running the script produces this output, in which the last regular expression matches the content:

'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."

'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."

'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."
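
When a lexicon has many rules, it can be easier to print only the rules that actually fired. The sketch below uses `re.subn` from the standard library, which returns the number of substitutions alongside the new string. The function name `trace_rules` is made up, and the rules are hard-coded from the listing above. Because it runs on the standard `re` module, the third rule is expected not to fire here.

```python
import re


def trace_rules(content, rules):
    """Apply (pattern, substitute) pairs in order and record the ones that matched."""
    changes = []
    for pattern, substitute in rules:
        new_content, count = re.subn(pattern, substitute, content)
        if count:
            changes.append((pattern, count, content, new_content))
        content = new_content
    return content, changes


rules = [
    ("(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+", '"'),
    ("(…|—)", "..."),
    (r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})", r"\1-\2"),
]

result, changes = trace_rules("“Erm….yes. Thank you for that.”", rules)

for pattern, count, before, after in changes:
    print("%r matched %d time(s)" % (pattern, count))
    print("  before: %s" % before)
    print("  after:  %s" % after)
```

Swapping `re` for another engine in `trace_rules` would show at a glance which rule behaves differently between the two.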

What I have tried so far:

I created this snippet to reproduce the problem:

#!/usr/bin/env python

import re
import codecs

content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)

print(content)

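But this one actually does what it should, so I have no idea what is going on here.

I hope someone can point me in a direction for further investigation.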

python

regex

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-04-22 16:05:31

The original script uses the alternative regex module, not the standard library re module.

import regex as re

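In this case there are evidently some differences between the two. My guess is that it has to do with nested groups. The expression contains a capture group inside a non-capturing group, which is too much magic for my taste.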

import re     # standard library
import regex  # completely different implementation

content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. Thank you for that."
"-yes. Thank you for that."
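
For reference, here is a small sketch of what the stammer pattern is meant to do, run on the standard library `re` module. The sample strings and the name `collapse_stammer` are invented for illustration. The key point is that the second group starts with the backreference `\1`, so the final word has to begin with the same letters as the stammered fragment. "yes" does not begin with "Erm", so the test sentence should stay unchanged.

```python
import re

STAMMER = re.compile(r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})")


def collapse_stammer(text):
    """Collapse repeated stammered fragments into 'fragment-word'."""
    return STAMMER.sub(r"\1-\2", text)


samples = [
    "b-b-but wait",                        # fragment "b" repeated before "but"
    "th... this is odd",                   # fragment "th" before "this"
    '"Erm....yes. Thank you for that."',   # "yes" does not start with "Erm"
]

for sample in samples:
    print("%s  ==>  %s" % (sample, collapse_stammer(sample)))
```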

Votes: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/43560759
