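Update: This problem was caused by a bug in the regex module, which the developer fixed in commit be893e9.
If you run into a similar problem, update the regex module.
You need version 2017.04.23 or later.
See here for more information.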
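For anyone who wants to check whether their installation already includes the fix, here is a minimal sketch. It assumes Python 3.8+ (for `importlib.metadata`) and that the installed release uses the purely numeric, date-style version numbers quoted above. The names `FIXED_IN`, `parse_version` and `has_fix` are made up for this example.

```python
from importlib import metadata

# First release reported to contain the fix, per the update note above.
FIXED_IN = (2017, 4, 23)


def parse_version(text):
    """Turn a date-style version such as '2017.04.23' into a tuple of ints.

    Assumption: the first three dot-separated parts are plain integers.
    """
    return tuple(int(part) for part in text.split(".")[:3])


def has_fix(version_text):
    """Return True if the given version is at least FIXED_IN."""
    return parse_version(version_text) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("regex")
    except metadata.PackageNotFoundError:
        print("regex is not installed")
    else:
        verdict = "ok" if has_fix(installed) else "too old, please upgrade"
        print(installed, verdict)
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.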
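Background: I am using a collection of regular expressions from a third-party text-to-speech engine (Text2Speech) to normalize input text before it is spoken.
For debugging purposes, I wrote the script below to see what effect the regex collection actually has on the input text.
My problem is that it performs a replacement for a regular expression that does not match at all.
I have 3 files: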
regex_preview.py
#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()

with codecs.open(input, "r+", "utf16") as g:
    content = g.read().replace(r'\\\\\"','"')

# apply all regular expressions to content
for line in reg_exen:
    line=line.strip()
    # skip comments
    if line == "" or line[0] == "#":
        pass
    else:
        # remove " from lines and split them into pattern and substitute
        pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
        substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')
        print("\n'%s' ==> '%s'" % (pattern, substitute))
        print(content.strip())
        content = re.sub(pattern, substitute, content)
        print(content.strip())

english.lex (UTF-16 encoded)
# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."
# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O

Text2Speach Regex Test.txt (UTF-16 encoded)
“Erm….yes. Thank you for that.”

Running the script produces the following output, in which the last regular expression matches the content:
'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."
'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."
'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."

What I have tried so far:
I created this snippet to reproduce the problem:
#!/usr/bin/env python
import re
import codecs
content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)
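print(content)

But this actually does what it should. So I am at a loss as to what is happening here.
I hope someone can point me in a direction for further investigation.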
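One possible way to investigate further is to print exactly what a rule matched before it is substituted. The following is only a sketch: `trace_sub` is a hypothetical helper, and its `engine` parameter is assumed to be any module that offers re-compatible `finditer` and `sub` functions. It is shown here with the standard library `re` so that it runs without extra packages.

```python
import re


def trace_sub(engine, pattern, substitute, text):
    """Apply one rule and report what, if anything, it matched.

    `engine` is any module exposing re-compatible `finditer` and `sub`.
    """
    for match in engine.finditer(pattern, text):
        print("span=%r matched=%r groups=%r"
              % (match.span(), match.group(0), match.groups()))
    return engine.sub(pattern, substitute, text)


content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"

# With the standard library engine nothing matches, so nothing is
# reported and the text comes back unchanged.
result = trace_sub(re, pattern, r"\1-\2", content)
print(result)
```

Passing a different engine module as the first argument would show which span and which group values that engine reports for the same rule, which narrows down where an unexpected replacement comes from.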
Posted on 2017-04-22 16:05:31
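The original script uses the alternative regex module rather than the standard library re module.
import regex as re
There is apparently some difference between the two in this case. My guess is that it has something to do with nested groups. The expression contains a capturing group inside a non-capturing group, which is too much magic for my taste.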
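To make the nested-group construction more concrete, here is a small sketch using only the standard library `re`. The pattern is a simplified stand-in for the lexicon rule (dashes only, no lookbehind), and the sample strings are invented for illustration. It shows that a capturing group inside a repeated non-capturing group holds the value from the last repetition, which is what the backreference `\1` then has to match.

```python
import re

# Simplified stand-in for the stammer rule: one to three word characters
# followed by a dash, repeated, then a word starting with the captured text.
pattern = r"(?:(\w{1,3})-)+(\1\w{2,})"

# Group 1 is overwritten on every repetition; only the last value ("b")
# survives and is used by the backreference \1.
match = re.search(pattern, "a-b-but")
print(match.group(0), match.groups())

# Here the word after the dash does not start with the captured text,
# so the backreference cannot match and there is no match at all.
no_match = re.search(pattern, "Erm-yes")
print(no_match)
```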
import re # standard library
import regex # completely different implementation
content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:
"Erm....yes. Thank you for that."
"-yes. Thank you for that."

https://stackoverflow.com/questions/43560759