I have a simple function to tokenize words.
```python
import re

def tokenize(string):
    return re.split("(\W+)(?<!')", string, re.UNICODE)
```

In Python 2.7 it behaves like this:
```python
In [170]: tokenize('perché.')
Out[170]: ['perch', '\xc3\xa9.', '']
```

In Python 3.5.0 I get the following:
```python
In [6]: tokenize('perché.')
Out[6]: ['perché', '.', '']
```

The problem is that "é" should not be treated as a token character. I thought re.UNICODE would make \W work the way I intend?
How can I get the same behavior as Python 3.x under Python 2.x?
Posted 2015-11-01 18:05:25
You need to use Unicode strings, and note that the third positional argument of split is not flags, it is maxsplit:
```python
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings. If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list. If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
```

Example:
```python
#!coding:utf8
from __future__ import print_function
import re

def tokenize(string):
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize(u'perché.'))
```

Output:
```
C:\>py -2 test.py
[u'perch\xe9', u'.', u'']

C:\>py -3 test.py
['perché', '.', '']
```

https://stackoverflow.com/questions/33463503
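To see why the original call misbehaved, note that re.UNICODE is just an integer flag constant, so when it is passed as the third positional argument it is silently accepted as a maxsplit count rather than a flag. A minimal Python 3 sketch of the fix, passing the flag by keyword:

```python
import re

# re.UNICODE is an integer flag constant; passed as the third positional
# argument of re.split it would be interpreted as maxsplit, not flags.
print(int(re.UNICODE))

def tokenize(string):
    # Passing the flag by keyword keeps it out of the maxsplit slot.
    return re.split(r"(\W+)(?<!')", string, flags=re.UNICODE)

print(tokenize('perché.'))  # ['perché', '.', '']
```

In Python 3 strings are Unicode by default (and \W is Unicode-aware without the flag), so only the keyword argument is strictly needed there; under Python 2 both the u'' literal and flags=re.UNICODE matter.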