import regex as re

def tokenize(text):
    return re.findall(r'[\w-][-]*\p{L}[\w-]*', text)

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens = tokenize(text)
print("|".join(tokens))

My output looks like this:
let|defeat|the|SARS-coV-2|delta|variant|together|in
I would like the output below instead, without the hyphens:
|Let|s|defeat|the|SARS|CoV|Delta|variant|together|in
Posted on 2021-09-04 03:04:38
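You can simplify the regex pattern by just using re.split() on the characters you consider word separators, such as the apostrophe ', whitespace, the dash -, and so on.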
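As a minimal sketch of the splitting step in isolation: a character class without a quantifier splits on each separator individually, so adjacent separators leave empty strings in the result. The sample string is hypothetical, and the built-in re module is assumed here (this pattern should behave the same with the PyPI regex module).

```python
import re

# Hypothetical sample text, chosen to show adjacent separators.
sample = "well - known"

# One separator per split: adjacent separators leave empty strings behind.
single = re.split(r"['\s\-]", sample)

# Adding "+" treats a run of separators as one; the filter drops any empty
# strings that could remain at the start or end of the text.
merged = [part for part in re.split(r"['\s\-]+", sample) if part]

print(single)  # ['well', '', '', 'known']
print(merged)  # ['well', 'known']
```

Whether empty tokens matter depends on the input; the question's sample sentence has no adjacent separators, so it is unaffected.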
from itertools import filterfalse
import regex as re

def tokenize(text):
    # Split on every apostrophe, whitespace character, or hyphen
    splits = re.split(r"['\s\-]", text)
    # Drop tokens containing a digit; remove this line if you wish to include the digits
    splits = list(filterfalse(lambda value: re.search(r"\d", value), splits))
    if splits:
        splits[0] = splits[0].capitalize()
    return splits

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
tokens = tokenize(text)
print("|" + "|".join(tokens))  # Remove <"|" +> if you don't intend to put a "|" at the start.

Output:
|Let|s|defeat|the|SARS|coV|delta|variant|together|in

Posted on 2021-09-04 04:29:51
You can use re.sub to replace each run of non-letters with a pipe separator.
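Because this approach substitutes rather than extracts, a run of non-letters at the very start or end of the text also becomes a pipe. If that is unwanted, one option is to strip the edge pipes afterwards. A minimal sketch, where tokenize_trimmed and the first sample string are hypothetical and not part of the answer:

```python
import re

def tokenize_trimmed(text):
    # Hypothetical helper: same substitution, then drop the pipes left at the
    # edges by non-letter runs at the start or end of the text.
    return re.sub(r"[^A-Za-z]+", "|", text).strip("|")

print(tokenize_trimmed("2021: let's go!"))  # let|s|go
```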
import re

def tokenize(text):
    return re.sub(r"[^A-Za-z]+", "|", text)

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"
print(tokenize(text))
let|s|defeat|the|SARS|coV|delta|variant|together|in|

Posted on 2021-09-04 11:44:56
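You want to extract any sequence of word characters, provided the "word" contains at least one letter.
You can do this with the PyPI regex module (where \p{L} matches any letter) or with the built-in re module (where [^\W\d_] matches any letter).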
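To see why the [^\W\d_] class behaves like a letter class, here is a small check with the built-in re module. It is only a sketch: the sample characters and the sample sentence are hypothetical, and it assumes Python 3 str patterns, where \w is Unicode-aware by default.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which for typical text leaves just the letters.
letter = re.compile(r"[^\W\d_]")

# Hypothetical sample characters for illustration.
checks = {ch: bool(letter.fullmatch(ch)) for ch in ["a", "Z", "ß", "7", "_", "-"]}
print(checks)
# {'a': True, 'Z': True, 'ß': True, '7': False, '_': False, '-': False}

# Wrapped in \w*...\w*, it keeps only tokens with at least one letter.
tokens = re.findall(r"\w*[^\W\d_]\w*", "x_1 _ 42 straße")
print(tokens)  # ['x_1', 'straße']
```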
# Python PyPI regex:
import regex as re

def tokenize(text):
    return re.findall(r'\w*\p{L}\w*', text)

# Python built-in re:
import re

def tokenize(text):
    return re.findall(r'\w*[^\W\d_]\w*', text)

# Demo using the PyPI regex module
import regex as re

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"

def tokenize(text):
    return re.findall(r'\w*\p{L}\w*', text)

print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in

# Demo using the built-in re module
import re

text = "let's defeat the SARS-coV-2 delta variant together in 2021!"

def tokenize(text):
    return re.findall(r'\w*[^\W\d_]\w*', text)

print("|".join(tokenize(text)))
# => let|s|defeat|the|SARS|coV|delta|variant|together|in

https://stackoverflow.com/questions/69052038