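Given a list of glossary terms:
glossaries = ['USA', '34']
The goal is to split a string using the glossary terms as delimiters while keeping the terms themselves in the output. For example, given the string and glossary below, the _isolate_glossaries() function:
glossaries = ['USA', '34']
word = '1934USABUSA'
_isolate_glossaries(word, glossaries)
should produce:
['19', '34', 'USA', 'B', 'USA']
I tried: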
import re

def isolate_glossary(word, glossary):
    print(word, glossary)
    # Return the word unchanged if it equals the glossary term or does not contain it.
    if re.match('^{}$'.format(glossary), word) or not re.search(glossary, word):
        return [word]
    else:
        segments = re.split(r'({})'.format(glossary), word)
        # Drop the empty strings that re.split leaves next to matches at either end.
        return [segment for segment in segments if segment]

def _isolate_glossaries(word, glossaries):
    word_segments = [word]
    for gloss in glossaries:
        word_segments = [out_segment
                         for segment in word_segments
                         for out_segment in isolate_glossary(segment, gloss)]
    return word_segments
It works, but it looks overly complicated, with this many levels of looping and regex splitting going on. Is there a better way to split a string based on a glossary?
Answered on 2019-01-14 07:31:10
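To split a string on the items in a list, build the regex dynamically: join the items with the pipe character | and wrap the whole alternation in a capturing group (a non-capturing group would leave the matched items themselves out of the output):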
import re
segments = re.split('({})'.format('|'.join(glossaries)), word)
print([x for x in segments if x])  # filter out the empty strings
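The same idea can be wrapped in a function. This is only a minimal sketch: the name split_by_glossary is hypothetical, and escaping each term with re.escape and trying longer terms first are extra assumptions that the question does not ask for. They only matter when a term contains regex metacharacters or when one term is a prefix of another.

```python
import re

def split_by_glossary(word, glossaries):
    """Split `word` on any glossary term, keeping the terms as segments."""
    # Assumption: with no terms there is nothing to split on.
    if not glossaries:
        return [word]
    # Assumption: escape each term so regex metacharacters match literally,
    # and try longer terms first so 'USA' wins over an overlapping 'US'.
    terms = sorted(glossaries, key=len, reverse=True)
    pattern = '({})'.format('|'.join(re.escape(term) for term in terms))
    # The capturing group makes re.split keep the matched terms; empty strings
    # show up where two terms are adjacent or a term sits at either end.
    return [segment for segment in re.split(pattern, word) if segment]

print(split_by_glossary('1934USABUSA', ['USA', '34']))
# ['19', '34', 'USA', 'B', 'USA']
```

A single pass over the string replaces the nested loop over glossary terms, at the cost of building one larger pattern.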
https://stackoverflow.com/questions/54177043