
Q: Split a string using a list of strings as a pattern

Stack Overflow user

Asked 2014-08-20 19:31:58

2 answers, 2.3K views, 0 followers, 3 votes

Consider the input string:

mystr = "just some stupid string to illustrate my question"

and a list of strings that indicates where to split the input string:

splitters = ["some", "illustrate"]

The output should look like this:

result = ["just ", "some stupid string to ", "illustrate my question"]

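I wrote some code that implements the following approach. For each string in splitters, I find its occurrences in the input string and insert something that I know is not part of the input string (for example, this '!!'). Then I split the string on the substring I just inserted.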

for s in splitters:
    mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)

result = re.split('!!', mystr)

This solution looks ugly. Is there a nicer way to do it?

python

regex

split

2 Answers

Stack Overflow user

Accepted answer

Answered 2014-08-20 19:42:07

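Splitting with re.split always removes the matched string from the output (NB: this is not quite true, see the edit below). You therefore have to use a positive lookahead expression ((?=...)) to match without removing the match. However, re.split ignores empty matches (true of the Python versions current when this was written; from Python 3.7 on, re.split does split on empty matches), so a lookahead expression on its own does not work. Instead, you lose at least one character at each split (even trying to trick re with a "boundary" match (\b) does not work). If you do not mind losing one whitespace/non-word character at the end of each item (assuming you only split at non-word characters), you can use something like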

re.split(r"\W(?=some|illustrate)", mystr)

which gives

["just", "some stupid string to", "illustrate my question"]

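(note that the spaces after just and to are missing). You can then generate these regexes programmatically using str.join. Note that each split marker is escaped with re.escape, so that special characters in the items of splitters do not change the meaning of the regular expression in any undesired way (imagine, for example, a ) in one of the strings, which would otherwise cause a regex syntax error).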

the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
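
As a runnable sketch of the snippet above (Python 3; the names alternatives, pattern, pieces and pieces_lookahead_only are illustrative, not from the original answer): the first split consumes one non-word character per cut, while the second assumes Python 3.7 or later, where re.split also splits on empty matches, so a bare lookahead keeps every character.

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Escape each splitter so regex metacharacters are matched literally.
alternatives = "|".join(re.escape(s) for s in splitters)

# One non-word character in front of each splitter is consumed by the split.
pattern = r"\W(?={})".format(alternatives)
pieces = re.split(pattern, mystr)

# Assumption: Python 3.7+, where re.split also splits on empty matches,
# so the lookahead alone is enough and no character is lost.
pieces_lookahead_only = re.split(r"(?={})".format(alternatives), mystr)
```

Note that the lookahead-only form would produce a leading empty string if the input itself started with one of the splitters.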

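Edit (HT to https://stackoverflow.com/questions/25412996/split-a-string-using-a-list-of-strings-as-a-pattern#comment39642468_25413153): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters as separate items inserted into the list. Joining every two subsequent items then produces the desired list as well. You can then also drop the requirement of a non-word character by using (.) instead of \W: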

import itertools  # izip_longest is the Python 2 name; Python 3 calls it zip_longest

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]

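Since normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] contains the auxiliary characters. itertools.izip_longest then pairs each text item with the corresponding removed character, and pairs the last item (which has no counterpart among the removed characters) with fillvalue, i.e. ''. Each of these tuples is then joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions for this kind of thing). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.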
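
For reference, a minimal Python 3 sketch of the same recombination, using itertools.zip_longest. The names parts, texts, consumed and result are illustrative and not part of the original answer.

```python
import re
from itertools import zip_longest  # called izip_longest in Python 2

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# Capture the single character in front of each splitter so re.split returns it.
pattern = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
parts = re.split(pattern, mystr)

# parts alternates: text, captured character, text, captured character, ..., text
texts, consumed = parts[::2], parts[1::2]

# Re-attach each captured character to the text in front of it; the final
# text has no captured character, so it is paired with the empty fillvalue.
result = [t + c for t, c in zip_longest(texts, consumed, fillvalue="")]
```

One caveat of this form: without the re.DOTALL flag, . does not match a newline, so a splitter that directly follows a line break would not produce a cut.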

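This allows a further simplification of the regular expression: instead of using an auxiliary character, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):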

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

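Here, the slice indices on the_raw_split have swapped, because each captured splitter (the odd-indexed items) now has to be prepended to the text that follows it, rather than appended to the text in front of it. Also note the [""] + part, which is needed to pair the first item with "" so that the order comes out right.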
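
The capture-group approach can also be wrapped in a small function. The following is a sketch for Python 3 under a few assumptions: the name split_keeping_splitters is hypothetical, splitters is assumed to be non-empty and to contain no empty strings, and a plain zip over slices is used in place of izip_longest.

```python
import re


def split_keeping_splitters(text, splitters):
    """Cut text in front of every occurrence of a splitter, keeping the splitter."""
    pattern = "({})".format("|".join(re.escape(s) for s in splitters))
    raw = re.split(pattern, text)
    # raw alternates: text, splitter, text, splitter, ..., text (odd length).
    pieces = [raw[0]]
    for splitter, following in zip(raw[1::2], raw[2::2]):
        # Glue each captured splitter onto the text that follows it.
        pieces.append(splitter + following)
    return pieces


mystr = "just some stupid string to illustrate my question"
result = split_keeping_splitters(mystr, ["some", "illustrate"])
```

If the input starts with a splitter, the first element of the result is an empty string, which matches the behaviour of the one-liner above.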

(End of edit)

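Alternatively, you can (if you want) use str.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient):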

for s in splitters:
    mystr = mystr.replace(s, "!!" + s)

Also, if you use a fixed token to indicate where to split, you do not need re.split and can use str.split instead:

result = mystr.split("!!")
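
Put together, the marker approach might look like the sketch below. The function name split_with_marker and the up-front check are additions for illustration; the check only guards against the marker already occurring in the input, and it is assumed that no splitter overlaps with the marker itself.

```python
def split_with_marker(text, splitters, marker="!!"):
    """Insert marker in front of every splitter, then split on the marker."""
    if marker in text:
        # The approach only works if the marker is foreign to the input.
        raise ValueError("marker already occurs in the input")
    for s in splitters:
        text = text.replace(s, marker + s)
    return text.split(marker)


mystr = "just some stupid string to illustrate my question"
result = split_with_marker(mystr, ["some", "illustrate"])
```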

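What you could also do (instead of relying on the replacement token not occurring anywhere else in the string, or on every split position being preceded by a non-word character) is find the split strings in the input using str.find and use string slicing to extract the pieces: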

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

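Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of the positions where the splitters can be found, for all splitters that are in the string (which is why i < 0 is excluded) and that are not right at the beginning (where we have (possibly) just split, so i == 0 is excluded as well). If any are left in the string, we yield (this is a generator function) everything up to (but excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If none are left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to turn the result into an actual list using list.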
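
One limitation worth noting: str.find reports only the first occurrence, so when a splitter sits at position 0, its later occurrences are never seen. The sketch below is a hedged variant (the name split_all is hypothetical, and empty splitter strings are skipped by assumption) that starts each search at index 1 instead. It also shows the list(...) call needed to materialise the generator.

```python
def split_all(string, splitters):
    """Variant of the generator above that starts each search at index 1."""
    while True:
        # find(s, 1) ignores a splitter at the very front but still sees later
        # occurrences; -1 (not found) is dropped by the i > 0 test.
        split_positions = [
            i for i in (string.find(s, 1) for s in splitters if s) if i > 0
        ]
        if not split_positions:
            yield string  # nothing left to split on
            return
        next_split = min(split_positions)
        yield string[:next_split]     # everything before the next splitter
        string = string[next_split:]  # keep the rest for the next round


result = list(split_all("just some stupid string to illustrate my question",
                        ["some", "illustrate"]))
repeated = list(split_all("some text some more", ["some"]))
```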

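Note that you could also replace each yield whatever with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end, though I do not consider that very good code style.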

TL;DR

If you are fine with using regular expressions, use

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

Otherwise, you can achieve the same using str.find with the following split function:

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

Votes: 7

Stack Overflow user

Answered 2014-08-20 19:45:32

Not especially elegant, but it avoids regex:

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(set(indexes))

print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']

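I should acknowledge here that a little more work is needed if a word in splitters occurs more than once, because str.index finds only the first occurrence of the word.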
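
A possible sketch of that extra work, which collects every occurrence by calling str.find repeatedly with a moving start offset. The function name split_at_all_occurrences is hypothetical, and str.find is used instead of str.index so that a splitter missing from the input is skipped rather than raising ValueError.

```python
def split_at_all_occurrences(text, splitters):
    """Cut text in front of every occurrence of every splitter."""
    cuts = {0, len(text)}
    for s in splitters:
        start = text.find(s)
        while start != -1:
            cuts.add(start)
            # Continue searching just past the occurrence that was found.
            start = text.find(s, start + 1)
    ordered = sorted(cuts)
    return [text[i:j] for i, j in zip(ordered, ordered[1:])]


mystr = "just some stupid string to illustrate my question"
result = split_at_all_occurrences(mystr, ["some", "illustrate"])
repeated = split_at_all_occurrences("a some b some c", ["some"])
```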

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/25412996
