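I am trying to find the fastest way to check whether a substring occurs in a string as a whole word or term. I am currently using RegEx, but I need to perform thousands of checks and RegEx is very slow.
There are many ways to approach this. The simplest check is substring in string: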
substring = "programming"
string = "Python is a high-level programming language"
substring in string
>>> True
On the other hand, this is a naive solution when the substring has to be found as a whole word or term:
substring = "program"
string = "Python is a high-level programming language"
substring in string
>>> True
Another solution is to split the string into a list of words and check whether the substring is in that list:
substring = "program"
string = "Python is a high-level programming language"
substring in string.split()
>>> False
However, this does not work if the substring is a multi-word term. To solve that, another solution is to use RegEx:
import re
substring = "high-level program"
string = "Python is a high-level programming language"
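re.search(r"\b{}\b".format(re.escape(substring)), string) is not None
>>> False
My biggest problem, however, is that this solution is very slow when you need to perform thousands of checks.
To mitigate this, I wrote a few approaches that, while faster than RegEx (for my purposes), are still much slower than substring in string: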
substring = "high-level program"
string = "Python is a high-level programming language"
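all([word in string.split() for word in substring.split()])
>>> False
Although simple, the approach above is not suitable because it ignores the word order of the substring: it returns True if the substring is "programming high-level", unlike the RegEx solution. So I wrote another approach that checks whether the substring is in a list of n-grams, where each n-gram has the same number of words as the substring: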
from nltk import ngrams
substring = "high-level program"
string = "Python is a high-level programming language"
ngram = list(ngrams(string.split(), len(substring.split())))
substring in [" ".join(tuples) for tuples in ngram]
>>> False
EDIT: Here is a less slow version that uses the same principle but only built-in functions:
substring = "high-level program"
string = "Python is a high-level programming language"
length = len(substring.split())
words = string.split()
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length + 1)]
substring in ngrams
>>> False
Does anyone know a faster way to find a substring in a string as a whole word or term?
Answered 2022-11-14 02:42:22
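For anyone who wants to measure the approaches above on their own data, here is a minimal timing sketch using the standard timeit module. The helper names regex_check and ngram_check are made up for this illustration, and the call count of 10,000 is an arbitrary assumption; absolute timings will vary by machine and input.

```python
import re
import timeit

string = "Python is a high-level programming language"
substring = "high-level program"


def regex_check(string, substring):
    # Whole-word/term match using word boundaries; re.escape guards
    # against special characters in the substring.
    return re.search(r"\b{}\b".format(re.escape(substring)), string) is not None


def ngram_check(string, substring):
    # Built-in n-gram approach: compare the substring against every run
    # of consecutive words with the same word count.
    words = string.split()
    length = len(substring.split())
    ngrams = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]
    return substring in ngrams


for check in (regex_check, ngram_check):
    seconds = timeit.timeit(lambda: check(string, substring), number=10_000)
    print(f"{check.__name__}: {seconds:.4f} s for 10,000 calls")
```

Both helpers return False for "high-level program" and True for "high-level programming" on the sample string, so the timings compare like with like.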
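Just loop over the string, take a slice of the substring's length at each position, and compare that slice with the substring; return True as soon as they are equal.
Illustration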
strs = "Coding"
substr = "ding"
slen = 4
i = 0
check = strs[i:slen+i] == substr
# 1st iteration
strs[0:4+0] == "ding"
"Codi" == "ding"  # False
# 2nd iteration
i = 1
strs[1:4+1] == "ding"
"odin" == "ding"  # False
# 3rd iteration
i = 2
strs[2:4+2] == "ding"
"ding" == "ding"  # True
Solution
def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen+i] == substring:
            return True
    return False
substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)
print(str_exist(string, substring, slen))
Output
True
Answered 2022-11-14 03:40:50
Take a look at this. I have added comments in the code to make it easier to understand what the algorithm does.
def check_substr(S: str, sub_str: str) -> bool:
    """
    This function tells whether the given sub-string
    in a string is present or not.

    Parameters
        S: str: The original string
        sub_str: str: The sub-string to be checked

    Returns
        result: boolean: Whether the string is present or not
    """
    i = 0
    pointer = 0
    while (i < len(S)):
        # This means that we are already in that word
        # whose sub-part is already matched. For eg:
        # `program` in `programming`. Therefore we are
        # going to skip the rest of the word and check
        # the next word instead.
        if (S[i] != ' ' and pointer == len(sub_str)):
            while (i < len(S) and S[i] != ' '):
                i += 1
            i += 1
            pointer = 0
            if (i >= len(S)):
                break
        # If we encounter a space, we check whether we
        # have already found the sub-string or not.
        elif (S[i] == ' ' and pointer == len(sub_str)):
            break
        if (S[i] == sub_str[pointer]):
            pointer += 1
        else:
            # If the current element of the original
            # string matched with the first element of
            # the sub-string then we increment the
            # pointer by 1. Otherwise we set it to 0.
            pointer = 1 if (S[i] == sub_str[0]) else 0
        i += 1
    return pointer == len(sub_str)
S = "Python is a high-level programming"
print(check_substr(S, "high-level program"))
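print(check_substr(S, "programming language"))
Output
False
False
Time complexity
O(n)
Edit:
As @PGHE pointed out in the comments, we could also check against punctuation, not just spaces. Since the OP did not mention punctuation, I am leaving this answer as it is.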
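As a rough illustration of the punctuation point, here is a minimal sketch that treats any non-alphanumeric character (or the start or end of the string) as a boundary. It is not the algorithm from this answer: it swaps in the built-in str.find, and the function name contains_whole_term is made up for this example. Note that it treats "-" and "_" as boundaries, so "level" would match inside "high-level".

```python
def contains_whole_term(string: str, substring: str) -> bool:
    """Return True if substring occurs in string with a non-alphanumeric
    character (or the string edge) on both sides."""
    if not substring:
        return False
    start = string.find(substring)
    while start != -1:
        end = start + len(substring)
        # A boundary is the string edge or any non-alphanumeric character.
        left_ok = start == 0 or not string[start - 1].isalnum()
        right_ok = end == len(string) or not string[end].isalnum()
        if left_ok and right_ok:
            return True
        # Keep looking for a later occurrence.
        start = string.find(substring, start + 1)
    return False


S = "Python is a high-level programming language."
print(contains_whole_term(S, "high-level program"))  # False
print(contains_whole_term(S, "language"))            # True
```

Because the scanning is delegated to str.find, the Python-level loop only runs once per occurrence of the substring rather than once per character.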
Answered 2022-12-02 21:52:01
Add a space on both sides of the substring and of the string, then test 'substring in string'.
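A minimal sketch of that idea, assuming words are separated only by single spaces (the function name contains_term is made up for this example):

```python
def contains_term(string: str, substring: str) -> bool:
    # Padding both sides with a space means a hit must start and end
    # at a space, i.e. at a word boundary, even at the string edges.
    return f" {substring} " in f" {string} "


string = "Python is a high-level programming language"
print(contains_term(string, "high-level program"))      # False
print(contains_term(string, "high-level programming"))  # True
```

This keeps the speed of the plain in operator, but it will miss a term that is followed by punctuation or separated by tabs or newlines.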
https://stackoverflow.com/questions/74426371