What is the fastest way to check whether a substring exists in a string as a whole word or term, like a RegEx with boundaries?

Stack Overflow user

Asked 2022-11-14 02:30:13

Answers: 3 · Views: 153 · Followers: 0 · Votes: 5

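I am trying to find the fastest way to check whether a substring occurs in a string as a whole word or term. Currently I am using RegEx, but I need to perform thousands of checks and RegEx is very slow.

There are many ways to approach this. The simplest check is substring in string: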

substring = "programming"
string = "Python is a high-level programming language"

substring in string

>>> True

However, this is a naive solution when the substring must be found as a whole word or term:

substring = "program"
string = "Python is a high-level programming language"

substring in string

>>> True

Another solution is to split the string into a list of words and check whether the substring is in that list:

substring = "program"
string = "Python is a high-level programming language"

substring in string.split()

>>> False

Nevertheless, this does not work when the substring is a multi-word term. To handle that, another solution is to use RegEx:

import re

substring = "high-level program"
string = "Python is a high-level programming language"

# re.escape keeps any regex metacharacters in the substring literal
re.search(r"\b{}\b".format(re.escape(substring)), string) is not None

>>> False

My biggest problem, though, is that this solution is very slow when thousands of checks are needed.
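One thing that may help on the RegEx route, offered as a sketch rather than a measured fix: escape the term with re.escape and compile each boundary pattern only once, so repeated checks reuse the compiled object. The helper names word_pattern and contains_term are hypothetical and not from the question. Note that the re module already keeps an internal cache of recently used patterns, so the gain may be modest.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def word_pattern(term: str) -> re.Pattern:
    # Hypothetical helper: compile the boundary pattern once per distinct term.
    # re.escape stops characters such as "+" or "." being read as regex syntax.
    return re.compile(r"\b{}\b".format(re.escape(term)))


def contains_term(term: str, text: str) -> bool:
    # Uses \b word boundaries on both sides of the escaped term.
    return word_pattern(term).search(text) is not None


string = "Python is a high-level programming language"
print(contains_term("high-level program", string))      # False
print(contains_term("high-level programming", string))  # True
```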

To mitigate this, I created some approaches that, although faster than RegEx (for my use case), are still much slower than substring in string.

substring = "high-level program"
string = "Python is a high-level programming language"

all([word in string.split() for word in substring.split()])

>>> False

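Although simple, the approach above is not suitable because it ignores the word order of the substring: it returns True if the substring is "programming high-level", unlike the RegEx solution. So I created another approach that checks whether the substring is in a list of n-grams, where each n-gram has the same number of words as the substring: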

from nltk import ngrams

substring = "high-level program"
string = "Python is a high-level programming language"

ngram = list(ngrams(string.split(), len(substring.split())))

substring in [" ".join(tuples) for tuples in ngram]

>>> False

Edit: Here is a less slow version that uses the same principle but only built-in functions:

substring = "high-level program"
string = "Python is a high-level programming language"

length = len(substring.split())
words = string.split()
# "+ 1" so the n-gram ending at the last word is included
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length + 1)]

substring in ngrams

>>> False
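The same word-window idea can also be written without joining the words back into strings, by comparing list slices directly. This is only a minimal sketch: the function name contains_words is hypothetical, and it assumes whitespace-delimited words as str.split() yields them.

```python
def contains_words(text: str, term: str) -> bool:
    # Hypothetical helper: whitespace-delimited whole-word match that keeps word order.
    words = text.split()
    target = term.split()
    n = len(target)
    if n == 0:
        return False
    # "+ 1" so the window ending at the last word is also checked.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


string = "Python is a high-level programming language"
print(contains_words(string, "high-level program"))      # False
print(contains_words(string, "programming language"))    # True
print(contains_words(string, "programming high-level"))  # False
```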

Does anyone know a faster way to find a substring in a string as a whole word or term?
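To compare candidates on your own data, a small timeit harness along these lines may help. It is only a sketch: the function names and the call count are placeholders, and timings vary by machine and input, so no figures are shown here.

```python
import re
import timeit

string = "Python is a high-level programming language"
substring = "high-level program"


def with_regex() -> bool:
    return re.search(r"\b{}\b".format(re.escape(substring)), string) is not None


def with_split() -> bool:
    length = len(substring.split())
    words = string.split()
    ngrams = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]
    return substring in ngrams


def with_in() -> bool:
    # Baseline only: this also matches partial words.
    return substring in string


for func in (with_regex, with_split, with_in):
    seconds = timeit.timeit(func, number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```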

python

regex

string

substring

contains

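3 Answers

Stack Overflow user

Posted 2022-11-14 02:42:22

Just loop over the string, slice it to the length of the substring at each position, and compare each slice with the substring; return True if a slice equals the substring.

Illustration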

strs = "Coding"
substr = "ding"
slen = 4
i = 0

check = strs[i:slen+i] == substr

# 1st iteration (i = 0)
# strs[0:4] == "ding"  ->  "Codi" == "ding"  ->  False

# 2nd iteration (i = 1)
# strs[1:5] == "ding"  ->  "odin" == "ding"  ->  False

# 3rd iteration (i = 2)
# strs[2:6] == "ding"  ->  "ding" == "ding"  ->  True

Solution

def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen+i] == substring:
            return True
    return False

substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)

print(str_exist(string, substring, slen))

Output

True

Votes: 1

Stack Overflow user

Posted 2022-11-14 03:40:50

Take a look at this. I have added comments to the code to make it easier to understand what the algorithm does.

def check_substr(S: str, sub_str: str) -> bool:
  """
  This function tells whether the given sub-string 
  in a string is present or not.
  
  Parameters
  S:       str: The original string
  sub_str: str: The sub-string to be checked
  
  Returns
  result: boolean: Whether the string is present or not
  """
  i = 0
  pointer = 0
  
  while (i < len(S)):
    # This means that we are already in that word
    # whose sub-part is already matched. For eg:
    # `program` in `programming`. Therefore we are
    # going to skip the rest of the word and check
    # the next word instead.
    if (S[i] != ' ' and pointer == len(sub_str)):
      while (i < len(S) and S[i] != ' '):
        i += 1
      i += 1
      pointer = 0
      
      if (i >= len(S)):
        break
    
    # If we encounter a space, we check whether we
    # have already found the sub-string or not.
    elif (S[i] == ' ' and pointer == len(sub_str)):
      break
      
    if (S[i] == sub_str[pointer]):
      pointer += 1
      
    else:
      # If the current element of the original 
      # string matched with the first element of
      # the sub-string then we increment the 
      # pointer by 1. Otherwise we set it to 0.
      pointer = 1 if (S[i] == sub_str[0]) else 0
      
    i += 1
  
  return pointer == len(sub_str)
  
S = "Python is a high-level programming"
print(check_substr(S, "high-level program"))
print(check_substr(S, "programming language"))

Output

False
False

Time complexity

O(n)

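Edit:

As @PGHE pointed out in the comments, we could also check against punctuation, not just spaces. Since the OP did not mention punctuation, I am leaving this answer as it is.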
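A sketch of the punctuation-aware idea, treating any non-alphanumeric character (spaces and punctuation alike) as a boundary. It uses str.find for the scan and then checks the character on either side of each hit. The name contains_whole is hypothetical, and counting "_" as a boundary is an assumption that differs from RegEx \b, so this is not an exact equivalent.

```python
def contains_whole(text: str, term: str) -> bool:
    # Hypothetical helper: True if `term` occurs with a non-alphanumeric
    # character (or the start/end of `text`) on both sides.
    if not term:
        return False
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        left_ok = start == 0 or not text[start - 1].isalnum()
        right_ok = end == len(text) or not text[end].isalnum()
        if left_ok and right_ok:
            return True
        # Resume the search one character further on.
        start = text.find(term, start + 1)
    return False


S = "Python is a high-level programming language."
print(contains_whole(S, "high-level program"))    # False
print(contains_whole(S, "programming language"))  # True
```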

Votes: 1

Stack Overflow user

Posted 2022-12-02 21:52:01

Add a space to both ends of the substring and of the string, then test 'substring in string'.
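As a minimal sketch of this suggestion (the function name contains_padded is hypothetical, and it assumes words are separated by single spaces with no punctuation attached to them):

```python
def contains_padded(text: str, term: str) -> bool:
    # Pad both sides with a space so the term can only match at word edges,
    # including at the very start or end of the text.
    return f" {term} " in f" {text} "


string = "Python is a high-level programming language"
print(contains_padded(string, "high-level program"))    # False
print(contains_padded(string, "programming language"))  # True
print(contains_padded(string, "Python"))                # True
```

Because this stays on the built-in substring search, it should cost little more than a plain substring in string check plus the string concatenation.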

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/74426371
