What is the fastest way to check whether a substring exists in a string as a whole word or term, like a RegEx with word boundaries?

Stack Overflow user
Asked 2022-11-14 02:30:13
Answers: 3, Views: 153, Followers: 0, Votes: 5

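I am looking for the fastest way to check whether a substring occurs in a string as a whole word or term. I currently use RegEx, but I need to run thousands of checks, and RegEx is very slow.

There are many ways to approach this. The simplest check is substring in string: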

substring = "programming"
string = "Python is a high-level programming language"

substring in string

>>> True

However, this is a naive solution when the substring has to be found as a whole word or term:

substring = "program"
string = "Python is a high-level programming language"

substring in string

>>> True

Another solution is to split the string into a list of words and check whether the substring is in that list:

substring = "program"
string = "Python is a high-level programming language"

substring in string.split()

>>> False

However, this does not work when the substring is a multi-word term. To handle that case, another solution is to use RegEx:

import re

substring = "high-level program"
string = "Python is a high-level programming language"

# re.escape keeps characters such as "." or "+" in the substring literal
re.search(r"\b{}\b".format(re.escape(substring)), string) is not None

>>> False
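When the same terms are checked again and again, part of the RegEx cost is rebuilding the pattern on every call. The following is a minimal sketch, not part of the original question, that escapes the term and caches the compiled pattern. The helper names whole_term_pattern and contains_term are made up for illustration, and whether this is fast enough depends on the workload.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def whole_term_pattern(term: str) -> re.Pattern:
    # Hypothetical helper: build the word-boundary pattern once per term.
    # re.escape treats the term as literal text, not as a regex.
    return re.compile(r"\b{}\b".format(re.escape(term)))


def contains_term(term: str, text: str) -> bool:
    # Reuses the cached compiled pattern on repeated calls.
    return whole_term_pattern(term).search(text) is not None


text = "Python is a high-level programming language"
print(contains_term("high-level program", text))      # False
print(contains_term("high-level programming", text))  # True
```

The cache only helps when terms repeat; for thousands of distinct one-off terms the compile cost is still paid once per term.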

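My main problem, though, is that this solution is very slow when you need to run thousands of checks.

To mitigate this, I wrote some alternatives that are faster than RegEx (for my purposes) but still much slower than substring in string.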

substring = "high-level program"
string = "Python is a high-level programming language"

all([word in string.split() for word in substring.split()])

>>> False

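Although simple, the approach above is not suitable because it ignores the word order of the substring: it returns True when the substring is "programming high-level", unlike the RegEx solution. So I wrote another method that checks whether the substring is in a list of n-grams, where each n-gram has the same number of words as the substring: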

from nltk import ngrams

substring = "high-level program"
string = "Python is a high-level programming language"

ngram = list(ngrams(string.split(), len(substring.split())))

substring in [" ".join(tuples) for tuples in ngram]

>>> False

Edit: here is a less slow version that uses the same principle but only built-in functions:

substring = "high-level program"
string = "Python is a high-level programming language"

length = len(substring.split())
words = string.split()
# "+ 1" so that the last window of words is included as well
ngrams = [" ".join(words[i:i+length]) for i in range(len(words) - length + 1)]

substring in ngrams

>>> False

Does anyone know a faster way to find a substring in a string as a whole word or term?
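To compare candidates on real data, it can help to time them side by side. The sketch below is an assumption about how such a measurement could be set up with the standard timeit module. The function names are invented here, the input is the toy sentence from the question, and the numbers will vary by machine, so no timings are shown.

```python
import re
import timeit

string = "Python is a high-level programming language"
substring = "high-level program"


def with_in():
    # Baseline only: plain containment, not a whole-word check.
    return substring in string


def with_ngrams():
    # Built-in n-gram approach from the question.
    length = len(substring.split())
    words = string.split()
    grams = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]
    return substring in grams


def with_regex():
    # Word-boundary RegEx approach from the question.
    return re.search(r"\b{}\b".format(re.escape(substring)), string) is not None


for func in (with_in, with_ngrams, with_regex):
    seconds = timeit.timeit(func, number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```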


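3 Answers

Stack Overflow user

Answered 2022-11-14 02:42:22

Simply loop over the string, take a slice with the same length as the substring at each position, and compare that slice with the substring. Return True as soon as a slice equals the substring.

Illustration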

strs = "Coding"
substr = "ding"
slen = 4
i = 0

# Compare the slice that starts at i with the substring
check = strs[i:slen+i] == substr

# 1st iteration (i = 0)
# strs[0:4+0] == "ding"
# "Codi" == "ding"  -> False

# 2nd iteration (i = 1)
# strs[1:4+1] == "ding"
# "odin" == "ding"  -> False

# 3rd iteration (i = 2)
# strs[2:4+2] == "ding"
# "ding" == "ding"  -> True

Solution

def str_exist(string, substring, slen):
    for i in range(len(string)):
        if string[i:slen+i] == substring:
            return True
    return False

substring = "high-level program"
string = "Python is a high-level programming language"
slen = len(substring)

print(str_exist(string, substring, slen))

Output

True
Votes: 1

Stack Overflow user

Answered 2022-11-14 03:40:50

Take a look at this. I added comments to the code to make it easier to understand what the algorithm does.

def check_substr(S: str, sub_str: str) -> bool:
  """
  This function tells whether the given sub-string 
  in a string is present or not.
  
  Parameters
  S:       str: The original string
  sub_str: str: The sub-string to be checked
  
  Returns
  result: boolean: Whether the string is present or not
  """
  i = 0
  pointer = 0
  
  while (i < len(S)):
    # This means that we are already in that word
    # whose sub-part is already matched. For eg:
    # `program` in `programming`. Therefore we are
    # going to skip the rest of the word and check
    # the next word instead.
    if (S[i] != ' ' and pointer == len(sub_str)):
      while (i < len(S) and S[i] != ' '):
        i += 1
      i += 1
      pointer = 0
      
      if (i >= len(S)):
        break
    
    # If we encounter a space, we check whether we
    # have already found the sub-string or not.
    elif (S[i] == ' ' and pointer == len(sub_str)):
      break
      
    if (S[i] == sub_str[pointer]):
      pointer += 1
      
    else:
      # If the current element of the original 
      # string matched with the first element of
      # the sub-string then we increment the 
      # pointer by 1. Otherwise we set it to 0.
      pointer = 1 if (S[i] == sub_str[0]) else 0
      
    i += 1
  
  return pointer == len(sub_str)
  
S = "Python is a high-level programming"
print(check_substr(S, "high-level program"))
print(check_substr(S, "programming language"))

Output

False
False

Time complexity

O(n)

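Edit:

As @PGHE pointed out in the comments, we could also treat punctuation as a word boundary, not just spaces. Since the OP did not mention punctuation, I am leaving this answer as it is.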
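For readers who do need punctuation handled, a variant might look like the following sketch. It is not the code from this answer: it relies on str.find and then checks the characters on either side of each hit. The names contains_whole_term and is_word_char are hypothetical, and the assumption is that letters, digits, and underscores count as word characters, which approximates RegEx \b for terms that start and end with a word character.

```python
def is_word_char(ch: str) -> bool:
    # Assumption: letters, digits and "_" are word characters.
    return ch.isalnum() or ch == "_"


def contains_whole_term(text: str, term: str) -> bool:
    # Hypothetical helper: True if term occurs with no word character
    # directly before or after it.
    if not term:
        return False
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        left_ok = start == 0 or not is_word_char(text[start - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            return True
        # Try the next occurrence, if any.
        start = text.find(term, start + 1)
    return False


S = "Python is a high-level programming language."
print(contains_whole_term(S, "high-level program"))    # False
print(contains_whole_term(S, "programming language"))  # True
```

Here the trailing period after "language" counts as a boundary, so the second call succeeds even though no space follows the term.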

Votes: 1

Stack Overflow user

Answered 2022-12-02 21:52:01

Add a space to both ends of the substring and of the string, then test substring in string.
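A minimal sketch of this idea, assuming words are separated by single spaces and punctuation does not need to count as a boundary. The function name contains_padded is made up for illustration.

```python
def contains_padded(substring: str, string: str) -> bool:
    # Padding both values with spaces forces the match to start and end
    # at a space, or at the start/end of the original string.
    return f" {substring} " in f" {string} "


string = "Python is a high-level programming language"
print(contains_padded("high-level program", string))    # False
print(contains_padded("programming language", string))  # True
print(contains_padded("program", string))               # False
```

This keeps the check on the fast substring in string path, at the cost of building two padded strings per call. If the same string is tested against many substrings, padding it once up front avoids repeating that work.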

Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/74426371
