I have a string like this:

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

and a list like this:

my_list = ['C#', 'Django' 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

I want to extract every possible my_list entry from my_string. This is what I expect:

['PHP', 'Software-Engineering', 'C', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

This is what I tried:
import re

try:
    user_inps = re.findall(r'\w+', my_string)
    extracted_inputs = set()
    for user_inp in user_inps:
        if user_inp.lower() in set(map(lambda x: x.lower(), my_list)):
            extracted_inputs.add(user_inp)
except Exception:
    extracted_inputs = set()

But I got:

['php', 'C']

Efficiency is also a concern. Any help would be appreciated.
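For what it's worth, the attempt above can only ever return single words: \w+ tokenizes on word characters, so multi-word entries like "Databases and Queries" and hyphenated ones like "Software-Engineering" never appear as one token, and "C#" loses its "#". A quick check of that failure mode:

```python
import re

my_string = ("Hello, I need to find php, software-engineering, html, security "
             "and safety things or even Oracle in your dataset. #C should be "
             "another opetion, databases and queries")
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload',
           'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# \w+ yields one word per token, so only the single-word list entries
# 'PHP' and 'C' can ever match after lowercasing both sides
tokens = re.findall(r'\w+', my_string)
lowered = {x.lower() for x in my_list}
matches = sorted(t for t in tokens if t.lower() in lowered)
print(matches)  # ['C', 'php']
```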
Posted on 2020-05-14 23:30:30
If you want to avoid re, you can do most of this with plain Python. It will be fast enough for a list of a few thousand words.

Basic plan: clean the punctuation, tokenize everything, and use sets for the matching. For a small application, you can modify the keyword tokens to omit looking up things like "and".
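The "omit looking up things like and" tweak could be sketched like this (the STOPWORDS set is my own name; extend it as needed):

```python
# sketch: drop common stopwords when building the token -> phrase table,
# so incidental words like "and" cannot trigger a match on their own
STOPWORDS = {'and', 'or', 'the', 'of'}  # assumption: extend as needed

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload',
           'React', 'Flask', 'IT-Security market', 'Databases and Queries']

keywords = {}
for phrase in my_list:
    for t in {w.lower() for w in phrase.replace('-', ' ').split()}:
        if t not in STOPWORDS:
            keywords[t] = phrase

print('and' in keywords)      # False
print(keywords['databases'])  # Databases and Queries
```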
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# make a table of token : phrase
keywords = {}
for word in my_list:
    # split each phrase into tokens
    tokens = {w.lower() for w in word.replace('-', ' ').split()}
    for t in tokens:
        keywords[t] = word

# tokenize the string my_string
# note: this is specifically tailored to your input with commas and hyphens,
# you may need to make this more universal
my_string_tokens = {t.lower() for t in my_string.replace(',', '').replace('-', ' ').split()}

# now you can just intersect the sets, which is much more efficient than nested looping
matches = my_string_tokens & set(keywords.keys())

for match in matches:  # do what you want here...
    print(f'token: {match:20s}-> {keywords[match]}')

Produces:
token: queries -> Databases and Queries
token: php -> PHP
token: oracle -> Oracle Cload
token: engineering -> Software-Engineering
token: databases -> Databases and Queries
token: software -> Software-Engineering
token: and -> Databases and Queries
token: security -> IT-Security market

Posted on 2020-05-14 22:49:58
Since the solution needs to be efficient, and you are starting with several thousand entries, I'd suggest using a Bloom filter implementation.
TL;DR
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. Read more about it, or try one out here.
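To make the idea concrete, here is a minimal pure-Python toy version of the data structure (my own illustration, not the bloom_filter package used in the answer below):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hash positions per item in a fixed bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # may report a false positive, but never a false negative
        return all(self.bits[p] for p in self._positions(item))

bloom = TinyBloom()
bloom.add('databases and queries')
print('databases and queries' in bloom)  # True -- membership is never missed
print('react' in bloom)                  # almost certainly False
```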
Code:
from bloom_filter import BloomFilter  # pip install bloom-filter
from nltk.util import ngrams
import re


def clean(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', ' ', s)


def clean_wo_space(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', '', s)


def _initialize_bloom(phrases: list):
    bloom = BloomFilter(max_elements=1000, error_rate=0.1)
    for phrase in phrases:
        bloom.add(clean_wo_space(phrase))
    return bloom


def main():
    phrases_repo = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cloud', 'React', 'Flask',
                    'IT-Security market', 'Databases and Queries']
    input_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. C# should be another opetion, databases and queries"
    initialized_bloom = _initialize_bloom(phrases_repo)
    n_grams = set([' '.join(gram) for n in range(1, 4)
                   for gram in ngrams(clean(input_string).split(), n)])
    matches = [i for i in n_grams if clean_wo_space(i) in initialized_bloom]
    print(matches)  # output: ['c#', 'databases and queries', 'php', 'software engineering']


if __name__ == '__main__':
    main()

This approach:
1. Takes your to_match keyword repository array and parses it through a normalization method: lower-casing, removing special characters, and so on.
2. Creates a Bloom filter object, storing your normalized to_match entries as hashes.
3. With the Bloom filter ready, takes the input string and parses it through the same normalizer method, so that both strings are in the same normalized format.
4. Breaks the normalized input into n-grams, where n is the maximum number of words in a phrase to match:

   to_match = ["hello", "world", "Foo Bar", "Hey there it's me"]  # n would be 4

5. Iterates over the n-grams array to check existence in the Bloom filter. If it returns true, the phrase is (almost certainly) present.

Advantages of this approach: Bloom filter lookups are constant-time and memory-efficient, so the matching stays fast even as the keyword repository grows to many thousands of entries.
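The n-gram step can also be done without nltk; this is a plain-Python sketch of what nltk.util.ngrams produces for word tokens:

```python
def word_ngrams(tokens, max_n):
    """Every 1..max_n word n-gram of a token list, joined with spaces."""
    return {' '.join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

grams = word_ngrams(['databases', 'and', 'queries'], 3)
print(sorted(grams))
# ['and', 'and queries', 'databases', 'databases and',
#  'databases and queries', 'queries']
```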
Posted on 2020-05-14 22:38:35
We can split the keywords in the list and search for each element in string.lower(). If there is a hyphen, we need to check for it and split on the hyphen as well.

I also assume you forgot to add a , after Django in the list.
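One caveat worth noting with plain substring search (my note, not part of the original answer): word.lower() in my_string.lower() also matches inside longer words, which the len(word) > 1 guard below cannot fully prevent:

```python
# 'react' is found inside "reaction", so the keyword 'React' would be
# reported even though the string never mentions the library
s = "a chemical reaction occurred"
print('react' in s.lower())  # True
```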
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

result = []
for idx, keyword in enumerate(my_list):
    if '-' in keyword:
        keyword = keyword.split('-')
    else:
        keyword = keyword.split()
    for word in keyword:
        if word.lower() in my_string.lower() and my_list[idx] not in result and len(word) > 1:
            result.append(my_list[idx])

result

['Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

https://stackoverflow.com/questions/61808382