I have a string like this:

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

and a list like this:

my_list = ['C#', 'Django' 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

I want to extract every possible my_list entry from my_string. This is what I expect:

['PHP', 'Software-Engineering', 'C', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

This is what I tried:
import re

try:
    user_inps = re.findall(r'\w+', my_string)
    extracted_inputs = set()
    for user_inp in user_inps:
        if user_inp.lower() in set(map(lambda x: x.lower(), my_list)):
            extracted_inputs.add(user_inp)
except Exception:
    extracted_inputs = set()

But I got:

['php', 'C']

Efficiency is also a concern. Any help would be appreciated.
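For what it's worth, the attempt above can only ever return single words: \w+ tokenizes on word characters, so multi-word entries like "Databases and Queries" and hyphenated ones like "Software-Engineering" never appear as one token, and "C#" loses its "#". A quick check of that failure mode:

```python
import re

my_string = ("Hello, I need to find php, software-engineering, html, security "
             "and safety things or even Oracle in your dataset. #C should be "
             "another opetion, databases and queries")
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload',
           'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# \w+ yields one word per token, so only the single-word list entries
# 'PHP' and 'C' can ever match after lowercasing both sides
tokens = re.findall(r'\w+', my_string)
lowered = {x.lower() for x in my_list}
matches = sorted(t for t in tokens if t.lower() in lowered)
print(matches)  # ['C', 'php']
```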
Posted on 2020-05-14 23:30:30
If you want to avoid re, you can do most of this with plain Python. It will be fast enough for a list of a few thousand words.

Basic plan: clean the punctuation, tokenize everything, and use sets for the matching. For a small application, you can modify the keyword tokens to omit looking up things like "and".
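The "omit looking up things like and" tweak could be sketched like this (the STOPWORDS set is my own name; extend it as needed):

```python
# sketch: drop common stopwords when building the token -> phrase table,
# so incidental words like "and" cannot trigger a match on their own
STOPWORDS = {'and', 'or', 'the', 'of'}  # assumption: extend as needed

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload',
           'React', 'Flask', 'IT-Security market', 'Databases and Queries']

keywords = {}
for phrase in my_list:
    for t in {w.lower() for w in phrase.replace('-', ' ').split()}:
        if t not in STOPWORDS:
            keywords[t] = phrase

print('and' in keywords)      # False
print(keywords['databases'])  # Databases and Queries
```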
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# make a table of token : phrase
keywords = {}
for word in my_list:
    # split each phrase into tokens
    tokens = {w.lower() for w in word.replace('-', ' ').split()}
    for t in tokens:
        keywords[t] = word

# tokenize the string my_string
# note: this is specifically tailored to your input with commas and hyphens,
# you may need to make this more universal
my_string_tokens = {t.lower() for t in my_string.replace(',', '').replace('-', ' ').split()}

# now you can just intersect the sets, which is much more efficient than nested looping
matches = my_string_tokens & set(keywords.keys())

for match in matches:  # do what you want here...
    print(f'token: {match:20s}-> {keywords[match]}')

Produces:
token: queries -> Databases and Queries
token: php -> PHP
token: oracle -> Oracle Cload
token: engineering -> Software-Engineering
token: databases -> Databases and Queries
token: software -> Software-Engineering
token: and -> Databases and Queries
token: security -> IT-Security market

Posted on 2020-05-14 22:49:58
Since the solution needs to be efficient, and you are starting with several thousand entries, I'd suggest using a Bloom filter implementation.
TL;DR
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. Read more about it, or try one out here.
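To make the idea concrete, here is a minimal pure-Python toy version of the data structure (my own illustration, not the bloom_filter package used in the answer below):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hash positions per item in a fixed bit array."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # derive k independent positions by salting a cryptographic hash
        for i in range(self.hashes):
            digest = hashlib.sha256(f'{i}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # may report a false positive, but never a false negative
        return all(self.bits[p] for p in self._positions(item))

bloom = TinyBloom()
bloom.add('databases and queries')
print('databases and queries' in bloom)  # True -- membership is never missed
print('react' in bloom)                  # almost certainly False
```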
Code:
from bloom_filter import BloomFilter  # pip install bloom-filter
from nltk.util import ngrams
import re


def clean(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', ' ', s)


def clean_wo_space(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', '', s)


def _initialize_bloom(phrases: list):
    bloom = BloomFilter(max_elements=1000, error_rate=0.1)
    for phrase in phrases:
        bloom.add(clean_wo_space(phrase))
    return bloom


def main():
    phrases_repo = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cloud', 'React', 'Flask',
                    'IT-Security market', 'Databases and Queries']
    input_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. C# should be another opetion, databases and queries"
    initialized_bloom = _initialize_bloom(phrases_repo)
    n_grams = set([' '.join(gram) for n in range(1, 4)
                   for gram in ngrams(clean(input_string).split(), n)])
    matches = [i for i in n_grams if clean_wo_space(i) in initialized_bloom]
    print(matches)  # output: ['c#', 'databases and queries', 'php', 'software engineering']


if __name__ == '__main__':
    main()

This approach:
1. Takes your to_match keyword repository array and parses it through a normalization method: lower-casing, removing special characters, and so on.
2. Creates a Bloom filter object, storing your normalized to_match entries as hashes.
3. With the Bloom filter ready, takes the input string and parses it through the same normalizer method, so that both strings are in the same normalized format.
4. Breaks the normalized input into n-grams, where n is the maximum number of words in a phrase to match:

   to_match = ["hello", "world", "Foo Bar", "Hey there it's me"]  # n would be 4

5. Iterates over the n-grams array to check existence in the Bloom filter. If it returns true, the phrase is (almost certainly) present.

Advantages of this approach: Bloom filter lookups are constant-time and memory-efficient, so the matching stays fast even as the keyword repository grows to many thousands of entries.
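The n-gram step can also be done without nltk; this is a plain-Python sketch of what nltk.util.ngrams produces for word tokens:

```python
def word_ngrams(tokens, max_n):
    """Every 1..max_n word n-gram of a token list, joined with spaces."""
    return {' '.join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

grams = word_ngrams(['databases', 'and', 'queries'], 3)
print(sorted(grams))
# ['and', 'and queries', 'databases', 'databases and',
#  'databases and queries', 'queries']
```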
Posted on 2020-05-14 22:38:35
We can split the keywords in the list and search for each element in string.lower(). If there is a hyphen, we need to check for it and split on the hyphen as well.

I also assume you forgot to add a , after Django in the list.
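One caveat worth noting with plain substring search (my note, not part of the original answer): word.lower() in my_string.lower() also matches inside longer words, which the len(word) > 1 guard below cannot fully prevent:

```python
# 'react' is found inside "reaction", so the keyword 'React' would be
# reported even though the string never mentions the library
s = "a chemical reaction occurred"
print('react' in s.lower())  # True
```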
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

result = []
for idx, keyword in enumerate(my_list):
    if '-' in keyword:
        keyword = keyword.split('-')
    else:
        keyword = keyword.split()
    for word in keyword:
        if word.lower() in my_string.lower() and my_list[idx] not in result and len(word) > 1:
            result.append(my_list[idx])

result

['Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

https://stackoverflow.com/questions/61808382