文章/答案/技术大牛

发布

社区首页 >问答首页 >文本搜索优化

问文本搜索优化
EN

Code Review用户

提问于 2019-04-22 19:45:20

回答 1查看 55关注 0票数 5

下面的代码接受两个参数。一组句子和一组查询。每个查询都有一个或多个单词。下面的函数为每个包含该句子中所有单词的查询打印出句子索引。

这是功能代码。但是，由于它是如何编写的，所以没有对大型输入进行优化。

我们如何使它更加优化运行时呢？任何想法都将不胜感激。

def textQueries(sentences, queries):
    sentences_sets = [set(s.split(' ')) for s in sentences]
    for i, words in enumerate(queries):
        results = ''
        for j, s in enumerate(sentences_sets):
            exists = True
            for word in words.split(' '):
                if word not in s:
                    exists = False
            if exists:
                results += ' %d' % j if results else '%d' % j
        if results:
            print results
        else: print -1

python

performance

回答 1

Code Review用户

发布于 2019-04-23 05:11:43

欢迎来到代码评审！

因此，我假设queries是我们正在寻找的单词列表，其中任何一个都是成功的。我还假设句子是一个字符串列表，每个元素都可以细分成一个单词列表。

Bugs:

我注意到您现有的代码不会看到“这是一个测试”。因为这段时间里有“测试”这个词。您可能需要将逗号、句点、括号等分隔开才能正常工作，但这可能是出于设计。我不太确定。

添加一个中断：

我看到的第一件可以改进的是你不需要一个旗子。而不是设置exists = False，然后继续循环，只要添加结果，如果它在那里，然后只是中断。如果已经找到了，则不需要搜索整个句子，如果要在循环中执行一个操作，则不需要标记。

所以：

for word in words.split(' '):
    if word in s:
        results += ' %d' % j if results else '%d' % j
        break

如果开头匹配，这将为长句节省几次迭代。

尝试：

现在，如果您想彻底重写O(n)传递的整个过程(这是文本搜索可能得到的最好的)，那么您可以使用"Trie“数据结构。它的工作方式是将你所有的文本，每个字一次一个字符，插入到一个树的变体中。这是O(n)插入，其中n是输入数据中的字符数。然后，对于每一个搜索词，你都会逐字逐句地找到与之匹配的单词。这是O(n)，其中n是每个搜索项中的字符数，因此理论上影响比前一个要小得多。

我修改了在这里发布的代码(在公共域下)：https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1

它大部分不是我自己的，但我修改了它以适应你的功能。

class Node(object):
    def __init__(self, character):
        self.character = character
        self.children = []
        # A flag to say whether or not we are at the end of our current word.
        self.finished = False
        # How many times this character appeared in the addition process
        self.count = 1

class Trie(object):
    def __init__(self):
        # Create a root node with a non-character attribute so it won't be confused
        # with any of the entries.
        self.root = Node(None)

    def add(self, word):
        # Set our current node to the start/root.
        current_node = self.root
        for char in word:
            in_child = False
            # Search for the character in the children of the present node
            for child in current_node.children:
                if child.character == char:
                    # We found it, increase the counter by 1 to keep track that another
                    # word has it as well
                    child.count += 1
                    # And point the node to the child that contains this char
                    current_node = child
                    in_child = True
                    break
            # We did not find it so add a new chlid
            if not in_child:
                new_node = Node(char)
                current_node.children.append(new_node)
                # And then point node to the new child
                current_node = new_node
        # Everything finished. Mark it as the end of a word.
        current_node.word_finished = True


    def find_term(self, term):
        """
        Check and return
          1. If the prefix exsists in any of the words we added so far
          2. If yes then how may words actually have the prefix
        """
        node = self.root
        # If the root node has no children, then return False.
        # Because it means we are trying to search in an empty trie
        if not self.root.children:
            return False, 0
        for char in term:
            char_not_found = True
            # Search through all the children of the present `node`
            for child in node.children:
                if child.character == char:
                    # We found the char existing in the child.
                    char_not_found = False
                    # Assign node as the child containing the char and break
                    node = child
                    break
            # Return False anyway when we did not find a char.
            if char_not_found:
                return False, 0
        # Well, we are here means we have found the prefix. Return true to indicate that
        # And also the counter of the last node. This indicates how many words have this
        # prefix
        return True, node.count

def textQueries(sentences, queries):
    trie = Trie()
    sentences_list = [s.split(' ') for s in sentences]
    for sentence in sentences_list:
        for word in sentence:
            trie.add(word)

    for words in queries:
        words_list = words.split(' ')
        results_list = []
        for word in words_list:
            results = trie.find_term(word)
            if results[0]:
                results_list.append(results[1])
        if results_list:
            print results_list
        else: print -1

这将返回：

Venkee Enterprises:>python test.py 
[2]
[1]

我输入的数据如下：

sentences = ["Hello world, this is a sentence.", "This is also a sentence.", "This, however, is an incomplete"]
queries = ["sentence", "Hello"]

因为它能够找到两个与查询“句子”匹配的单词和一个匹配"Hello“的单词。

希望这能有所帮助。

票数 4

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/217904

复制

相似问题

问文本搜索优化
EN

回答 1

Code Review用户

Bugs:

添加一个中断：

尝试：

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问文本搜索优化EN

回答 1

Code Review用户

Bugs:

添加一个中断：

尝试：

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问文本搜索优化
EN