首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >文本搜索优化

文本搜索优化
EN

Code Review用户
提问于 2019-04-22 19:45:20
回答 1查看 55关注 0票数 5

下面的代码接受两个参数。一组句子和一组查询。每个查询都有一个或多个单词。下面的函数为每个包含该句子中所有单词的查询打印出句子索引。

这是功能代码。但是,由于它是如何编写的,所以没有对大型输入进行优化。

我们如何使它更加优化运行时呢?任何想法都将不胜感激。

代码语言:javascript
复制
def textQueries(sentences, queries):
    sentences_sets = [set(s.split(' ')) for s in sentences]
    for i, words in enumerate(queries):
        results = ''
        for j, s in enumerate(sentences_sets):
            exists = True
            for word in words.split(' '):
                if word not in s:
                    exists = False
            if exists:
                results += ' %d' % j if results else '%d' % j
        if results:
            print results
        else: print -1
EN

回答 1

Code Review用户

发布于 2019-04-23 05:11:43

欢迎来到代码评审!

因此,我假设queries是我们正在寻找的单词列表,其中任何一个都是成功的。我还假设句子是一个字符串列表,每个元素都可以细分成一个单词列表。

Bugs:

我注意到您现有的代码不会看到“这是一个测试”。因为这段时间里有“测试”这个词。您可能需要将逗号、句点、括号等分隔开才能正常工作,但这可能是出于设计。我不太确定。

添加一个中断:

我看到的第一件可以改进的是你不需要一个旗子。而不是设置exists = False,然后继续循环,只要添加结果,如果它在那里,然后只是中断。如果已经找到了,则不需要搜索整个句子,如果要在循环中执行一个操作,则不需要标记。

所以:

代码语言:javascript
复制
for word in words.split(' '):
    if word in s:
        results += ' %d' % j if results else '%d' % j
        break

如果开头匹配,这将为长句节省几次迭代。

尝试:

现在,如果您想彻底重写O(n)传递的整个过程(这是文本搜索可能得到的最好的),那么您可以使用"Trie“数据结构。它的工作方式是将你所有的文本,每个字一次一个字符,插入到一个树的变体中。这是O(n)插入,其中n是输入数据中的字符数。然后,对于每一个搜索词,你都会逐字逐句地找到与之匹配的单词。这是O(n),其中n是每个搜索项中的字符数,因此理论上影响比前一个要小得多。

我修改了在这里发布的代码(在公共域下):https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1

它大部分不是我自己的,但我修改了它以适应你的功能。

代码语言:javascript
复制
class Node(object):
    def __init__(self, character):
        self.character = character
        self.children = []
        # A flag to say whether or not we are at the end of our current word.
        self.finished = False
        # How many times this character appeared in the addition process
        self.count = 1

class Trie(object):
    def __init__(self):
        # Create a root node with a non-character attribute so it won't be confused
        # with any of the entries.
        self.root = Node(None)

    def add(self, word):
        # Set our current node to the start/root.
        current_node = self.root
        for char in word:
            in_child = False
            # Search for the character in the children of the present node
            for child in current_node.children:
                if child.character == char:
                    # We found it, increase the counter by 1 to keep track that another
                    # word has it as well
                    child.count += 1
                    # And point the node to the child that contains this char
                    current_node = child
                    in_child = True
                    break
            # We did not find it so add a new chlid
            if not in_child:
                new_node = Node(char)
                current_node.children.append(new_node)
                # And then point node to the new child
                current_node = new_node
        # Everything finished. Mark it as the end of a word.
        current_node.word_finished = True


    def find_term(self, term):
        """
        Check and return
          1. If the prefix exsists in any of the words we added so far
          2. If yes then how may words actually have the prefix
        """
        node = self.root
        # If the root node has no children, then return False.
        # Because it means we are trying to search in an empty trie
        if not self.root.children:
            return False, 0
        for char in term:
            char_not_found = True
            # Search through all the children of the present `node`
            for child in node.children:
                if child.character == char:
                    # We found the char existing in the child.
                    char_not_found = False
                    # Assign node as the child containing the char and break
                    node = child
                    break
            # Return False anyway when we did not find a char.
            if char_not_found:
                return False, 0
        # Well, we are here means we have found the prefix. Return true to indicate that
        # And also the counter of the last node. This indicates how many words have this
        # prefix
        return True, node.count

def textQueries(sentences, queries):
    trie = Trie()
    sentences_list = [s.split(' ') for s in sentences]
    for sentence in sentences_list:
        for word in sentence:
            trie.add(word)

    for words in queries:
        words_list = words.split(' ')
        results_list = []
        for word in words_list:
            results = trie.find_term(word)
            if results[0]:
                results_list.append(results[1])
        if results_list:
            print results_list
        else: print -1

这将返回:

代码语言:javascript
复制
Venkee Enterprises:>python test.py 
[2]
[1]

我输入的数据如下:

代码语言:javascript
复制
sentences = ["Hello world, this is a sentence.", "This is also a sentence.", "This, however, is an incomplete"]
queries = ["sentence", "Hello"]

因为它能够找到两个与查询“句子”匹配的单词和一个匹配"Hello“的单词。

希望这能有所帮助。

票数 4
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/217904

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档