下面的代码接受两个参数。一组句子和一组查询。每个查询都有一个或多个单词。下面的函数为每个包含该句子中所有单词的查询打印出句子索引。
这是功能代码。但是,由于它是如何编写的,所以没有对大型输入进行优化。
我们如何使它更加优化运行时呢?任何想法都将不胜感激。
def textQueries(sentences, queries):
sentences_sets = [set(s.split(' ')) for s in sentences]
for i, words in enumerate(queries):
results = ''
for j, s in enumerate(sentences_sets):
exists = True
for word in words.split(' '):
if word not in s:
exists = False
if exists:
results += ' %d' % j if results else '%d' % j
if results:
print results
else: print -1发布于 2019-04-23 05:11:43
欢迎来到代码评审!
因此,我假设queries是我们正在寻找的单词列表,其中任何一个都是成功的。我还假设句子是一个字符串列表,每个元素都可以细分成一个单词列表。
我注意到您现有的代码不会看到“这是一个测试”。因为这段时间里有“测试”这个词。您可能需要将逗号、句点、括号等分隔开才能正常工作,但这可能是出于设计。我不太确定。
我看到的第一件可以改进的是你不需要一个旗子。而不是设置exists = False,然后继续循环,只要添加结果,如果它在那里,然后只是中断。如果已经找到了,则不需要搜索整个句子,如果要在循环中执行一个操作,则不需要标记。
所以:
for word in words.split(' '):
if word in s:
results += ' %d' % j if results else '%d' % j
break如果开头匹配,这将为长句节省几次迭代。
现在,如果您想彻底重写O(n)传递的整个过程(这是文本搜索可能得到的最好的),那么您可以使用"Trie“数据结构。它的工作方式是将你所有的文本,每个字一次一个字符,插入到一个树的变体中。这是O(n)插入,其中n是输入数据中的字符数。然后,对于每一个搜索词,你都会逐字逐句地找到与之匹配的单词。这是O(n),其中n是每个搜索项中的字符数,因此理论上影响比前一个要小得多。
我修改了在这里发布的代码(在公共域下):https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1
它大部分不是我自己的,但我修改了它以适应你的功能。
class Node(object):
def __init__(self, character):
self.character = character
self.children = []
# A flag to say whether or not we are at the end of our current word.
self.finished = False
# How many times this character appeared in the addition process
self.count = 1
class Trie(object):
def __init__(self):
# Create a root node with a non-character attribute so it won't be confused
# with any of the entries.
self.root = Node(None)
def add(self, word):
# Set our current node to the start/root.
current_node = self.root
for char in word:
in_child = False
# Search for the character in the children of the present node
for child in current_node.children:
if child.character == char:
# We found it, increase the counter by 1 to keep track that another
# word has it as well
child.count += 1
# And point the node to the child that contains this char
current_node = child
in_child = True
break
# We did not find it so add a new chlid
if not in_child:
new_node = Node(char)
current_node.children.append(new_node)
# And then point node to the new child
current_node = new_node
# Everything finished. Mark it as the end of a word.
current_node.word_finished = True
def find_term(self, term):
"""
Check and return
1. If the prefix exsists in any of the words we added so far
2. If yes then how may words actually have the prefix
"""
node = self.root
# If the root node has no children, then return False.
# Because it means we are trying to search in an empty trie
if not self.root.children:
return False, 0
for char in term:
char_not_found = True
# Search through all the children of the present `node`
for child in node.children:
if child.character == char:
# We found the char existing in the child.
char_not_found = False
# Assign node as the child containing the char and break
node = child
break
# Return False anyway when we did not find a char.
if char_not_found:
return False, 0
# Well, we are here means we have found the prefix. Return true to indicate that
# And also the counter of the last node. This indicates how many words have this
# prefix
return True, node.count
def textQueries(sentences, queries):
trie = Trie()
sentences_list = [s.split(' ') for s in sentences]
for sentence in sentences_list:
for word in sentence:
trie.add(word)
for words in queries:
words_list = words.split(' ')
results_list = []
for word in words_list:
results = trie.find_term(word)
if results[0]:
results_list.append(results[1])
if results_list:
print results_list
else: print -1这将返回:
Venkee Enterprises:>python test.py
[2]
[1]我输入的数据如下:
sentences = ["Hello world, this is a sentence.", "This is also a sentence.", "This, however, is an incomplete"]
queries = ["sentence", "Hello"]因为它能够找到两个与查询“句子”匹配的单词和一个匹配"Hello“的单词。
希望这能有所帮助。
https://codereview.stackexchange.com/questions/217904
复制相似问题