I am using the code below to find n-grams.
from nltk.util import ngrams

booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
    tokens = line.split(" ")
    output = list(ngrams(tokens, 3))
    temp = list()
    for x in output:  # Adding n-grams
        temp.append(' '.join(x))
    booksWithNGrams.append(temp)
print(booksWithNGrams)

The output looks like this:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin', 'by Colin Adams'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to', 'Organizing to Compete', 'to Compete in', 'Compete in an', 'in an Age', 'an Age of', 'Age of Disruption'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path', 'Surprising Path to', 'Path to Greater', 'to Greater Creativity']]

However, I don't want more than three n-grams per title. In other words, I want the output to look like this:
[['Zombies and Calculus', 'and Calculus by', 'Calculus by Colin'], ['Zone to Win:', 'to Win: Organizing', 'Win: Organizing to'], ['Zig Zag: The', 'Zag: The Surprising', 'The Surprising Path']]

How can I achieve this?
Answered 2021-08-05 09:56:55
Here is what you can do.
Logic: count the n-grams as you loop over them and break once i > 2, so that only the items at i = 0, 1, 2 are kept.
from nltk.util import ngrams

booksAfterRemovingStopWords = ['Zombies and Calculus by Colin Adams', 'Zone to Win: Organizing to Compete in an Age of Disruption', 'Zig Zag: The Surprising Path to Greater Creativity']
booksWithNGrams = list()
for line_no, line in enumerate(booksAfterRemovingStopWords):
    tokens = line.split(" ")
    output = list(ngrams(tokens, 3))
    temp = list()
    for i, x in enumerate(output):  # Adding n-grams
        if i > 2:  # stop after the first three n-grams
            break
        temp.append(' '.join(x))
    booksWithNGrams.append(temp)

https://stackoverflow.com/questions/68664231
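The same "keep only the first three" idea can also be written without an explicit counter. The following is a minimal sketch, not the answerer's code. It assumes you are happy to build the trigrams with plain `zip` instead of `nltk`, which gives the same trigrams here because no padding is used. `first_ngrams` and `books` are hypothetical names introduced only for this illustration.

```python
from itertools import islice


def first_ngrams(text, n=3, limit=3):
    """Return at most `limit` space-joined n-grams from `text` (hypothetical helper)."""
    tokens = text.split(" ")
    # Zipping n shifted views of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    # islice stops after `limit` items, so the remaining n-grams are never built.
    return [" ".join(gram) for gram in islice(grams, limit)]


books = [
    'Zombies and Calculus by Colin Adams',
    'Zone to Win: Organizing to Compete in an Age of Disruption',
    'Zig Zag: The Surprising Path to Greater Creativity',
]
booksWithNGrams = [first_ngrams(line) for line in books]
print(booksWithNGrams)
```

If you prefer to stay with the original loop, slicing the list (`output[:3]`) before joining should have the same effect as the counter and `break`.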