I have a dataset df with a column text:
text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
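launching ai products for department B
As you can see, the texts share many common phrases. How can I detect them and extract the most frequent ones (for example, those that occur 2 or more times)? The desired output is:
text                           cnt
the main goal is to develop    4
ai products for department     2
"the main goal is to develop" is captured while "the main goal is to" and similar shorter phrases are not, because it is the longest of them.
How can I do this?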
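Posted on 2022-05-19 15:15:02
You can do this with n-grams. The main idea is to split each sentence into all of its n-grams and count how often each one occurs. For example, the 2-grams of the first sentence are:
['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar']
n ranges from 1 to the number of words in the sentence. In Python, you can do it like this: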
text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']

def get_ngram(word_list, n):
    # All contiguous runs of n words, joined back into phrases
    return [' '.join(word_list[i:i+n]) for i in range(len(word_list) - n + 1)]

def get_ngram_pieces(text):
    # Collect the n-grams of every sentence for n = 1 .. number of words
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            text_pieces.extend(get_ngram(word_list, n))
    return text_pieces

def get_count(text_pieces):
    # Map each distinct phrase to (occurrence count, number of words)
    keys = set(text_pieces)
    phrase_dict = {}
    for key in keys:
        phrase_dict[key] = (text_pieces.count(key), len(key.split()))
    return phrase_dict

all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Sort by count first, then by phrase length, both descending
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))

The top ten entries of phrase_dict_sorted (phrase, count, number of words) are:
is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
https://stackoverflow.com/questions/72305698
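The sorted counts above still contain every sub-phrase of each common phrase. Below is a minimal sketch of one way to narrow them down to the longest repeated phrases. It is not part of the original answer: the names count_phrases, maximal_phrases, min_count and min_words are made up for illustration, and the filtering rule (drop a phrase when a longer phrase with the same count contains it) is an assumption about what "longest" should mean.

```python
from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']

def ngrams(words, n):
    """Return all contiguous n-word phrases in a list of words."""
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_phrases(sentences):
    """Count every n-gram (n = 1 .. sentence length) across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            counts.update(ngrams(words, n))
    return counts

def maximal_phrases(counts, min_count=2, min_words=2):
    """Keep frequent phrases not contained in a longer phrase with the same count.

    min_count and min_words are illustrative thresholds (assumptions).
    """
    frequent = {p: c for p, c in counts.items()
                if c >= min_count and len(p.split()) >= min_words}
    result = {}
    for phrase, cnt in frequent.items():
        padded = f' {phrase} '  # pad so matches respect word boundaries
        covered = any(other != phrase and other_cnt == cnt
                      and padded in f' {other} '
                      for other, other_cnt in frequent.items())
        if not covered:
            result[phrase] = cnt
    return result

result = maximal_phrases(count_phrases(text))
for phrase, cnt in sorted(result.items(), key=lambda kv: -kv[1]):
    print(phrase, cnt)
```

On the sample data this keeps "the main goal is to develop a" (4), "ai products for department" (2) and, because the first two sentences are identical, the full sentence "the main goal is to develop a smart calendar" (2). Note that it keeps the trailing "a", which occurs in all four sentences, so it differs slightly from the output shown in the question.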