Q: How can I count phrases in a text and extract the most frequent ones?

Stack Overflow user

Asked 2022-05-19 13:44:33

1 answer, 61 views, 0 followers, 0 votes

I have a dataset df with a column text:

text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
launching ai products for department B

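As you can see, the texts share many common phrases. How can I detect them and extract the most frequent ones (for example, those that occur 2 or more times)? The desired output is: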

text                                cnt
the main goal is to develop          4
ai products for department           2
ai products for department           2

The phrase "the main goal is to develop" is captured, while shorter ones such as "the main goal is to" are not, because it is the longest of them.

How can I do this?

python

python-3.x

dataframe

function

1 Answer

Stack Overflow user

Answered 2022-05-19 15:15:02

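You can do this with n-grams. The main idea is:

For each sentence, get its n-grams, e.g. the 2-grams (bigrams) of "the main goal is to develop a smart calendar": ['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar'].
Collect all of these n-gram phrases, with n ranging from 1 to len(sentence).
Count their occurrences, storing each phrase's count and length in a dictionary.
Sort the result by count and length.

In Python, you can do it like this: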

from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def get_ngram(word_list, n):
    # All contiguous runs of n words, joined back into phrases.
    return [' '.join(word_list[i:i+n]) for i in range(len(word_list) - n + 1)]


def get_ngram_pieces(text):
    # Collect every n-gram of every sentence, for n = 1 .. len(sentence).
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            text_pieces.extend(get_ngram(word_list, n))
    return text_pieces


def get_count(text_pieces):
    # Map each phrase to (occurrence count, length in words).
    # Counter counts in one pass; calling list.count() per key would be quadratic.
    counts = Counter(text_pieces)
    return {key: (cnt, len(key.split())) for key, cnt in counts.items()}


all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Sort by count first, then by length, both descending.
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))

The top ten entries of phrase_dict_sorted, shown as phrase,count,length (the order among ties may vary):

is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
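
The sorted list above still contains every shorter piece of each long phrase. The question asks for only the longest ones, so here is a minimal sketch of one way to filter them; it is not part of the original answer. The helper names count_ngrams and maximal_phrases and the parameters min_count and min_words are assumptions made for illustration. A phrase is kept only if no longer phrase contains it with at least the same count, and single words are dropped from the output.

```python
from collections import Counter

sentences = [
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a smart calendar',
    'the main goal is to develop a chat bot',
    'it is clear that the main goal is to develop a product',
    'ai products for department A',
    'launching ai products for department B',
]


def count_ngrams(sentences):
    # Count every contiguous word n-gram, for n = 1 .. sentence length.
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[' '.join(words[i:i + n])] += 1
    return counts


def maximal_phrases(counts, min_count=2, min_words=2):
    # Hypothetical helper: keep frequent phrases that are not "covered" by
    # a longer phrase occurring at least as often.
    frequent = {p: c for p, c in counts.items() if c >= min_count}
    kept = []
    for phrase, cnt in frequent.items():
        padded = f' {phrase} '  # pad so only whole-word matches count
        covered = any(
            other != phrase and other_cnt >= cnt and padded in f' {other} '
            for other, other_cnt in frequent.items()
        )
        if not covered and len(phrase.split()) >= min_words:
            kept.append((phrase, cnt))
    # Highest count first, then longest phrase, then alphabetical.
    return sorted(kept, key=lambda pc: (-pc[1], -len(pc[0].split()), pc[0]))


result = maximal_phrases(count_ngrams(sentences))
for phrase, cnt in result:
    print(f'{phrase},{cnt}')
```

On the sample data this keeps three phrases: "the main goal is to develop a" (4), "the main goal is to develop a smart calendar" (2) and "ai products for department" (2). The first one is one word longer than the phrase in the question's expected output, because the trailing "a" also occurs in all four sentences. The pairwise check is quadratic in the number of frequent phrases, which is fine for small inputs but would likely need a smarter index for a large corpus.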

Votes 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72305698
