How can I count phrases in text and extract the most frequent ones?

Stack Overflow user
Asked 2022-05-19 13:44:33

I have a dataset df with a column text:

text
the main goal is to develop a smart calendar
the main goal is to develop a smart calendar
the main goal is to develop a chat bot
it is clear that the main goal is to develop a product
ai products for department A
launching ai products for department B

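As you can see, the texts share many common phrases. How can I detect them and extract the most frequent ones (for example, those that occur 2 or more times)? The desired output is: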

text                                cnt
the main goal is to develop          4
ai products for department           2
ai products for department           2

The phrase "the main goal is to develop" is captured, while shorter ones such as "the main goal is to" are not, because it is the longest of them.

How could I do that?

1 Answer

Stack Overflow user

Answered 2022-05-19 15:15:02

You can use n-grams to do this. The main idea is:

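  1. For each sentence, get its n-grams. For example, the 2-grams (bigrams) of "the main goal is to develop a smart calendar" are: ['the main', 'main goal', 'goal is', 'is to', 'to develop', 'develop a', 'a smart', 'smart calendar']
  2. Collect all of these n-gram phrases, with n ranging from 1 to len(sentence), where the length is counted in words.
  3. Count their occurrences, storing each phrase's count and length in a dictionary.
  4. Sort the results by count and then by length.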

In Python, you can do it like this:

from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def get_ngram(word_list, n):
    # All contiguous n-word phrases in word_list.
    return [' '.join(word_list[i:i + n]) for i in range(len(word_list) - n + 1)]


def get_ngram_pieces(text):
    # Collect every n-gram of every sentence, for n from 1 to the sentence length.
    text_pieces = []
    for sentence in text:
        word_list = sentence.split()
        for n in range(1, len(word_list) + 1):
            text_pieces.extend(get_ngram(word_list, n))
    return text_pieces


def get_count(text_pieces):
    # Map each phrase to (occurrence count, length in words).
    counts = Counter(text_pieces)
    return {phrase: (cnt, len(phrase.split())) for phrase, cnt in counts.items()}


all_pieces = get_ngram_pieces(text)
phrase_dict = get_count(all_pieces)
# Sort by count, then by phrase length, both in descending order.
phrase_dict_sorted = dict(sorted(phrase_dict.items(), key=lambda item: item[1], reverse=True))

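The top ten entries of phrase_dict_sorted, shown as phrase,count,length (entries with the same count and length may come out in a different order):
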
is,5,1
the main goal is to develop a,4,7
the main goal is to develop,4,6
main goal is to develop a,4,6
goal is to develop a,4,5
the main goal is to,4,5
main goal is to develop,4,5
the main goal is,4,4
goal is to develop,4,4
is to develop a,4,4
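
The sorted list above still contains every sub-phrase of a frequent phrase, whereas the question asks only for the longest one. One possible way to narrow it down, sketched here under assumptions rather than taken from the original answer, is to keep a phrase only if no longer phrase containing it has the same count. The names count_ngrams, maximal_phrases, min_count and min_words are hypothetical and introduced purely for illustration.

```python
from collections import Counter

text = ['the main goal is to develop a smart calendar',
        'the main goal is to develop a smart calendar',
        'the main goal is to develop a chat bot',
        'it is clear that the main goal is to develop a product',
        'ai products for department A',
        'launching ai products for department B']


def count_ngrams(sentences):
    # Count every contiguous word n-gram of every sentence.
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, len(words) + 1):
            for i in range(len(words) - n + 1):
                counts[' '.join(words[i:i + n])] += 1
    return counts


def maximal_phrases(counts, min_count=2, min_words=2):
    # Assumed thresholds: at least min_count occurrences and min_words words.
    frequent = {p: c for p, c in counts.items()
                if c >= min_count and len(p.split()) >= min_words}
    result = {}
    for phrase, cnt in frequent.items():
        # Pad with spaces so that matching respects word boundaries.
        padded = f' {phrase} '
        covered = any(
            other != phrase and c == cnt and padded in f' {other} '
            for other, c in frequent.items()
        )
        if not covered:
            result[phrase] = cnt
    return result


result = maximal_phrases(count_ngrams(text))
for phrase, cnt in result.items():
    print(phrase, cnt)
```

On the sample data this keeps "the main goal is to develop a" (4), "the main goal is to develop a smart calendar" (2) and "ai products for department" (2). The first phrase ends in "a" because that word follows "develop" in all four sentences, so it is slightly longer than the phrase in the question's desired output. If the texts live in a DataFrame column as in the question, df['text'].tolist() could be passed in place of the list.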
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/72305698
