Q: Given a cluster of documents, compute the similarity between a corpus and the cluster

Stack Overflow user

Asked 2018-06-18 22:06:03

Answers: 1 · Views: 677 · Followers: 0 · Votes: 2

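I am working on a similarity ranking by computing the distance between each document in a corpus and a cluster. The cluster is itself given as a list of documents. The trouble is that I cannot find an appropriate way to compute the centroid of the cluster so that I can calculate the similarity. I tried using the mean of the cluster's tfidf matrix, but the results were poor.

For example, my cluster is: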

['Line a baking pan with a sheet of parchment paper.',
 'Line the cake pan with parchment paper.',
 'Line the bottom with parchment paper.',
 'Line a baking pan with parchment paper.'
]

My corpus contains the following 3 documents:

['Add vinegar and sugar.',
 'Remove pan from heat and let stand 5 minutes.',
 'Line the pan with parchment paper.'
]

I would like to compute the similarity between each document and the cluster, which might produce a result like this:

[0.1, 0.1, 0.8]

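Do you have any suggestions? I tried representing the cluster and the corpus documents as tfidf matrices, but computing the similarity between the two matrices does not seem to give the desired result. I also tried LSI, but what I want to rank is the corpus, not the cluster documents, which again forces me to find a centroid representation of the cluster.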

python

pandas

numpy

nltk

tf-idf

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-06-19 19:04:05

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

cluster = ['Line a baking pan with a sheet of parchment paper.',
            'Line the cake pan with parchment paper.',
            'Line the bottom with parchment paper.',
            'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Train tfidf on cluster
tfidf = TfidfVectorizer()
tfidf_cluster = tfidf.fit_transform(cluster)

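# Transform the corpus using the vocabulary and IDF weights fitted on the cluster
tfidf_corpus = tfidf.transform(corpus)

# Cosine similarity: TfidfVectorizer L2-normalizes rows by default,
# so the dot product of two rows is their cosine similarity
cos_similarity = (tfidf_corpus @ tfidf_cluster.T).toarray()
# Average each corpus document's similarity over the cluster documents
avg_similarity = np.mean(cos_similarity, axis=1)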

cos_similarity
Out[271]: 
array([[0.        , 0.        , 0.        , 0.        ],
       [0.31452723, 0.36145869, 0.        , 0.43855558],
       [0.50673521, 0.8242027 , 0.7139548 , 0.70655744]])

avg_similarity
Out[272]: array([0.        , 0.27863537, 0.68786254])
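The question asks specifically about a cluster centroid, so it may help to see how the average above relates to one. The following is a minimal sketch, not part of the original answer, assuming the same `cluster` and `corpus` lists and scikit-learn's default `TfidfVectorizer` settings. The names `centroid`, `dot_with_centroid`, and `centroid_similarity` are illustrative. Because the dot product is linear, averaging the pairwise similarities should give the same numbers as taking the dot product of each corpus row with the mean cluster vector (an unnormalized centroid); normalizing the centroid as well gives a true cosine similarity to the centroid.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cluster = ['Line a baking pan with a sheet of parchment paper.',
           'Line the cake pan with parchment paper.',
           'Line the bottom with parchment paper.',
           'Line a baking pan with parchment paper.']

corpus = ['Add vinegar and sugar.',
          'Remove pan from heat and let stand 5 minutes.',
          'Line the pan with parchment paper.']

# Fit on the cluster, then project the corpus into the same feature space
tfidf = TfidfVectorizer()
tfidf_cluster = tfidf.fit_transform(cluster)
corpus_dense = tfidf.transform(corpus).toarray()

# Average pairwise similarity, as in the answer above
avg_similarity = (corpus_dense @ tfidf_cluster.toarray().T).mean(axis=1)

# Centroid (illustrative name): mean TF-IDF vector of the cluster, shape (1, n_features)
centroid = np.asarray(tfidf_cluster.mean(axis=0)).reshape(1, -1)

# Dot product with the unnormalized centroid matches the average similarity
dot_with_centroid = corpus_dense @ centroid.ravel()

# Cosine similarity to the centroid (the centroid is normalized too);
# cosine_similarity returns 0 for all-zero rows instead of dividing by zero
centroid_similarity = cosine_similarity(corpus_dense, centroid).ravel()

print(np.allclose(avg_similarity, dot_with_centroid))
print(centroid_similarity)
```

Both variants rank the corpus documents in the same order here, since every non-zero corpus row has unit length; the centroid cosine only rescales the scores by the centroid's norm.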

Votes: 0

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/50918092
