
Q: Is there a way to calculate cosine similarity between a set of documents in Python?

Stack Overflow user

Asked 2022-04-26 16:31:45

Answers: 1 · Views: 58 · Followers: 0 · Votes: 1

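I'm trying to calculate the cosine similarity between a set of documents. I'm using the code below, and it works fine, but it sorts the results in descending order. Is there a way to get the results in the order in which the documents were inserted for comparison? Or is there another way to do this? Thanks in advance, everyone.

This is the code I'm using: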

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances


documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.',
             'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
             'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
             'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.']
documents_df = pd.DataFrame(documents, columns=['documents'])

# removing special characters and stop words from the text
stop_words_l = stopwords.words('english')
documents_df['documents_cleaned'] = documents_df.documents.apply(lambda x: " ".join(
    re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in x.split() if
    re.sub(r'[^a-zA-Z]', ' ', w).lower() not in stop_words_l))

tfidfvectoriser = TfidfVectorizer()
tfidfvectoriser.fit(documents_df.documents_cleaned)
tfidf_vectors = tfidfvectoriser.transform(documents_df.documents_cleaned)

pairwise_similarities = np.dot(tfidf_vectors, tfidf_vectors.T).toarray()
pairwise_differences = euclidean_distances(tfidf_vectors)

def most_similar(doc_id, similarity_matrix, matrix):
    print(similarity_matrix)
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\n')
    print('Similar Documents:')
    if matrix == 'Cosine Similarity':
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]
    elif matrix == 'Euclidean Distance':
        similar_ix = np.argsort(similarity_matrix[doc_id])
    for ix in similar_ix:
        if ix == doc_id:
            continue
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix} : {similarity_matrix[doc_id][ix]}')

most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')

This is the output:

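Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.

Similar Documents:

Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.22860560787391593

Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity : 0.22581304743529423

Document: Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement
Cosine Similarity : 0.15314340308039842

Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity : 0.12407396777398046

Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity : 0.04978528121489196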
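A note on the similarity computation in the code above: it takes the dot product of the TF-IDF matrix with its transpose rather than calling cosine_similarity. The two agree because TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the dot product of two rows is already their cosine. The short check below is only a sketch; the sample_docs corpus is made up for illustration and is not part of the original question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, used only to compare the two computations.
sample_docs = [
    'machine learning algorithms learn from data',
    'software engineering builds reliable programs',
    'machine learning uses data and statistics',
]

tfidf = TfidfVectorizer().fit_transform(sample_docs)

# Dot products of L2-normalized rows, as in the question's code.
dot_products = (tfidf @ tfidf.T).toarray()

# Explicit cosine similarity from scikit-learn.
cosines = cosine_similarity(tfidf)

print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with norm=None, the rows would no longer have unit length and the plain dot product would stop matching the cosine similarity.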

python

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-04-26 17:08:35

I think this is what you're looking for:

def most_similar(doc_id, similarity_matrix, matrix):
    print(similarity_matrix)
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\n')
    print('Similar Documents:')
    # Walk the row in insertion order instead of sorting it with np.argsort,
    # so the results follow the order of the `documents` list. The same loop
    # works for both 'Cosine Similarity' and 'Euclidean Distance'.
    for ix, score in enumerate(similarity_matrix[doc_id]):
        if ix == doc_id:  # skip the query document itself
            continue
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix} : {score}')

most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')
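The same idea can be tried in isolation. The sketch below is not the asker's pipeline: the helper name similarities_in_input_order and the toy_docs corpus are made up for illustration, and scikit-learn's built-in English stop-word list stands in for the NLTK list used in the question. It returns (index, score) pairs without sorting, so the order matches the input list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarities_in_input_order(docs, doc_id):
    """Return (index, cosine similarity) pairs for every other document,
    in the same order as `docs` (no sorting is applied)."""
    vectors = TfidfVectorizer(stop_words='english').fit_transform(docs)
    row = cosine_similarity(vectors)[doc_id]
    return [(ix, float(score)) for ix, score in enumerate(row) if ix != doc_id]


# Hypothetical toy corpus, made up for this example.
toy_docs = [
    'machine learning algorithms learn from data',
    'software engineering builds reliable programs',
    'machine learning uses data and statistics',
]

for ix, score in similarities_in_input_order(toy_docs, 0):
    print(f'Document {ix}: {score:.4f}')
```

If a ranking is also needed later, the returned pairs can be sorted separately by score while this list keeps the original order.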

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's dedicated IT-domain engine.

Original link:

https://stackoverflow.com/questions/72017489
