Q: Converting a text corpus into a text document with vocabulary_id and the corresponding tfidf score

Stack Overflow user

Asked 2016-11-22 12:39:10

1 answer · 614 views · 0 followers · 0 votes

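I have a text corpus of 5 documents, each separated from the next by a newline (\n). I want to assign an id to every word in the documents and compute its tf-idf score. For example, suppose we have a text corpus named "corpus.txt" as follows: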

"Stack over flow text vectorization scikit python scipy sparse csr"

I compute the tf-idf using:

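from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Read the corpus: one document per line
with open("corpus.txt") as f:
    mylist = f.read().splitlines()
vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(mylist)
tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)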

The output is:

(0,12) 0.1234 #for 1st document
(1,8) 0.3456  #for 2nd  document
(1,4) 0.8976
(2,15) 0.6754 #for third document
(2,14) 0.2389
(2,3) 0.7823
(3,11) 0.9897 #for fourth document
(3,13) 0.8213
(3,5) 0.7722
(3,6) 0.2211
(4,7) 0.1100 # for fifth document
(4,10) 0.6690
(4,2) 0.0912
(4,9) 0.2345
(4,1) 0.1234

I converted this scipy.sparse.csr matrix into a list of lists to drop the document id and keep only the vocabulary_id and its corresponding tfidf score, using:

m = x_tfidf.tocoo()
mydata = {k: v for k, v in zip(m.col, m.data)} 
key_val_pairs = [str(k) + ":" + str(v) for k, v in mydata.items()]

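The problem is that the output lists each vocabulary_id with its tfidf score in ascending order of id, with no reference to the document it came from.

For example, for the corpus above, my current output (which I dumped to a text file using json) looks like this: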

1:0.1234
2:0.0912
3:0.7823
4:0.8976
5:0.7722
6:0.2211
7:0.1100
8:0.3456
9:0.2345
10:0.6690
11:0.9897
12:0.1234
13:0.8213
14:0.2389
15:0.6754

However, I want my text file to look like this:

12:0.1234
8:0.3456 4:0.8976
15:0.6754 14:0.2389 3:0.7823
11:0.9897 13:0.8213 5:0.7722 6:0.2211
7:0.1100 10:0.6690 2:0.0912 9:0.2345 1:0.1234

Any idea how to do this?

python

machine-learning

text-mining

tf-idf

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-11-23 05:11:28

I think this is what you need. Here, corpus is a collection of documents.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(corpus) # corpus is a collection of documents

print(vectorizer.vocabulary_) # vocabulary terms and their index
print(x) # tf-idf weight of each term in each document

This prints:

{'vectorization': 5, 'text': 4, 'over': 1, 'flow': 0, 'stack': 3, 'scikit': 2}
  (0, 2)    0.33195438857 # first document, word = scikit
  (0, 5)    0.33195438857 # word = vectorization
  (0, 4)    0.33195438857 # word = text
  (0, 0)    0.472376562969 # word = flow
  (0, 1)    0.472376562969 # word = over
  (0, 3)    0.472376562969 # word = stack
  (1, 0)    0.57735026919 # second document
  (1, 1)    0.57735026919
  (1, 3)    0.57735026919
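To see where these weights come from, here is a minimal sketch that recomputes them by hand. It assumes TfidfVectorizer's default settings (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row) and uses plain whitespace splitting as the tokenizer, which happens to match the vectorizer's tokens for this small corpus. The names docs, df and tfidf are just for illustration.

```python
import math
from collections import Counter

corpus = ["stack over flow stack over flow text vectorization scikit", "stack over flow"]

# Term counts per document (whitespace tokenization is an assumption)
docs = [Counter(doc.split()) for doc in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in doc)

def tfidf(doc):
    """Return {term: weight} for one document."""
    # Raw weight = term count * smoothed idf
    raw = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, tf in doc.items()}
    # L2-normalise so each document vector has unit length
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(doc) for doc in docs]
print(f"{weights[0]['stack']:.4f} {weights[0]['scikit']:.4f} {weights[1]['stack']:.4f}")
# prints: 0.4724 0.3320 0.5774
```

Terms that occur in both documents get idf 1, so in the first document their weight is driven by the count of 2, while terms unique to that document get the higher idf but only a count of 1.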

With this information, you can represent the documents in the format you want as follows:

cx = x.tocoo()
doc_id = -1
for i,j,v in zip(cx.row, cx.col, cx.data):
    if doc_id == -1:
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    else:
        if doc_id != i:
            print()
        print(str(j) + ':' + "{:.4f}".format(v), end=' ')
    doc_id = i

This prints:

2:0.3320 5:0.3320 4:0.3320 0:0.4724 1:0.4724 3:0.4724 
0:0.5774 1:0.5774 3:0.5774
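If you would rather collect the lines and write them to a file than print them, the same grouping can be sketched as a small helper. This is only one possible approach: format_rows and its n_docs parameter are hypothetical names, and the function works on any three parallel sequences of row index, column index and value, such as cx.row, cx.col and cx.data from the loop above. The sample values are copied from the printed matrix.

```python
def format_rows(rows, cols, vals, n_docs=None):
    """Return one 'vocabulary_id:score ...' string per document."""
    per_doc = {}
    for i, j, v in zip(rows, cols, vals):
        # Group entries by document id, keeping the order in which they arrive
        per_doc.setdefault(int(i), []).append(f"{int(j)}:{float(v):.4f}")
    if n_docs is None:
        # Without an explicit count, go up to the largest document id seen
        n_docs = max(per_doc, default=-1) + 1
    # A document with no non-zero weights becomes an empty line
    return [" ".join(per_doc.get(i, [])) for i in range(n_docs)]

# Sample triples taken from the output shown earlier
rows = [0, 0, 0, 0, 0, 0, 1, 1, 1]
cols = [2, 5, 4, 0, 1, 3, 0, 1, 3]
vals = [0.33195438857, 0.33195438857, 0.33195438857,
        0.472376562969, 0.472376562969, 0.472376562969,
        0.57735026919, 0.57735026919, 0.57735026919]

lines = format_rows(rows, cols, vals)
print("\n".join(lines))
# prints:
# 2:0.3320 5:0.3320 4:0.3320 0:0.4724 1:0.4724 3:0.4724
# 0:0.5774 1:0.5774 3:0.5774
```

With the matrix from the answer, the call would be format_rows(cx.row, cx.col, cx.data, n_docs=cx.shape[0]), and the returned list can be joined with newlines and written to a text file. Passing n_docs keeps the line number equal to the document number even when a document has no terms in the vocabulary.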

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/40742105
