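I'm plotting a set of text documents in 2D and I've noticed some outliers. I'd like to find out which documents those outliers are. I start from the raw text and then use the TfidfVectorizer built into SKLearn.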
vectorizer = TfidfVectorizer(max_df=0.5, max_features=None,
                             min_df=2, stop_words='english',
                             use_idf=True, lowercase=True)
corpus = make_corpus(root)
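X = vectorizer.fit_transform(corpus)
To reduce this to 2D, I'm using TruncatedSVD.
reduced_data = TruncatedSVD(n_components=2).fit_transform(X)
If I want to find out which text document has the highest second principal component (the y-axis), how would I do that?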
Answered on 2017-04-06 21:11:38
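So, as I understand it, you want to know which document has the largest value along a particular principal component. Here's a toy example I came up with: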
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
corpus = [
'this is my first corpus',
'this is my second corpus which is longer than the first',
'here is yet another one, but it is brief',
'and watch out for number four chuggin along',
'blah blah blah my final sentence yada yada yada'
]
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True, lowercase=True)
# first get TFIDF matrix
X = vectorizer.fit_transform(corpus)
# second compress to two dimensions
svd = TruncatedSVD(n_components=2).fit(X)
reduced = svd.transform(X)
# now, find the doc with the highest 2nd prin comp
corpus[np.argmax(reduced[:, 1])]
Which produces:
'and watch out for number four chuggin along'
https://stackoverflow.com/questions/43263837
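Since the original question is about spotting outliers, it may help to look at more than the single top document. The following is only a minimal sketch on the same toy corpus, not part of the original answer: `top_k`, `order`, `extreme` and the fixed `random_state=0` are assumptions added for illustration. Because the sign of an SVD component is a matter of convention, it also ranks documents by the absolute value of the second component, which is often closer to what "outlier" means on a plot.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'this is my first corpus',
    'this is my second corpus which is longer than the first',
    'here is yet another one, but it is brief',
    'and watch out for number four chuggin along',
    'blah blah blah my final sentence yada yada yada',
]

X = TfidfVectorizer(stop_words='english').fit_transform(corpus)

# random_state is fixed only so repeated runs give the same projection
# (an assumption for this sketch; the original code does not set it)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

top_k = 3  # hypothetical: how many documents to inspect

# document indices ordered by descending second component (the y-axis)
order = np.argsort(reduced[:, 1])[::-1][:top_k]

# document indices ordered by descending distance from zero on the y-axis
extreme = np.argsort(np.abs(reduced[:, 1]))[::-1][:top_k]

for i in order:
    print(i, round(float(reduced[i, 1]), 3), corpus[i])
```

With a real corpus, the indices in `order` or `extreme` can be used to look up the file names the documents came from, provided those names were kept in the same order as the corpus.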