Question: K-means clustering of text documents with h2o4gpu

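Stack Overflow user

Asked 2018-07-30 14:29:06

1 answer · 185 views · 0 followers · 0 votes

I am interested in clustering text documents with h2o4gpu. For reference, I followed this tutorial but changed the code to use h2o4gpu.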

from sklearn.feature_extraction.text import TfidfVectorizer
import h2o4gpu

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 2
model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                       max_iter=100, n_init=1)
model.fit(X)

However, running the code sample above produces the following error:

Traceback (most recent call last):
  File "dev.py", line 20, in <module>
    model.fit(X)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 810, in fit
    res = self.model.fit(X, y)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/kmeans.py", line 303, in fit
    X_np, _, _, _, _, _ = _get_data(X, ismatrix=True)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 119, in _get_data
    data, ismatrix=ismatrix, dtype=dtype, order=order)
  File "/home/greg/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/utils.py", line 79, in _to_np
    outdata = outdata.astype(dtype, copy=False, order=nporder)
ValueError: setting an array element with a sequence.

I searched for h2o4gpu.feature_extraction.text.TfidfVectorizer but could not find it in h2o4gpu. That said, is there a way to fix this?

Software versions

CUDA 9.0, V9.0.176
cuDNN 7.1.3
Python 3.6.4
h2o4gpu 0.2.0
scikit-learn 0.19.1

k-means

h2o4gpu

python

python-3.x

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-07-30 14:59:40

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

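returns a SciPy sparse matrix object, not a dense array.

Currently, H2O4GPU only supports a dense representation for KMeans. This means you have to convert X to a 2D Python list or a 2D NumPy array, filling the missing elements with 0.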
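To illustrate what "filling the missing elements with 0" means in practice, here is a minimal sketch that uses only NumPy and SciPy. The helper name `ensure_dense` is hypothetical (it is not part of h2o4gpu), and the tiny 2x3 matrix is made-up data standing in for a TF-IDF result.

```python
import numpy as np
from scipy import sparse


def ensure_dense(X):
    """Return X as a NumPy array, densifying SciPy sparse input.

    Hypothetical helper for illustration; not an h2o4gpu function.
    """
    if sparse.issparse(X):
        # toarray() writes 0.0 into every position the sparse matrix does not store.
        return X.toarray()
    return np.asarray(X)


# Made-up 2x3 sparse matrix with only two stored values.
X_sparse = sparse.csr_matrix(([0.5, 0.7], ([0, 1], [0, 2])), shape=(2, 3))
X_dense = ensure_dense(X_sparse)
print(X_dense)
```

A guard like this lets the same call site accept either sparse or dense input before handing the data to an estimator that only understands dense arrays.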

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
# Convert the sparse TF-IDF matrix to a dense 2D NumPy array.
X_dense = X.toarray()

true_k = 2
model = h2o4gpu.KMeans(n_gpus=1, n_clusters=true_k, init='k-means++',
                       max_iter=100, n_init=1)
model.fit(X_dense)

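should do the trick. This is not an optimal solution for NLP, since the dense array may need far more memory than the sparse matrix, but sparse support for KMeans is not on our roadmap yet.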
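To get a feel for the memory cost mentioned above, the following sketch compares the storage of a CSR matrix with the size of the equivalent dense array. The matrix is synthetic (1,000 rows, 5,000 columns, 1% non-zero), chosen here only as an assumed stand-in for a TF-IDF matrix, and the helper names `csr_nbytes` and `dense_nbytes` are hypothetical.

```python
import numpy as np
from scipy import sparse


def csr_nbytes(m):
    """Bytes held by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


def dense_nbytes(m):
    """Bytes a dense array of the same shape and dtype would need."""
    rows, cols = m.shape
    return rows * cols * m.dtype.itemsize


# Synthetic stand-in for a TF-IDF matrix (assumed sizes, not real data).
X = sparse.random(1000, 5000, density=0.01, format="csr",
                  dtype=np.float64, random_state=0)

sparse_bytes = csr_nbytes(X)
dense_bytes = dense_nbytes(X)
print(f"sparse: {sparse_bytes / 1e6:.1f} MB, dense: {dense_bytes / 1e6:.1f} MB")
```

Running the same two helpers on your own `X` before calling `toarray()` gives a quick estimate of whether the dense copy will fit in memory.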

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/51596291
