I'm using HDBSCAN inside a RAPIDS Docker container, rapidsai/rapidsai-core:0.18-cuda11.0-runtime-ubuntu18.04-py3.7.
import cudf
import cupy
from cuml.manifold import UMAP
import hdbscan
from sklearn.datasets import make_blobs
from cuml.experimental.preprocessing import StandardScaler
blobs, labels = make_blobs(n_samples=100000, n_features=10)
df_gpu = cudf.DataFrame(blobs)
scaler = StandardScaler()
cupy_scaled = scaler.fit_transform(df_gpu.values)
projector = UMAP(n_components=3, n_neighbors=2000)
cupy_projected = projector.fit_transform(cupy_scaled)
numpy_projected = cupy.asnumpy(cupy_projected)
clusterer = hdbscan.HDBSCAN(min_cluster_size=1000, prediction_data=True, gen_min_span_tree=True)  # , core_dist_n_jobs=1
clusterer.fit(numpy_projected)
I get an error. It goes away if I pass core_dist_n_jobs=1, but that makes the code slower:
TerminatedWorkerError                     Traceback (most recent call last)
<ipython-input> in <module>
      1 clusterer = hdbscan.HDBSCAN(min_cluster_size=1000, prediction_data=True, gen_min_span_tree=True)
----> 2 clusterer.fit(numpy_projected)

/opt/conda/envs/rapids/lib/python3.7/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    917              self._condensed_tree,
    918              self._single_linkage_tree,
--> 919              self._min_spanning_tree) = hdbscan(X, **kwargs)
    920
    921         if self.prediction_data:

/opt/conda/envs/rapids/lib/python3.7/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    613                 approx_min_span_tree,
    614                 gen_min_span_tree,
--> 615                 core_dist_n_jobs, **kwargs)
    616         else:  # Metric is a valid BallTree metric
    617             # Need heuristic to decide when to go to boruvka;

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353
    354     def call_and_shelve(self, *args, **kwargs):

/opt/conda/envs/rapids/lib/python3.7/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    276                                  leaf_size=leaf_size // 3,
    277                                  approx_min_span_tree=approx_min_span_tree,
--> 278                                  n_jobs=core_dist_n_jobs, **kwargs)
    279     min_spanning_tree = alg.spanning_tree()
    280     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1052
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    931             try:
    932                 if getattr(self._backend, 'supports_timeout', False):
--> 933                     self._output.extend(job.get(timeout=self.timeout))
    934                 else:
    935                     self._output.extend(job.get())

/opt/conda/envs/rapids/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

/opt/conda/envs/rapids/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/opt/conda/envs/rapids/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {EXIT(1)}
Is there a way to fix this while still keeping HDBSCAN fast?
Posted on 2021-03-15 22:31:57
Try setting min_samples to a value
In https://github.com/scikit-learn-contrib/hdbscan/issues/345#issuecomment-628749332, lmcinnes says, "If your min_cluster_size is large and min_samples is not set, you could run into issues. You could try setting min_samples to something smallish and see if that helps." I noticed that your code does not set min_samples.
https://stackoverflow.com/questions/66607544