HDBSCAN has a flag for caching its clustering data, documented as follows:
prediction_data : boolean, optional
Whether to generate extra cached data for predicting labels or membership vectors for new unseen points later. If you wish to persist the clustering object for later re-use you probably want to set this to True. (default False)
Now I see that the following folder structure has been created at the specified location:
>joblib
...>hdbscan
......>hdbscan_
.........>_hdbscan_boruvka_balltree
............>f1bd5f351764560c3532dbe30f273481
...............metadata.json
...............output.pkl
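............func_code.py
As the HDBSCAN documentation suggests, we should be able to use these files (presumably pickle files) as persistent storage and reuse them later to find cluster labels for new data points. But I cannot find a way to do that.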
Posted on 2021-09-07 05:09:49
I came here while searching for how to cache to memory in HDBSCAN. My initial search led me to https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html, where I found the following code:
from joblib import Memory
location = './cachedir'
memory = Memory(location, verbose=0)
But when I used it, I got a
DeprecationWarning: The 'cachedir' parameter has been deprecated in version 0.12 and will
be removed in version 0.14.
You provided "cachedir='/tmp/joblib'", use "location='/tmp/joblib'" instead.
So the updated code for caching with joblib in HDBSCAN is:
from joblib import Memory
location='/tmp/joblib'
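memory = Memory(location, verbose=0)
Posted on 2020-08-31 05:14:43
The parameter you want to look at is memory=. If you call HDBSCAN again with the same memory= argument and change only (say) min_cluster_size, with min_samples explicitly held fixed between the two runs, it will save you the recomputation time.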
https://stackoverflow.com/questions/62704921