I am training an unsupervised learning model. The dataset has about 140,000 rows and 6 columns. The file is a CSV of 10,637 KB.
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib qt
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import AgglomerativeClustering
These are the libraries I imported.
Rev = pd.read_csv(r"Updated_Rev.csv")
labelEncoder = LabelEncoder()
labelEncoder.fit(Rev["Technology"])
Rev["Technology"] = labelEncoder.transform(Rev["Technology"])
One column holds strings, so it is label-encoded as shown above, although I may need to leave it out of training later.
train = Rev.iloc[:,:4]
clustering = AgglomerativeClustering()
clustering.fit(train)
This is the training file, so all rows are needed for training; I selected 4 of its columns. Running the code above raised the following error:
MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64
Points to note (the full traceback follows):
MemoryError Traceback (most recent call last)
<ipython-input-16-0f4f354e9aaf> in <module>
1 train = Rev.iloc[:,:4]
----> 2 clustering.fit(train)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in fit(self, X, y)
857 n_clusters=n_clusters,
858 return_distance=return_distance,
--> 859 **kwargs)
860 (self.children_,
861 self.n_connected_components_,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
353
354 def __call__(self, *args, **kwargs):
--> 355 return self.func(*args, **kwargs)
356
357 def call_and_shelve(self, *args, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in ward_tree(X, connectivity, n_clusters, return_distance)
232 stacklevel=2)
233 X = np.require(X, requirements="W")
--> 234 out = hierarchy.ward(X)
235 children_ = out[:, :2].astype(np.intp)
236
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in ward(y)
828
829 """
--> 830 return linkage(y, method='ward', metric='euclidean')
831
832
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in linkage(y, method, metric, optimal_ordering)
1054 'matrix looks suspiciously like an uncondensed '
1055 'distance matrix')
-> 1056 y = distance.pdist(y, metric)
1057 else:
1058 raise ValueError("`y` must be 1 or 2 dimensional.")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, *args, **kwargs)
2002 out = kwargs.pop("out", None)
2003 if out is None:
-> 2004 dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
2005 else:
2006 if out.shape != (m * (m - 1) // 2,):
MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64

Answered on 2022-01-29 11:35:11
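Agglomerative clustering does not scale to large datasets.
In general, the algorithm needs O(n^2) memory and O(n^3) runtime. The traceback shows where the memory goes: with the default Ward linkage, scikit-learn hands the data to SciPy's linkage, which calls pdist to allocate one float64 distance for every pair of rows. That single array is the 67.9 GiB request.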
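The size in the error message can be reproduced with a little arithmetic. The sketch below assumes 135,040 rows; that figure is not stated in the question but is inferred from the array shape in the error, since pdist allocates m * (m - 1) // 2 entries. The helper name `condensed_pdist_bytes` is made up for this illustration.

```python
def condensed_pdist_bytes(n_rows, itemsize=8):
    """Return (number of pairwise distances, bytes needed to store them).

    Mirrors the allocation in scipy's pdist: one value per unordered pair
    of rows, stored as float64 (8 bytes) by default.
    """
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs, n_pairs * itemsize


# 135_040 is an inferred row count (assumption), chosen because it
# reproduces the shape reported in the MemoryError.
n_pairs, n_bytes = condensed_pdist_bytes(135_040)
print(n_pairs)                     # 9117833280
print(round(n_bytes / 2**30, 1))   # 67.9 (GiB)
```

Doubling the number of rows roughly quadruples this figure, which is why adding RAM rarely helps for long.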
Either cluster a sample of the data or use a different algorithm.
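Both options might look roughly like the sketch below. It uses synthetic data as a stand-in for `Rev.iloc[:, :4]`, and the sample size, the number of clusters, and the choice of MiniBatchKMeans as the "different algorithm" are all assumptions made for illustration, not recommendations from the answer.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the 4 training columns (assumption: numeric data).
X = rng.normal(size=(20_000, 4))

# Option 1: run agglomerative clustering on a random sample only.
# 2,000 rows need about 2 million pairwise distances (~16 MB), not GiB.
sample_idx = rng.choice(len(X), size=2_000, replace=False)
X_sample = X[sample_idx]
agg = AgglomerativeClustering(n_clusters=2)
sample_labels = agg.fit_predict(X_sample)

# Option 2: switch to an algorithm whose memory use grows linearly with
# the number of rows. MiniBatchKMeans processes the data in small batches.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=1024, n_init=3, random_state=0)
all_labels = mbk.fit_predict(X)

print(sample_labels.shape, all_labels.shape)
```

Option 1 labels only the sampled rows. The remaining rows would need a separate assignment step, for example a classifier trained on the sample labels. Option 2 labels every row, but it produces k-means style clusters rather than a hierarchy.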
https://stackoverflow.com/questions/60522255