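When I apply OneHotEncoder through a ColumnTransformer on this dataset, the result comes out in compressed sparse row format. After encoding, I want to split the data with train_test_split, but that raises the following error: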
Singleton array array(<32561x105 sparse matrix of type '<class 'numpy.float64'>'
with 394963 stored elements in Compressed Sparse Row format>,
dtype=object) cannot be considered a valid collection.
First, I handle the missing values like this:
import numpy as np
from sklearn.impute import SimpleImputer
# Nominal columns: fill missing values with the most frequent value
imputer_nominal = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
# Numerical columns: fill missing values with the column mean
imputer_numerical = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer_nominal.fit(x[:,[1,3,5,6,7,8,9,13]])
x[:,[1,3,5,6,7,8,9,13]] = imputer_nominal.transform(x[:,[1,3,5,6,7,8,9,13]])
imputer_numerical.fit(x[:,[0,2,4,10,11,12]])
x[:,[0,2,4,10,11,12]] = imputer_numerical.transform(x[:,[0,2,4,10,11,12]])
Then I encode the data:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [1,3,5,6,7,8,9,13])], remainder = 'passthrough')
x = np.array(ct.fit_transform(x))
When I print the numpy array 'x', it appears to be in compressed sparse row format:
(0, 6) 1.0
(0, 17) 1.0
(0, 28) 1.0
(0, 31) 1.0
(0, 46) 1.0
(0, 55) 1.0
(0, 57) 1.0
(0, 96) 1.0
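(0, 99) 39.0
After this, I tried to split the data and got the error shown above. I have used ColumnTransformer and OneHotEncoder before, but I don't know what is wrong this time. Also, I am not using the code library anywhere in this code.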
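The error text shows a sparse matrix wrapped inside an array with dtype=object. That is consistent with np.array() being called on a SciPy sparse matrix: it does not densify the matrix but wraps it in a 0-dimensional object array, which train_test_split cannot index. The sketch below is only an illustration under assumptions: `encoded` and `y` are hypothetical stand-ins for the ColumnTransformer output and the target, not the asker's data.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the encoded features and the target.
encoded = sparse.csr_matrix(np.eye(10))
y = np.arange(10)

# np.array() wraps the sparse matrix in a 0-dimensional object array
# instead of converting it to a dense 2-D array.
wrapped = np.array(encoded)
print(wrapped.shape, wrapped.dtype)

# Option 1: pass the sparse matrix to train_test_split directly.
X_train, X_test, y_train, y_test = train_test_split(
    encoded, y, test_size=0.2, random_state=0
)

# Option 2: densify explicitly with .toarray() before splitting.
dense = encoded.toarray()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    dense, y, test_size=0.2, random_state=0
)
print(X_train.shape, Xd_train.shape)
```

Keeping the matrix sparse saves memory when most entries are zero; .toarray() is simpler when the dense array fits comfortably in memory.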
Answered 2022-03-04 19:52:51
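My ColumnTransformer also produced compressed sparse data. I fixed it by passing sparse_threshold=0 to ColumnTransformer; its default value is 0.3. This seems to be a newer parameter, because the ColumnTransformer videos I watched did not need it and still produced the same result. Here is my code, in case it helps.
Original code: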
# This data is for a car sales CSV
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
# No sparse_threshold here, so the default of 0.3 applies
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough')
transformed_X = transformer.fit_transform(X)
transformed_X[:1], pd.DataFrame(transformed_X).head()
Original output:
(<1x16 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>,
0
0 (0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n...
1 (0, 0)\t1.0\n (0, 6)\t1.0\n (0, 13)\t1.0\n...
2 (0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n...
3 (0, 3)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n...
4 (0, 2)\t1.0\n (0, 6)\t1.0\n (0, 11)\t1.0\n...)
Code with sparse_threshold=0:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
# sparse_threshold=0 makes ColumnTransformer always return a dense array
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                                remainder='passthrough',
                                sparse_threshold=0)
transformed_X = transformer.fit_transform(X)
# Put in a data frame for viewing
transformed_X[:1], pd.DataFrame(transformed_X).head()
Output with sparse_threshold=0 (column 15 is the odometer value):
(array([[0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00,
0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
3.5431e+04]]),
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
4 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
15
0 35431.0
1 192714.0
2 84714.0
3 154365.0
4 181577.0 )
https://stackoverflow.com/questions/65463463
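To see why the same code can return dense output in one case and sparse output in another: when any transformer emits sparse data, ColumnTransformer stacks the result as a sparse matrix only if the overall density is below sparse_threshold, so the outcome depends on the data as well as on the setting. The sketch below uses a hypothetical toy frame (the column names echo the car sales example, but the rows are invented for illustration) whose encoded density is below 0.3.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the real car sales CSV.
X = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Nissan", "Ford", "Kia"],
    "Colour": ["White", "Blue", "Red", "Green", "Black", "Grey"],
    "Odometer": [35431, 192714, 84714, 154365, 181577, 42652],
})

def encode(sparse_threshold):
    """One-hot encode the categorical columns and pass Odometer through."""
    transformer = ColumnTransformer(
        [("one_hot", OneHotEncoder(), ["Make", "Colour"])],
        remainder="passthrough",
        sparse_threshold=sparse_threshold,
    )
    return transformer.fit_transform(X)

default_out = encode(0.3)  # 0.3 is the default; density here is 18/78
dense_out = encode(0)      # 0 means: always return a dense array

print(type(default_out), sparse.issparse(default_out))
print(type(dense_out), dense_out.shape)
```

An alternative with the same effect is to keep the default threshold and call .toarray() on the sparse result when a dense array is needed.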