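In the Categorical Variables exercise, the last part is to generate test predictions. I wrote the code below, but it raises an error. I don't understand the error: why does it say X has 148 features while the random forest is expecting 155?
My code: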
ohencoder=OneHotEncoder(handle_unknown='ignore', sparse=False)
# X_test.dropna(axis=0, inplace=True)
h_cols_test = pd.DataFrame(ohencoder.fit_transform(X_test[low_cardinality_cols])) # Your code here
h_cols_test.index=X_test.index
num_X_test= X_test.drop(object_cols, axis=1)
OH_X_test=pd.concat([num_X_test, h_cols_test], axis=1)
#randomforest mode-----------------------------
model=RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(OH_X_train, y_train)
preds_test= model.predict(OH_X_test)
#output---------------
output=pd.DataFrame({'Id': X_test.index,
'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Error message:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
FutureWarning,
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_33/1524045498.py in <module>
12 model.fit(OH_X_train, y_train)
13
---> 14 preds_test= model.predict(OH_X_test)
15
16 output=pd.DataFrame({'Id': X_test.index,
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in predict(self, X)
969 check_is_fitted(self)
970 # Check data
--> 971 X = self._validate_X_predict(X)
972
973 # Assign chunk of trees to jobs
/opt/conda/lib/python3.7/site-packages/sklearn/ensemble/_forest.py in _validate_X_predict(self, X)
577 Validate X whenever one tries to predict, apply, predict_proba."""
578 check_is_fitted(self)
--> 579 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
580 if issparse(X) and (X.indices.dtype != np.intc or X.indptr.dtype != np.intc):
581 raise ValueError("No support for np.int64 index based sparse matrices")
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
583
584 if not no_val_X and check_params.get("ensure_2d", True):
--> 585 self._check_n_features(X, reset=reset)
586
587 return out
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _check_n_features(self, X, reset)
399 if n_features != self.n_features_in_:
400 raise ValueError(
--> 401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
403 )
ValueError: X has 148 features, but RandomForestRegressor is expecting 155 features as input.
Posted on 2022-07-04 21:16:03
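Your training and test sets have different numbers of features. Either the model was trained on features that are missing from the test set, or the test set contains features the model never saw during training.
A likely cause is that the one-hot encoding was fitted separately on each dataset: a categorical value may appear in only one of them, so the two encoders produce different columns. Here the encoded test set is narrower (148 vs. 155), which suggests that some categories seen in training do not occur in the test data.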
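To make the mismatch concrete, here is a minimal sketch with made-up data (the `color` column and its values are hypothetical, not from the exercise). It fits a separate encoder on each set and compares the resulting widths:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns; "green" occurs only in the training data.
train = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# Fitting a fresh encoder on each set lets each one learn its own categories.
train_width = OneHotEncoder(handle_unknown="ignore").fit_transform(train).shape[1]
test_width = OneHotEncoder(handle_unknown="ignore").fit_transform(test).shape[1]

print(train_width, test_width)  # 3 2
```

A model trained on the three-column version would reject the two-column version in the same way the random forest does in the traceback.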
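One solution is to perform the one-hot encoding before splitting the data. Alternatively, call fit_transform on the training set and then only transform on the test set; in your code, ohencoder.fit_transform(X_test[low_cardinality_cols]) refits the encoder on the test data. Remember that new data should always go through transform only; this is a general rule for all scikit-learn transformers.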
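The fit-on-train, transform-on-test pattern might look roughly like this. It is a sketch on hypothetical toy data (`area`, `color`, and `cat_cols` are stand-ins for the exercise's columns), and it calls `.toarray()` on the default sparse output instead of passing `sparse=False`, since that keyword was renamed in later scikit-learn releases:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "blue" appears only in train, "green" only in test.
X_train = pd.DataFrame({"area": [50, 80, 65], "color": ["red", "blue", "red"]})
X_test = pd.DataFrame({"area": [70, 55], "color": ["green", "red"]})
cat_cols = ["color"]

encoder = OneHotEncoder(handle_unknown="ignore")

# Fit on the training data only; the learned categories define the output columns.
train_encoded = encoder.fit_transform(X_train[cat_cols]).toarray()
# Reuse the fitted encoder on the test data: transform, not fit_transform.
test_encoded = encoder.transform(X_test[cat_cols]).toarray()

# String column names built from the learned categories (single column here).
oh_names = [f"color_{c}" for c in encoder.categories_[0]]
OH_cols_train = pd.DataFrame(train_encoded, columns=oh_names, index=X_train.index)
OH_cols_test = pd.DataFrame(test_encoded, columns=oh_names, index=X_test.index)

# Drop the raw categorical columns and append the encoded ones.
OH_X_train = pd.concat([X_train.drop(cat_cols, axis=1), OH_cols_train], axis=1)
OH_X_test = pd.concat([X_test.drop(cat_cols, axis=1), OH_cols_test], axis=1)
```

Both frames end up with the same columns, and the unseen category `green` is encoded as all zeros because of `handle_unknown="ignore"`. Giving the encoded columns string names also avoids mixing integer and string column labels, which is what the FutureWarning in the output above complains about.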
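Of course, you should also make sure that every other transformation, such as dropping columns, is applied identically to the training and test sets. Pipelines are your best friend here.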
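As an illustration of the pipeline approach, the following sketch bundles the encoding and the model so that the same fitted preprocessing is applied at prediction time. The data and column names are hypothetical, and `n_estimators=10` is chosen only to keep the example small:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the exercise's train and test sets.
X_train = pd.DataFrame({"area": [50, 80, 65, 90],
                        "color": ["red", "blue", "red", "blue"]})
y_train = pd.Series([100.0, 180.0, 130.0, 200.0])
X_test = pd.DataFrame({"area": [70, 55], "color": ["green", "red"]})

# One-hot encode the categorical column and pass the numeric column through.
preprocessor = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# fit() fits the encoder and the model on the training data only.
pipeline.fit(X_train, y_train)
# predict() applies the already fitted encoder to the test data before predicting.
preds_test = pipeline.predict(X_test)
```

Because the encoder lives inside the pipeline, there is no separate step where it could accidentally be refitted on the test set.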
https://stackoverflow.com/questions/72861021