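I have a dataset of neighbourhoods in large Dutch cities, with energy consumption as the dependent variable and several independent variables. I want to build a random forest regression model that predicts values only for neighbourhoods in Amsterdam. So far I have tried training the model on the Amsterdam neighbourhoods alone, but that dataset is too small and the accuracy scores (RMSE, MAE, R2) are poor, even though the model performs well on the whole large_city dataset.
What I mainly want to do is a 10-fold cross-validation of the RF model. I want to split only the Amsterdam data into 10 folds, then add the rest of the large_city dataset (that is, every neighbourhood outside Amsterdam) to the training set of every fold, while keeping the test folds unchanged.
So in short:
amsterdam = large_cities == 'Amsterdam'
without_amsterdam = large_cities != 'Amsterdam'
10-fold cross-validation, with 1/10 of Amsterdam as test data and 9/10 of Amsterdam plus all of without_amsterdam as training data in each fold.
The code I have written so far is: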
import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold, cross_val_score
amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']
# features start at the fifth column; the target is the fourth column
X = amsterdam.iloc[:, 4:].values
y = amsterdam.iloc[:, 3].values
# Split the data into 10 folds.
# I will use this 'kf' (KFold splitting strategy) object as
# input to the cross_val_score() method.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
cnt = 1
# The split() method generates indices to split the data into training and test sets.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1
def rmse(score):
    # score is a negated MSE, so flip the sign before taking the square root
    value = np.sqrt(-score)
    print(f'rmse= {value:.2f}')
score = cross_val_score(ensemble.RandomForestRegressor(random_state=42),
                        X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold are: {score}')
rmse(score.mean())
What the code above does is a 10-fold cross-validation on the Amsterdam data only. How can I add the without_ams data to every training fold of Amsterdam?
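I hope it is clear what I am trying to do.
Posted on 2021-05-13 21:45:46
You can pass your own train and test indices to cross_val_score through its cv argument; see the help page. So in your case, using an example dataset: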
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
big_cities = pd.DataFrame(np.random.normal(0, 1, (200, 6)))
big_cities.insert(0, 'gm_naam',
                  np.random.choice(['Amsterdam', 'Stockholm', 'Copenhagen'], 200))
The key is to concatenate the data frames with Amsterdam first, followed by the others; you could also do this by sorting:
amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']
combined = pd.concat([amsterdam, without_ams])
# positions of the non-Amsterdam rows in combined (they follow the Amsterdam rows)
non_amsterdam_index = np.arange(len(amsterdam), len(combined))
Now we use only the Amsterdam part to get the CV indices:
X = amsterdam.iloc[:, 4:]
y = amsterdam.iloc[:, 3]
kf = KFold(n_splits=3, shuffle=True, random_state=42)
We append the non-Amsterdam indices to each training fold:
cvs = [[np.append(i, non_amsterdam_index), j] for i, j in kf.split(X, y)]
We can check this:
for train, test in cvs:
    print("train composition")
    print(combined.iloc[train]["gm_naam"].value_counts())
    print("test composition")
    print(combined.iloc[test]["gm_naam"].value_counts())
You can see that every test fold contains only Amsterdam rows (the exact counts depend on the random sample data):
train composition
Amsterdam 48
Copenhagen 33
Stockholm 21
Name: gm_naam, dtype: int64
test composition
Amsterdam 25
Name: gm_naam, dtype: int64
train composition
Amsterdam 49
Copenhagen 33
Stockholm 21
Name: gm_naam, dtype: int64
test composition
Amsterdam 24
Name: gm_naam, dtype: int64
train composition
Amsterdam 49
Copenhagen 33
Stockholm 21
Name: gm_naam, dtype: int64
test composition
Amsterdam 24
Name: gm_naam, dtype: int64
Then compute the scores:
score = cross_val_score(RandomForestRegressor(random_state=42),
                        X=combined.iloc[:, 4:],
                        y=combined.iloc[:, 3],
                        cv=cvs, scoring="neg_mean_squared_error")
https://stackoverflow.com/questions/67450374
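For reference, the steps in the answer above can be collected into one self-contained sketch that also converts the negated MSE scores into a per-fold RMSE. This is an illustration under assumptions, not the answerer's code: the helper name `target_only_folds` is hypothetical, the data is synthetic with placeholder city names, and the column positions (target in the fourth column, features from the fifth) are simply carried over from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def target_only_folds(n_target, n_extra, n_splits=10, seed=42):
    """Hypothetical helper: build (train, test) index pairs for a frame whose
    first n_target rows are the target city and whose next n_extra rows are
    the other cities. Only target rows ever land in a test fold."""
    extra_idx = np.arange(n_target, n_target + n_extra)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # KFold only needs the number of samples, so a dummy array is enough
    return [(np.concatenate([train, extra_idx]), test)
            for train, test in kf.split(np.zeros(n_target))]


# Synthetic stand-in data (assumption: replace with the real big_cities frame)
rng = np.random.default_rng(0)
big_cities = pd.DataFrame(rng.normal(0, 1, (200, 6)))
big_cities.insert(0, 'gm_naam',
                  rng.choice(['Amsterdam', 'Rotterdam', 'Utrecht'], 200))

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']
# Amsterdam rows first, so positions 0..len(amsterdam)-1 are the target city
combined = pd.concat([amsterdam, without_ams])

cvs = target_only_folds(len(amsterdam), len(without_ams), n_splits=10)

# Sanity check: test folds hold only Amsterdam rows, and every row of
# combined is used exactly once per fold (either in train or in test)
for train, test in cvs:
    assert (combined.iloc[test]['gm_naam'] == 'Amsterdam').all()
    assert len(train) + len(test) == len(combined)

neg_mse = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=42),
                          X=combined.iloc[:, 4:], y=combined.iloc[:, 3],
                          cv=cvs, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)  # scores are negated MSE values
print(f'RMSE per fold: {np.round(rmse_per_fold, 2)}')
print(f'Mean RMSE: {rmse_per_fold.mean():.2f}')
```

Taking the square root fold by fold and then averaging gives the mean RMSE, which is generally not the same number as the square root of the mean MSE, so it helps to state which of the two is being reported.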