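I know that to check whether a model is overfitting, we need to compute its scores on the training and test sets and compare them. My question is how to turn that into code. Since this is my first time doing this, I searched around and found an old answer here: (checking for overfitting or underfitting).
Note: the original answer used from sklearn.cross_validation import KFold. I replaced that line with: from sklearn.model_selection import KFold
However, when I run the code, I get the error below. How can I fix it?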
X_normalized, y_for_normalized = scaled_df[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], scaled_df [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
new_model = DecisionTreeRegressor(max_depth=9,
min_samples_split=10,random_state=0)
import numpy as np
from sklearn.metrics import SCORERS
from sklearn.model_selection import KFold
scorer = SCORERS['r2']
cv = KFold(5)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    new_model.fit(X[train], y[train])
    train_scores.append(scorer(new_model, X[train], y[train]))
    test_scores.append(scorer(new_model, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/mm/r4gnnwl948zclfyx12w803040000gn/T/ipykernel_73165/4218536717.py in <module>
7 train_scores, test_scores = [], []
8 for train, test in cv.split(X_normalized):
----> 9 new_model.fit(X[train], y[train])
10 train_scores.append(scorer(new_model, X[train], y[train]))
11 test_scores.append(scorer(new_model, X[test], y[test]))
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
KeyError: "None of [Int64Index([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,\n 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,\n 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,\n 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,\n 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],\n dtype='int64')] are in the [columns]"

Posted on 2022-08-16 05:27:08
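According to the user guide, you need to call cv.split(X) to get the iterator:
for train, test in cv.split(X):
    regressor.fit(X[train], y[train])
    ...
Also keep in mind that train and test are lists of indices. With numpy arrays this is not a problem (indexing with a list of indices instead of a single index gives the result you expect), but the same is not true for a regular list or for indexable objects in general.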
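The KeyError in the question ends with "are in the [columns]", which suggests the index arrays were used to look up column labels in a pandas DataFrame rather than to select rows. The sketch below illustrates that difference on a small made-up frame (the data and the column names a and b are assumptions, not the question's scaled_df) and shows two ways to select rows by position.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's scaled_df.
X_arr = np.arange(20, dtype=float).reshape(10, 2)
X_df = pd.DataFrame(X_arr, columns=["a", "b"])

# Integer row positions, like the `train` array that cv.split() yields.
train = np.arange(2, 10)

# NumPy: indexing with an integer array selects rows.
rows_from_array = X_arr[train]

# pandas: df[int_array] is treated as a list of column labels,
# so it raises KeyError when no column carries those labels.
try:
    X_df[train]
    raised = False
except KeyError:
    raised = True

# Two ways to select rows by position from a DataFrame.
rows_via_iloc = X_df.iloc[train]
rows_via_numpy = X_df.to_numpy()[train]

print(raised)               # True: the plain [] lookup failed
print(rows_via_iloc.shape)  # (8, 2)
```

Using .iloc keeps the column names attached to the data, while to_numpy() gives a plain array that behaves as the original answer assumes.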
Posted on 2022-08-16 05:45:57
X_normalized, y_for_normalized = scaled_df[[ "Part's Z-Height (mm)","Part's Solid Volume (cm^3)","Layer Height (mm)","Printing/Scanning Speed (mm/s)","Part's Orientation (Support's volume) (cm^3)"]], scaled_df [["Climate change (kg CO2 eq.)","Climate change, incl biogenic carbon (kg CO2 eq.)","Fine Particulate Matter Formation (kg PM2.5 eq.)","Fossil depletion (kg oil eq.)","Freshwater Consumption (m^3)","Freshwater ecotoxicity (kg 1,4-DB eq.)","Freshwater Eutrophication (kg P eq.)","Human toxicity, cancer (kg 1,4-DB eq.)","Human toxicity, non-cancer (kg 1,4-DB eq.)","Ionizing Radiation (Bq. C-60 eq. to air)","Land use (Annual crop eq. yr)","Marine ecotoxicity (kg 1,4-DB eq.)","Marine Eutrophication (kg N eq.)","Metal depletion (kg Cu eq.)","Photochemical Ozone Formation, Ecosystem (kg NOx eq.)","Photochemical Ozone Formation, Human Health (kg NOx eq.)","Stratospheric Ozone Depletion (kg CFC-11 eq.)","Terrestrial Acidification (kg SO2 eq.)","Terrestrial ecotoxicity (kg 1,4-DB eq.)"]]
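import numpy as np
from sklearn.metrics import get_scorer  # SCORERS was deprecated in scikit-learn 1.1 and later removed
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
new_model = DecisionTreeRegressor(max_depth=9,
                                  min_samples_split=10, random_state=0)
scorer = get_scorer('r2')
cv = KFold(5)
train_scores, test_scores = [], []
for train, test in cv.split(X_normalized):
    # train/test hold integer row positions, so index the DataFrames with .iloc;
    # plain X_normalized[train] would look the integers up as column labels
    X_train, y_train = X_normalized.iloc[train], y_for_normalized.iloc[train]
    X_test, y_test = X_normalized.iloc[test], y_for_normalized.iloc[test]
    new_model.fit(X_train, y_train)
    train_scores.append(scorer(new_model, X_train, y_train))
    test_scores.append(scorer(new_model, X_test, y_test))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)

https://stackoverflow.com/questions/73368997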
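As an alternative to the manual loop, scikit-learn's cross_validate can return both training and test scores in one call, which avoids indexing the data by hand. The sketch below is only an illustration: the synthetic X and y are assumptions standing in for X_normalized and y_for_normalized, while the model settings are copied from the question.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; the question's scaled_df is not available).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(100, 3))

model = DecisionTreeRegressor(max_depth=9, min_samples_split=10, random_state=0)

# return_train_score=True adds a "train_score" array next to "test_score".
results = cross_validate(
    model, X, y,
    scoring="r2",
    cv=KFold(n_splits=5),
    return_train_score=True,
)

mean_train_score = results["train_score"].mean()
mean_test_score = results["test_score"].mean()
gap = mean_train_score - mean_test_score
print(f"train R^2 = {mean_train_score:.3f}, "
      f"test R^2 = {mean_test_score:.3f}, gap = {gap:.3f}")
```

A training score that is much higher than the test score is the usual sign of overfitting. cross_validate also accepts pandas DataFrames directly, so the two frames from the question could be passed in place of X and y.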