Comparing the performance of machine learning algorithms for predicting Titanic survival

Stack Overflow user
Asked 2020-07-03 02:35:34

I am trying to work through a guide on building an ML model that predicts a person's likelihood of surviving the sinking of the Titanic.

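I am stuck on cell 21. It basically tries to compare the performance of 21 different ML algorithms after splitting the data. The final result should look like this: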

[Image: expected result of cell 21, if run correctly]

Cell 21:

# Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    # Ensemble Methods
    ensemble.AdaBoostClassifier(), 
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(), 
    ensemble.GradientBoostingClassifier(), 
    ensemble.RandomForestClassifier(), 
    
    # Gaussian Processes
    gaussian_process.GaussianProcessClassifier(), 
    
    # GLM
    linear_model.LogisticRegressionCV(), 
    linear_model.PassiveAggressiveClassifier(), 
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    
    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    
    # Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    
    # SVM
    svm.SVC(probability = True), 
    svm.NuSVC(probability = True), 
    svm.LinearSVC(), 
    
    # Trees
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    
    # Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),
    
    # xgboost
    XGBClassifier()
]

# Split dataset in cross-validation with this splitter class
# note: this is an alternative to train_test_split
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
# run model 10x with a 60/30 train/test split, intentionally leaving out 10%

# Create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 
               'MLA Test Accuracy 3*STD', 'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)

# Create table to compare MLA predictions (copy so that data1 itself is not modified)
MLA_predict = data1[Target].copy()

# Index through MLA and save performance to table
row_index = 0
for alg in MLA:
    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())
    
    # score model with cross validation
    cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    print(cv_results.keys())
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
    
    # If this is an unbiased random sample, then +/-3 standard deviations (std) from the mean should statistically
    # capture 99.7% of the subsets.
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3
    # Let's know the worst that can happen!
    
    # Save MLA predictions
    alg.fit(data1[data1_x_bin], data1[Target])
    MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])
    
    row_index+=1

# Print and sort table
MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)
MLA_compare
# MLA_predict

After running it, I get the following error:

dict_keys(['fit_time', 'score_time', 'test_score'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-cbe9dc24e1e0> in <module>
     67     MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
     68     print(cv_results.keys())
---> 69     MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
     70     MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
     71 

KeyError: 'train_score'

As you can see, 'train_score' does not even appear in cv_results.keys().


1 Answer

Stack Overflow user

Accepted answer

Posted 2020-07-05 07:24:13

According to the sklearn.model_selection.cross_validate documentation, the train_score entries are only returned when return_train_score is set to True, like this:

cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split, return_train_score=True)
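
To see the difference in isolation, here is a minimal sketch that can be run outside the notebook. It assumes scikit-learn 0.21 or later, where (as far as I know) the default of return_train_score is False, which would explain why the guide's original cell no longer produces a 'train_score' key. The synthetic data from make_classification and the single DecisionTreeClassifier are stand-ins I chose for illustration; they are not the Titanic data or the full MLA list from the question.

```python
from sklearn import model_selection, tree
from sklearn.datasets import make_classification

# Assumption: synthetic data standing in for data1[data1_x_bin] and data1[Target].
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Same splitter settings as in the question: 10 splits, 60% train, 30% test.
cv_split = model_selection.ShuffleSplit(
    n_splits=10, test_size=0.3, train_size=0.6, random_state=0
)
alg = tree.DecisionTreeClassifier(random_state=0)

# Default call: on recent scikit-learn versions no 'train_score' key is returned.
default_results = model_selection.cross_validate(alg, X, y, cv=cv_split)
print(sorted(default_results.keys()))

# Explicitly requesting training scores adds the 'train_score' key.
cv_results = model_selection.cross_validate(
    alg, X, y, cv=cv_split, return_train_score=True
)
print(sorted(cv_results.keys()))

# The same summary statistics the question's loop stores in MLA_compare.
print("train accuracy mean:", cv_results["train_score"].mean())
print("test accuracy mean:", cv_results["test_score"].mean())
print("test accuracy 3*std:", cv_results["test_score"].std() * 3)
```

In the question's loop, the only change needed is the extra return_train_score=True argument shown in the answer; the rest of the cell can stay as it is.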
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/62702952
