首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >tree模型软件包:R中的一般线性模型(glm)和决策树(套袋树、增强树和随机林)模型

tree模型软件包:R中的一般线性模型(glm)和决策树(套袋树、增强树和随机林)模型
EN

Stack Overflow用户
提问于 2020-11-18 11:20:37
回答 1查看 299关注 0票数 1

问题

我试图使用R中的Tidymodels包进行分析。以下是关于R中决策树学习的教程:-

教程

https://bcullen.rbind.io/post/2020-06-02-tidymodels-decision-tree-learning-in-r/

我有一个名为FID (见下文)的数据框架,其中因变量是频率(数字),预测变量是:-年份(数字)、月份(因子)、季风(因子)和天数(数字)。

我相信我已经成功地学习了名为"Tidymodels:决策树学习在R"的教程,构建了一个套袋树、随机森林和增强树模型。

对于这种分析,我还想构造一个一般线性模型(glm),以便在所有模型(即随机森林模型、套袋树模型、增强树模型和一般线性模型)之间进行模型比较,建立最优的拟合模型。所有模型都要经过10倍交叉验证,以减少过度拟合的偏差.

问题

随后,我尝试调整教程中的代码(请参阅下面)以适应glm模型,但我很困惑是否对模型进行了适当的调整。当我试图在模型都适合之后生成rmse值时,我不确定glm代码的这个元素是否正在产生下面的错误消息:-

错误消息

代码语言:javascript
复制
Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.

此外,我不确定这些模型的()函数中表示的R代码是否足够或正确,在对每个模型进行拟合之前,这在处理步骤中非常重要。在我看来,我想知道模型的配方是否可以改进。

如果这是可能的,我想知道是否有人可以帮助我在拟合glm模型时的错误信息,与修正配方(如果这是必要的)。

我不是一个先进的R编码器,我已经做了一个彻底的尝试,试图通过研究其他Tidymodel教程找到一个解决方案;但是,我不明白这个错误信息意味着什么。因此,如果有人能提供帮助,我想表达我最深切的谢意。

在此之前,非常感谢您。

R-代码

代码语言:javascript
复制
##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)

###########################################################
# Put 3/4 of the data into the training set
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df, prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe
##Preprocessing
############################################################

rec <- recipe(Frequency ~ ., data = fid_df) %>% 
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
  step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
  step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

####Error message

Error: Problem with `mutate()` input `model`.
x Input `model` can't be recycled to size 4.
ℹ Input `model` is `c("bag", "rf", "boost")`.
ℹ Input `model` must be size 4 or 1, not 3.
Run `rlang::last_error()` to see where the error occurred.

#####################################################
##Out-of-sample performance
#####################################################

# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)

数据帧- FID

代码语言:javascript
复制
structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017,
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = "factor"), Monsoon = structure(c(2L,
2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 4L, 4L, 4L, 4L, 4L,
3L, 3L, 2L), .Label = c("First_Inter_Monssoon", "North_Monsoon",
"Second_Inter_Monsoon", "South_Monsson"), class = "factor"),
    Frequency = c(36, 28, 39, 46, 5, 0, 0, 22, 10, 15, 8,
    33, 33, 29, 31, 23, 8, 9, 7, 40, 41, 41, 30, 30, 44, 37,
    41, 42, 20, 0, 7, 27, 35, 27, 43, 38), Days = c(31,
    28, 31, 30, 6, 0, 0, 29, 15, 29, 29, 31, 31, 29, 30, 30,
    7, 0, 7, 30, 30, 31, 30, 27, 31, 28, 30, 30, 21, 0, 7, 26,
    29, 27, 29, 29)), row.names = c(NA, -36L), class = "data.frame")
EN

回答 1

Stack Overflow用户

发布于 2020-11-19 02:31:43

回答朱莉娅·西尔奇的建议

代码语言:javascript
复制
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(Tidy_df)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data)

###########################################################
##Produce the recipe

rec <- recipe(Frequency_Blue ~ ., data = Tidy_df) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

###########################################################
##Create Models
###########################################################

##########################################################
##General Linear Models
#########################################################

##glm
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##########################################################
##Bagged Trees
##########################################################

#####Bagged Trees
mod_bag <- bag_tree() %>%
            set_mode("regression") %>%
             set_engine("rpart", times = 10) #10 bootstrap resamples
                

##Create workflow
wflow_bag <- workflow() %>% 
                   add_recipe(rec) %>%
                       add_model(mod_bag)

##Fit the model
plan(multisession)

fit_bag <- fit_resamples(
                      wflow_bag,
                      cv,
                      metrics = metric_set(rmse, rsq),
                      control = control_resamples(save_pred = TRUE)
                      )

###################################################
##Random forests
###################################################

mod_rf <-rand_forest(trees = 1e3) %>%
                              set_engine("ranger",
                              num.threads = parallel::detectCores(), 
                              importance = "permutation", 
                              verbose = TRUE) %>% 
                              set_mode("regression") 
                              
##Create Workflow

wflow_rf <- workflow() %>% 
               add_model(mod_rf) %>% 
                     add_recipe(rec)

##Fit the model

plan(multisession)

fit_rf<-fit_resamples(
             wflow_rf,
             cv,
             metrics = metric_set(rmse, rsq),
             control = control_resamples(save_pred = TRUE)
             )

############################################################
##Boosted Trees
############################################################

mod_boost <- boost_tree() %>% 
                 set_engine("xgboost", nthreads = parallel::detectCores()) %>% 
                      set_mode("regression")

##Create workflow

wflow_boost <- workflow() %>% 
                  add_recipe(rec) %>% 
                    add_model(mod_boost)

##Fit model

plan(multisession)

fit_boost <-fit_resamples(
                       wflow_boost,
                       cv,
                       metrics = metric_set(rmse, rsq),
                       control = control_resamples(save_pred = TRUE)
                       )

##############################################
##Evaluate the models
##############################################

collect_metrics(fit_bag) %>% 
        bind_rows(collect_metrics(fit_rf)) %>%
          bind_rows(collect_metrics(fit_boost)) %>% 
            bind_rows(collect_metrics(fit_glm)) %>% 
              dplyr::filter(.metric == "rmse") %>% 
                dplyr::mutate(model = c("bag", "rf", "boost", "glm")) %>% 
                 dplyr::select(model, everything()) %>% 
                    knitr::kable()

##rmse values for all models

|model |.metric |.estimator |      mean|  n|  std_err|
|:-----|:-------|:----------|---------:|--:|--------:|
|bag   |rmse    |standard   |  8.929936| 10| 1.544587|
|rf    |rmse    |standard   | 10.188710| 10| 1.144354|
|boost |rmse    |standard   |  9.249904| 10| 1.489482|
|glm   |rmse    |standard   | 11.348420| 10| 2.217807|

#####################################################
##Out-of-sample performance
#####################################################
#glm

# bagged trees
final_fit_glm <- last_fit(
                     wflow_glm,
                        split = split)


# bagged trees
final_fit_bag <- last_fit(
                     wflow_bag,
                       split = split)
# random forest
final_fit_rf <- last_fit(
                  wflow_rf,
                    split = split)
# boosted trees
final_fit_boost <- last_fit(
                      wflow_boost,
                          split = split)
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/64892060

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档