Question: Improving recall/precision on an imbalanced dataset with tidymodels

Stack Overflow user

Asked 2022-09-25 19:16:42

1 answer · 69 views · 0 followers · 0 votes

I have the following code:

library(tidyverse)
library(tidymodels)
library(readxl)
library(themis)
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time


df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/pacientes.csv")

sapply(df, function(x) sum(is.na(x))/nrow(df))


df <- df %>%
  select(ICC, CPI, Valvulopatía,
         `Vascuopatía periférica`, ACxFA,
         HTA, EPOC, `AVC/AIT`, Demencia,
         `Osteoporosis/Fx patol`, Neoplasia,
         IRC, DBT, `Numero antecedentes`, Hb,
         `Nº medicacion habitual`, `Reserva Funcional (METS)`,
         `MNA (Nutricion)`, MMSE, Fragilidad, Dependencia,
         `Normal Gait speed`, Charlson, ASA, Complejidad,
         `Sobrevida al mes`, `Mujer`, `Vive con`, `Especialidad`,
         `Tipo de Anestesia`, `Transfusión`, `Creat`, `INR`)


df <- df %>%
  mutate(across(1:13, as_factor)) %>%
  mutate(Complejidad = as_factor(Complejidad),
         `Sobrevida al mes` = as_factor(`Sobrevida al mes`),
         Mujer = as_factor(Mujer),
         Especialidad = as_factor(Especialidad),
         `Tipo de Anestesia` = as_factor(`Tipo de Anestesia`),
         Transfusión = as_factor(Transfusión))

df <- df %>%
  mutate(`Vive con` = as.numeric(`Vive con`)) %>%
  drop_na()


data_split <- initial_split(df, prop = 0.8, strata = `Sobrevida al mes`)
train_data <- training(data_split)
test_data <- testing(data_split)

rec <- recipe(`Sobrevida al mes` ~ ., data = train_data)

prep_recipe <- rec %>% prep()

set.seed(100) # seed before vfold_cv() so the folds are reproducible
cv <- vfold_cv(train_data)


mod_rf <-rand_forest() %>%
  set_engine("ranger",
             num.threads = parallel::detectCores(), 
             importance = "permutation", 
             verbose = TRUE) %>% 
  set_mode("classification") %>% 
  set_args(trees = 1000)

wflow_bag <- workflow() %>% 
  add_recipe(rec) %>%
  add_model(mod_rf)

plan(multisession)

fit_rf <- fit_resamples(
  wflow_bag,
  cv,
  metrics = metric_set(accuracy, kap, recall, precision),
  control = control_resamples(verbose = TRUE,
                              save_pred = TRUE,
                              extract = function(x) x)
)

collect_metrics(fit_rf)

However, because the dataset is imbalanced, I get the following results:

# A tibble: 4 × 6
  .metric   .estimator    mean     n std_err .config             
  <chr>     <chr>        <dbl> <int>   <dbl> <chr>               
1 accuracy  binary       0.953    10  0.0223 Preprocessor1_Model1
2 kap       binary       0         4  0      Preprocessor1_Model1
3 precision binary     NaN         0 NA      Preprocessor1_Model1
4 recall    binary       0         4  0      Preprocessor1_Model1

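I want to improve these metrics. I have already tried SMOTE (via the themis library), but it did not work well. I am also unsure whether I am using as_factor correctly. I applied it to the binary variables, but there are also 1-10 scales, and I don't know whether those should be ordered factors or can be treated as numeric.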

tidyverse

tidymodels

1 Answer

Stack Overflow user

Answered 2022-09-26 12:44:37

I don't see anything wrong with your code.

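Your training set has only 11 events out of 186 rows. Once it is resampled, that number drops further within each fold, and you get the following note: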

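While computing binary precision(), no predicted events were detected (i.e. true_positive + false_positive = 0). Precision is undefined in this case, and NA will be returned. Note that 2 true event(s) actually occurred for the problematic event level, '0'.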
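To see where the NaN comes from, here is a minimal base R sketch of the arithmetic behind that note. The fold is hypothetical (19 rows, 2 true events, a model that always predicts the majority class "1"); it is not taken from the actual data, and it assumes the first factor level ("0") is the event, as the note indicates.

```r
# Hypothetical assessment fold: 19 rows, 2 true events (level "0"),
# and a model that always predicts the majority class "1".
truth    <- factor(c(rep("0", 2), rep("1", 17)), levels = c("0", "1"))
estimate <- factor(rep("1", 19), levels = c("0", "1"))

event <- "0"  # assumed event level (first factor level)

true_positive  <- sum(estimate == event & truth == event)  # 0
false_positive <- sum(estimate == event & truth != event)  # 0
false_negative <- sum(estimate != event & truth == event)  # 2

precision_value <- true_positive / (true_positive + false_positive)  # 0 / 0
recall_value    <- true_positive / (true_positive + false_negative)  # 0 / 2
accuracy_value  <- mean(estimate == truth)                           # 17 / 19

print(c(precision = precision_value,
        recall    = recall_value,
        accuracy  = accuracy_value))
```

Precision is 0 / 0 (NaN) because nothing was predicted as the event, recall is 0, and accuracy still looks high because the majority class dominates. That is the same pattern as in the metrics table in the question.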

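You can try one of the resampling tools in the themis package to increase the number of events. With these data, however, you may have to lower your performance expectations.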
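As a rough sketch of what that could look like, the example below uses step_upsample() from themis on simulated data with the same class counts (186 rows, 11 events). The toy tibble and the names outcome, x1 and x2 are made up for illustration. Upsampling is only one of the themis options, and whether it helps on the real data is not something this sketch can show.

```r
library(tidymodels)
library(themis)

set.seed(100)

# Simulated stand-in for the real data (hypothetical column names):
# 186 rows, 11 events at level "0", two noise predictors.
toy <- tibble(
  outcome = factor(c(rep("0", 11), rep("1", 175)), levels = c("0", "1")),
  x1 = rnorm(186),
  x2 = rnorm(186)
)

# over_ratio = 1 samples the minority class up to the majority frequency.
rec_up <- recipe(outcome ~ ., data = toy) %>%
  step_upsample(outcome, over_ratio = 1)

# The step is skipped when baking new data, so it only affects the
# data the recipe is prepped on, not the data used for assessment.
balanced <- rec_up %>%
  prep() %>%
  bake(new_data = NULL)

count(balanced, outcome)
```

In the original workflow, a recipe like this would replace rec in add_recipe(). Because the step is applied inside each resample, the assessment folds keep their original imbalance, so precision can still be undefined in folds where the model predicts no events.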

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/73847233
