How can I elegantly work across multiple columns and conditionally populate new columns?

Stack Overflow user
Asked on 2021-05-13 06:28:08
Answers: 1 · Views: 920 · Followers: 0 · Votes: 2

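As you can see, I am dealing with some seriously dirty data. This code works, but it feels clunky. Is there a more efficient, more dynamic way to reach the same end result without so much code?

I have to do this in stages: first flag the content type of each value, then use those flags to populate the column for the corresponding type.

Thanks for your help.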

#load library
library(dplyr)
library(stringr)
library(lubridate)

#create sample data
df <- tibble(c1 = c('9996155', '4001096', '4001525', '4000590','2020-01-23', '2019-12-23', '2020-01-20', '2019-12-08'),
             c2 = c('4001902', '5000009', '2020-01-23', '2019-12-23', '2020-01-20', '2019-12-08', '4000461', '4000311'),
             c3 = c('W-7', 'W-8', 'W-9', 'W-2', 'W-1', 'W-1','3.527E+20', '3.498E+20'),
             c4 = c('B09/20', 'B04/20', 'B05/20', 'B10/20', 'B06/20',  '3.408E+20', '3.229E+20', '3.225E+20')
             )

The data looks like this:

> df
# A tibble: 8 x 4
  c1         c2         c3        c4       
  <chr>      <chr>      <chr>     <chr>    
1 9996155    4001902    W-7       B09/20   
2 4001096    5000009    W-8       B04/20   
3 4001525    2020-01-23 W-9       B05/20   
4 4000590    2019-12-23 W-2       B10/20   
5 2020-01-23 2020-01-20 W-1       B06/20   
6 2019-12-23 2019-12-08 W-1       3.408E+20
7 2020-01-20 4000461    3.527E+20 3.229E+20
8 2019-12-08 4000311    3.498E+20 3.225E+20

This is what I did to get it into good shape:

df %>%
  mutate(across(#flag them now to allow next step for data population
    starts_with('c'),
    ~ case_when(
      is.na(.) ~ NA_character_,
      str_detect(., regex('(^20[1,2][0-9]\\-)|(\\/20[1,2][0-9]$)')) ~ 'date',
      str_detect(., regex('\\d\\.\\d{3}[eE][+-]\\d{2}+')) ~ 'numericScientificNotation',
      str_detect(.,regex('(^[a-zA-Z][0-9]{2}\\/2[0-1]{1}$)|(^[A-Z]{1,2}\\-\\d.*[a-zA-Z]*$)|(^[a-zA-Z][0-9]{2})|(^[A-Z][0-9]$)')) ~ 'batches',
      str_detect(., regex('^-?\\d+$')) ~ 'integers',
      TRUE ~ NA_character_
    ),
    .names = paste0('test', "_{col}")
  )) %>% #casewhen to populate new columns
  mutate(integer = case_when(test_c1 == 'integers' ~ c1,
                             test_c2 == 'integers' ~ c2,
                             test_c3 == 'integers' ~ c3,
                             test_c4 == 'integers' ~ c4),
         date = case_when(test_c1 == 'date' ~ c1,
                             test_c2 == 'date' ~ c2,
                             test_c3 == 'date' ~ c3,
                             test_c4 == 'date' ~ c4),
         batches = case_when(test_c1 == 'batches' ~ c1,
                               test_c2 == 'batches' ~ c2,
                               test_c3 == 'batches' ~ c3,
                               test_c4 == 'batches' ~ c4),
         numericScientificNotation = case_when(test_c1 == 'numericScientificNotation' ~ c1,
                               test_c2 == 'numericScientificNotation' ~ c2,
                               test_c3 == 'numericScientificNotation' ~ c3,
                               test_c4 == 'numericScientificNotation' ~ c4)
         ) %>% 
  select(9:12) #this is all that i need

This organized output is all I need.

Thanks!

# A tibble: 8 x 4
  integer date       batches numericScientificNotation
  <chr>   <chr>      <chr>   <chr>                    
1 9996155 NA         W-7     NA                       
2 4001096 NA         W-8     NA                       
3 4001525 2020-01-23 W-9     NA                       
4 4000590 2019-12-23 W-2     NA                       
5 NA      2020-01-23 W-1     NA                       
6 NA      2019-12-23 W-1     3.408E+20                
7 4000461 2020-01-20 NA      3.527E+20                
8 4000311 2019-12-08 NA      3.498E+20                

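1 Answer

Stack Overflow user

Accepted answer

Posted on 2021-05-13 06:52:09

Here is one way to simplify the process and reduce the repetition: keep the patterns in a named list, and for each pattern blank out the non-matching cells and take the first remaining value in each row.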

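library(dplyr)
library(stringr) # needed for str_detect()

# one regex per target column; the list names become the output column names
regex_list <- list(date = '(^20[1,2][0-9]\\-)|(\\/20[1,2][0-9]$)', 
                  numericScientificNotation = '\\d\\.\\d{3}[eE][+-]\\d{2}+', 
                  batches = '(^[a-zA-Z][0-9]{2}\\/2[0-1]{1}$)|(^[A-Z]{1,2}\\-\\d.*[a-zA-Z]*$)|(^[a-zA-Z][0-9]{2})|(^[A-Z][0-9]$)', 
                  integers = '^-?\\d+$')

# for each pattern: set non-matching cells to NA, then coalesce across the
# columns so each row keeps its first matching value
purrr::imap_dfc(regex_list, function(pattern, name) 
                  df %>%
                    mutate(across(everything(), ~ifelse(str_detect(.x, pattern), .x, NA))) %>%
                    transmute(!!name := do.call(coalesce, .)))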

#  date       numericScientificNotation batches integers
#  <chr>      <chr>                     <chr>   <chr>   
#1 NA         NA                        W-7     9996155 
#2 NA         NA                        W-8     4001096 
#3 2020-01-23 NA                        W-9     4001525 
#4 2019-12-23 NA                        W-2     4000590 
#5 2020-01-23 NA                        W-1     NA      
#6 2019-12-23 3.408E+20                 W-1     NA      
#7 2020-01-20 3.527E+20                 NA      4000461 
#8 2019-12-08 3.498E+20                 NA      4000311 
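
To make the mask-then-coalesce idea easier to follow, here is a minimal sketch that applies only the date pattern to a two-column subset of the sample data (rows 3, 5 and 7 of `c1` and `c2`). The names `demo`, `date_pattern`, `masked` and `result` are made up for this illustration and do not appear in the code above.

```r
library(dplyr)
library(stringr)

# Illustrative subset: rows 3, 5 and 7 of c1 and c2 from the sample data
demo <- tibble(
  c1 = c("4001525", "2020-01-23", "2020-01-20"),
  c2 = c("2020-01-23", "2020-01-20", "4000461")
)

# Same date regex as in regex_list
date_pattern <- "(^20[1,2][0-9]\\-)|(\\/20[1,2][0-9]$)"

# Step 1: keep cells that match the pattern, set every other cell to NA
masked <- demo %>%
  mutate(across(everything(),
                ~ if_else(str_detect(.x, date_pattern), .x, NA_character_)))

# Step 2: per row, take the first non-NA value, scanning columns left to right
result <- masked %>%
  transmute(date = coalesce(c1, c2))

result$date
# "2020-01-23" "2020-01-23" "2020-01-20"
```

When a row holds two values of the same type (the second row here has a date in both columns), `coalesce()` keeps the leftmost one, which is the same choice the original `case_when()` chain makes. Unlike that chain, each pattern here is tested independently, so a value that happened to match two patterns would appear in both output columns.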
Votes: 3

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/67514807
