首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何基于is.na()和匹配条件一次连接多个列?

如何基于is.na()和匹配条件一次连接多个列?
EN

Stack Overflow用户
提问于 2021-07-12 10:07:20
回答 1查看 71关注 0票数 0

我有大量的蛋白质组数据集。我在找一辆dplyr-solution。

a是从两个独立的数据集中合并的,其中一个是b

代码语言:javascript
复制
> tail(a)
# A tibble: 6 x 5
  Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
  <chr>                <chr>                 <chr>       <chr>         <chr>     
1 NA                   Q9Y2X3                NA          NA            NA        
2 NA                   Q9Y3B4                NA          NA            NA        
3 NA                   Q9Y3I0                NA          NA            NA        
4 NA                   Q9Y4P9                NA          NA            NA        
5 NA                   Q9Y696                NA          NA            NA        
6 NA                   Q9Y6C9                NA          NA            NA

代码语言:javascript
复制
> tail(b)
# A tibble: 6 x 5
  Majority_protein_I… Majority_protein_I… Protein_IDs  Protein_names                  Gene_names
  <chr>               <chr>               <chr>        <chr>                          <chr>     
1 Q9Y617              Q9Y617              Q9Y617       Phosphoserine aminotransferase PSAT1     
2 Q9Y646              Q9Y646              Q9Y646       Carboxypeptidase Q             CPQ       
3 Q9Y696              Q9Y696              Q9Y696       Chloride intracellular channe… CLIC4     
4 Q9Y6C9              Q9Y6C9              Q9Y6C9       Mitochondrial carrier homolog… MTCH2     
5 Q9Y6N7              Q9Y6N7              Q9Y6N7;Q9HC… Roundabout homolog 1           ROBO1     
6 Q9Y6R7              Q9Y6R7              Q9Y6R7       IgGFc-binding protein          FCGBP  

如您所见,许多NA存在于a中,唯一已知的信息是a$Majority_protein_IDs_

我希望从b中提取此信息,以在a中填写NA,以便从b中填充a中所有列中的所有NA行。

有点像

如果d26与b$Majority_protein_IDs、b$Protein_IDs、b$Protein_names和b$Gene_names

  • Keep中的所有行都与D33匹配,则
  • >D26、D28、D29和D33中的所有行都与D34或从b

匹配。

我已经尝试过left_join和ifelse()的一些变体;但是,我还没有成功。

预期输出

代码语言:javascript
复制
> tail(a)
# A tibble: 6 x 5
  Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
  <chr>                <chr>                 <chr>       <chr>         <chr>     
1 NA                   Q9Y2X3                NA          NA            NA        
2 NA                   Q9Y3B4                NA          NA            NA        
3 NA                   Q9Y3I0                NA          NA            NA        
4 Q9Y6C9               Q9Y6C9                Q9Y6C9      Mitochondrial carrier homolog… MTCH2        
5 Q9Y696               Q9Y696                Q9Y696      Chloride intracellular channe… CLIC4        
6 NA                   Q9Y6C9                NA          NA            NA

数据

代码语言:javascript
复制
a <- structure(list(Majority_protein_IDs = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9", 
    "Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), Protein_names = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame")) 

代码语言:javascript
复制
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696", 
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617", 
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617", 
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase", 
"Carboxypeptidase Q", "Chloride intracellular channel protein 4", 
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1", 
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))

更新

两个数据集a和b不一定有相同的列数。因此,该解决方案必须与a和b之间的不同列数兼容。

假设a看起来像这样

代码语言:javascript
复制
> tail(a)
# A tibble: 6 x 7
  Intensity_CH1 Intensity_CH10 Majority_protei… Majority_protei… Protein_IDs Protein_names
  <chr>         <chr>          <chr>            <chr>            <chr>       <chr>        
1 NaN           NaN            NA               Q9Y2X3           NA          NA           
2 NaN           NaN            NA               Q9Y3B4           NA          NA           
3 NaN           NaN            NA               Q9Y3I0           NA          NA           
4 NaN           NaN            NA               Q9Y4P9           NA          NA           
5 NaN           NaN            NA               Q9Y696           NA          NA           
6 NaN           NaN            NA               Q9Y6C9           NA          NA           
# … with 1 more variable: Gene_names <chr>

而b没有改变

代码语言:javascript
复制
> tail(b)
# A tibble: 6 x 5
  Majority_protein_I… Majority_protein_I… Protein_IDs  Protein_names                  Gene_names
  <chr>               <chr>               <chr>        <chr>                          <chr>     
1 Q9Y617              Q9Y617              Q9Y617       Phosphoserine aminotransferase PSAT1     
2 Q9Y646              Q9Y646              Q9Y646       Carboxypeptidase Q             CPQ       
3 Q9Y696              Q9Y696              Q9Y696       Chloride intracellular channe… CLIC4     
4 Q9Y6C9              Q9Y6C9              Q9Y6C9       Mitochondrial carrier homolog… MTCH2     
5 Q9Y6N7              Q9Y6N7              Q9Y6N7;Q9HC… Roundabout homolog 1           ROBO1     
6 Q9Y6R7              Q9Y6R7              Q9Y6R7       IgGFc-binding protein          FCGBP 

数据

代码语言:javascript
复制
a <- structure(list(Intensity_CH1 = c("NaN", "NaN", "NaN", "NaN", 
"NaN", "NaN"), Intensity_CH10 = c("NaN", "NaN", "NaN", "NaN", 
"NaN", "NaN"), Majority_protein_IDs = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9", 
    "Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), Protein_names = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

代码语言:javascript
复制
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696", 
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617", 
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617", 
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase", 
"Carboxypeptidase Q", "Chloride intracellular channel protein 4", 
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1", 
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-07-12 11:05:21

我认为在这种情况下,一个基本的解决方案可能更有帮助。dplyr有一些方便的特性,在这里没有帮助,可能会使解决方案更加复杂。

试试这个:

代码语言:javascript
复制
# which protein ids are missing in a (missing means here na in the column Protein_IDs )
missing_prot_ids <- unique(a[is.na(a$Protein_IDs),][["Majority_protein_IDs_"]])

# select every row in b handling those protein ids
b_selected <- b[b$Majority_protein_IDs_ %in% missing_prot_ids, ]

# append the selected rows to the a dataframe, return resulting df
# use bind_rows in order to bind dataframes with different columns
# cols which are not in the other frame are imputed as NA
res_df <- bind_rows(a,b_selected)

现在,除了additonal col之外,它确实与预期的输出完全匹配。我将本专栏添加到a中,以演示bind_rows的行为

代码语言:javascript
复制
# A tibble: 8 x 6
  Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names                            Gene_names additional_col
  <chr>                <chr>                 <chr>       <chr>                                    <chr>               <dbl>
1 NA                   Q9Y2X3                NA          NA                                       NA                      2
2 NA                   Q9Y3B4                NA          NA                                       NA                      2
3 NA                   Q9Y3I0                NA          NA                                       NA                      2
4 NA                   Q9Y4P9                NA          NA                                       NA                      2
5 NA                   Q9Y696                NA          NA                                       NA                      2
6 NA                   Q9Y6C9                NA          NA                                       NA                      2
7 Q9Y696               Q9Y696                Q9Y696      Chloride intracellular channel protein 4 CLIC4                  NA
8 Q9Y6C9               Q9Y6C9                Q9Y6C9      Mitochondrial carrier homolog 2          MTCH2                  NA

现在是多余的,但对于注释上下文来说很重要:您可能会评论为什么您的预期输出对于ID Q9Y696没有价值。我觉得应该符合你的标准。

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/68345523

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档