我有大量的蛋白质组数据集。我在找一辆dplyr-solution。
a是从两个独立的数据集中合并的,其中一个是b。
> tail(a)
# A tibble: 6 x 5
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 NA Q9Y2X3 NA NA NA
2 NA Q9Y3B4 NA NA NA
3 NA Q9Y3I0 NA NA NA
4 NA Q9Y4P9 NA NA NA
5 NA Q9Y696 NA NA NA
6 NA Q9Y6C9 NA NA NA和
> tail(b)
# A tibble: 6 x 5
Majority_protein_I… Majority_protein_I… Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 Q9Y617 Q9Y617 Q9Y617 Phosphoserine aminotransferase PSAT1
2 Q9Y646 Q9Y646 Q9Y646 Carboxypeptidase Q CPQ
3 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y6N7 Q9Y6N7 Q9Y6N7;Q9HC… Roundabout homolog 1 ROBO1
6 Q9Y6R7 Q9Y6R7 Q9Y6R7 IgGFc-binding protein FCGBP 如您所见,许多NA存在于a中,唯一已知的信息是a$Majority_protein_IDs_
我希望从b中提取此信息,以在a中填写NA,以便从b中填充a中所有列中的所有NA行。
有点像
如果d26与b$Majority_protein_IDs、b$Protein_IDs、b$Protein_names和b$Gene_names
Keep中的所有行都与D33匹配,则>D26、D28、D29和D33中的所有行都与D34或从b匹配。
我已经尝试过left_join和ifelse()的一些变体;但是,我还没有成功。
预期输出
> tail(a)
# A tibble: 6 x 5
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 NA Q9Y2X3 NA NA NA
2 NA Q9Y3B4 NA NA NA
3 NA Q9Y3I0 NA NA NA
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
6 NA Q9Y6C9 NA NA NA数据
a <- structure(list(Majority_protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9",
"Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Protein_names = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")) 和
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696",
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase",
"Carboxypeptidase Q", "Chloride intracellular channel protein 4",
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1",
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))更新
两个数据集a和b不一定有相同的列数。因此,该解决方案必须与a和b之间的不同列数兼容。
假设a看起来像这样
> tail(a)
# A tibble: 6 x 7
Intensity_CH1 Intensity_CH10 Majority_protei… Majority_protei… Protein_IDs Protein_names
<chr> <chr> <chr> <chr> <chr> <chr>
1 NaN NaN NA Q9Y2X3 NA NA
2 NaN NaN NA Q9Y3B4 NA NA
3 NaN NaN NA Q9Y3I0 NA NA
4 NaN NaN NA Q9Y4P9 NA NA
5 NaN NaN NA Q9Y696 NA NA
6 NaN NaN NA Q9Y6C9 NA NA
# … with 1 more variable: Gene_names <chr>而b没有改变
> tail(b)
# A tibble: 6 x 5
Majority_protein_I… Majority_protein_I… Protein_IDs Protein_names Gene_names
<chr> <chr> <chr> <chr> <chr>
1 Q9Y617 Q9Y617 Q9Y617 Phosphoserine aminotransferase PSAT1
2 Q9Y646 Q9Y646 Q9Y646 Carboxypeptidase Q CPQ
3 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channe… CLIC4
4 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog… MTCH2
5 Q9Y6N7 Q9Y6N7 Q9Y6N7;Q9HC… Roundabout homolog 1 ROBO1
6 Q9Y6R7 Q9Y6R7 Q9Y6R7 IgGFc-binding protein FCGBP 数据
a <- structure(list(Intensity_CH1 = c("NaN", "NaN", "NaN", "NaN",
"NaN", "NaN"), Intensity_CH10 = c("NaN", "NaN", "NaN", "NaN",
"NaN", "NaN"), Majority_protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Majority_protein_IDs_ = c("Q9Y2X3", "Q9Y3B4", "Q9Y3I0", "Q9Y4P9",
"Q9Y696", "Q9Y6C9"), Protein_IDs = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Protein_names = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Gene_names = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))和
b <- structure(list(Majority_protein_IDs = c("Q9Y617", "Q9Y646", "Q9Y696",
"Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Majority_protein_IDs_ = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7", "Q9Y6R7"), Protein_IDs = c("Q9Y617",
"Q9Y646", "Q9Y696", "Q9Y6C9", "Q9Y6N7;Q9HCK4", "Q9Y6R7"), Protein_names = c("Phosphoserine aminotransferase",
"Carboxypeptidase Q", "Chloride intracellular channel protein 4",
"Mitochondrial carrier homolog 2", "Roundabout homolog 1", "IgGFc-binding protein"
), Gene_names = c("PSAT1", "CPQ", "CLIC4", "MTCH2", "ROBO1",
"FCGBP")), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))发布于 2021-07-12 11:05:21
我认为在这种情况下,一个基本的解决方案可能更有帮助。dplyr有一些方便的特性,在这里没有帮助,可能会使解决方案更加复杂。
试试这个:
# which protein ids are missing in a (missing means here na in the column Protein_IDs )
missing_prot_ids <- unique(a[is.na(a$Protein_IDs),][["Majority_protein_IDs_"]])
# select every row in b handling those protein ids
b_selected <- b[b$Majority_protein_IDs_ %in% missing_prot_ids, ]
# append the selected rows to the a dataframe, return resulting df
# use bind_rows in order to bind dataframes with different columns
# cols which are not in the other frame are imputed as NA
res_df <- bind_rows(a,b_selected)现在,除了additonal col之外,它确实与预期的输出完全匹配。我将本专栏添加到a中,以演示bind_rows的行为
# A tibble: 8 x 6
Majority_protein_IDs Majority_protein_IDs_ Protein_IDs Protein_names Gene_names additional_col
<chr> <chr> <chr> <chr> <chr> <dbl>
1 NA Q9Y2X3 NA NA NA 2
2 NA Q9Y3B4 NA NA NA 2
3 NA Q9Y3I0 NA NA NA 2
4 NA Q9Y4P9 NA NA NA 2
5 NA Q9Y696 NA NA NA 2
6 NA Q9Y6C9 NA NA NA 2
7 Q9Y696 Q9Y696 Q9Y696 Chloride intracellular channel protein 4 CLIC4 NA
8 Q9Y6C9 Q9Y6C9 Q9Y6C9 Mitochondrial carrier homolog 2 MTCH2 NA现在是多余的,但对于注释上下文来说很重要:您可能会评论为什么您的预期输出对于ID Q9Y696没有价值。我觉得应该符合你的标准。
https://stackoverflow.com/questions/68345523
复制相似问题