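I have two data frames: one (df_protein) contains experimental measurements from protein fragments that carry modifications, and the other (df_modificaton) is a database of the "names" of all modifications. I am now trying to merge the two.
Both have a column with the modified sequence (the modified amino acid is followed by an asterisk). However, df_protein stores the sequence of the whole fragment (delimited by "_" at the start and end), whereas df_modificaton gives only the 7 amino acids before and after the modification (if the modification is near the start or end of the protein, the missing positions are filled with "_").
To illustrate, here is an MWE: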
library(tidyverse)

# data_frame() is deprecated; tibble() is its replacement
df_protein <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250)
)
df_modificaton <- tibble(
  Protein = c("A", "A", "A", "B", "B", "B"),
  Sequence = c("TIPEQRLS*SSSLLAS", "PSIASDIY*LPIATQ", "PEQRLSSS*SLLASPG", "DPVPPET*PSDSDHK", "FYYEILNS*PEKACSL", "_____SMS*VDLSHIP"),
  Modification = c("S125", "Y77", "S127", "T456", "S44", "S3")
)
# How can I merge the above into the following result?
df_merged <- tibble(
  Protein = c("A", "A", "A", "B", "B"),
  Sequence = c("_EPTPSIASDIY*LPIATQELR_", "_S*SSSLLASPGHISVK_", "_SSS*SLLASPGHISVK_", "_TQDPVPPET*PSDSDHK_", "_SMS*VDLSHIPLK_"),
  Counts = c(3.456, 6.126, 10.023, 0.000, 7.250),
  Modification = c("Y77", "S125", "S127", "T456", "S3")
)
I am using the tidyverse, but I am open to other packages. Thank you.
Posted on 2021-01-21 14:56:41
One approach is to use the fuzzyjoin package to perform a stringdist join:
library(dplyr)
library(fuzzyjoin)
stringdist_inner_join(df_protein, df_modificaton,
                      by = "Sequence", method = "jw", distance_col = "distance") %>%
  group_by(Sequence.x) %>%
  slice_min(distance)
# A tibble: 5 x 7
# Groups:   Sequence.x [5]
  Protein.x Sequence.x              Counts Protein.y Sequence.y       Modification distance
  <chr>     <chr>                    <dbl> <chr>     <chr>            <chr>           <dbl>
1 A         _EPTPSIASDIY*LPIATQELR_   3.46 A         PSIASDIY*LPIATQ  Y77             0.260
2 A         _S*SSSLLASPGHISVK_        6.13 A         PEQRLSSS*SLLASPG S127            0.294
3 B         _SMS*VDLSHIPLK_           7.25 B         _____SMS*VDLSHIP S3              0.15
4 A         _SSS*SLLASPGHISVK_       10.0  A         PEQRLSSS*SLLASPG S127            0.294
5 B         _TQDPVPPET*PSDSDHK_       0    B         DPVPPET*PSDSDHK  T456            0.137
https://stackoverflow.com/questions/65829768
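Note that the Jaro-Winkler distance is only a similarity heuristic: in the output of the fuzzy join, `_S*SSSLLASPGHISVK_` is paired with S127, whereas the question expects S125. A possible alternative, sketched here under stated assumptions rather than offered as a definitive solution, is to align the two sequences at the asterisk and require exact agreement on the overlapping residues within the same protein. The helpers `split_at_star`, `shares_start`, `shares_end`, and `merge_by_star` are hypothetical names introduced for illustration, and the sketch assumes exactly one `*` per sequence.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: drop the "_" padding and split each sequence at the "*".
# Assumes every sequence contains exactly one "*".
split_at_star <- function(df) {
  parts <- str_split_fixed(str_remove_all(df$Sequence, "_"), fixed("*"), 2)
  mutate(df, before = parts[, 1], after = parts[, 2])
}

# TRUE where a and b agree over their shared length, counted from the start.
shares_start <- function(a, b) {
  n <- pmin(nchar(a), nchar(b))
  substr(a, 1, n) == substr(b, 1, n)
}

# TRUE where a and b agree over their shared length, counted from the end.
shares_end <- function(a, b) {
  n <- pmin(nchar(a), nchar(b))
  substr(a, nchar(a) - n + 1, nchar(a)) == substr(b, nchar(b) - n + 1, nchar(b))
}

# Pair every fragment with every window of the same protein, then keep the
# pairs whose residues agree on both sides of the "*".
# dplyr >= 1.1.0 warns about a many-to-many join here; that is expected.
merge_by_star <- function(fragments, windows) {
  inner_join(split_at_star(fragments), split_at_star(windows),
             by = "Protein", suffix = c("", ".mod")) %>%
    filter(shares_end(before, before.mod), shares_start(after, after.mod)) %>%
    select(Protein, Sequence, Counts, Modification)
}
```

Calling `merge_by_star(df_protein, df_modificaton)` on the data from the question should return one row per fragment with the columns of `df_merged`. If a fragment could match several windows of the same protein, or carries more than one `*`, the sketch would need extra handling.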