I have two datasets:
df1:
Gene interactors
ACE BRCA, HER2
NOS NA, NA
P53 NA
CDON TGBP
df2:
Gene interactors
AGT NOS, HER2
NPKB CDON
P70 GPC
IK TGBP
I want to flag the genes in df1 that are listed as interactors in df2, and also flag the genes in df1 whose own interactors match any interactor in df2.
Desired output:
Gene  interactors  matched_gene_interactor  matched_interactor_interactor
ACE   BRCA, HER2   FALSE                    TRUE
NOS   NA, NA       TRUE                     FALSE
P53   NA           FALSE                    FALSE
CDON  TGBP         TRUE                     TRUE
#ACE has an interactor (HER2) in both df1 and df2
#NOS matches itself as an interactor in df2
#CDON matches itself as an interactor in df2 and as having an interactor (TGBP) in both df1 and df2
I was able to write code that produces the matched_gene_interactor column:
df1$matched_gene_interactor <- df1$Gene %in% unlist(strsplit(df2$interactors, ", "))
However, I am stuck on the second column, matched_interactor_interactor.
I tried a few things but could not work out how to get from them to the second column, for example:
df1interactors <- unlist(strsplit(df1$interactors, ", "))
df2interactors <- unlist(strsplit(df2$interactors, ", "))
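matched_interactor_interactor <- df1interactors %in% df2interactors
How can I match between the two datasets once the strings have been split and unlisted? I have a biology background, so I don't know where to start.
Example input data: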
df1:
structure(list(Gene = c("ACE", "NOS", "P53", "CDON"), interactors = c("BRCA, HER2",
"NA, NA", NA, "TGBP")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
df2:
structure(list(Gene = c("AGT", "NPKB", "P70", "IK"), interactors = c("NOS, HER2",
"CDON", "GPC", "TGBP")), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))

Posted on 2020-06-25 08:40:24
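You can split the interactors column of df2 on the comma and then, for each row of df1, check whether any of that row's interactors values appears among them.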
temp <- unlist(strsplit(df2$interactors, ', '))
df1$matched_interactor_interactor <- sapply(strsplit(df1$interactors, ', '),
                                            function(x) any(x %in% temp))
df1
# Gene interactors matched_gene_interactor matched_interactor_interactor
#1: ACE BRCA, HER2 FALSE TRUE
#2: NOS NA, NA TRUE FALSE
#3: P53 <NA> FALSE FALSE
#4: CDON TGBP TRUE TRUE
If df2$interactors is not very large, you can also do this by building a regex pattern dynamically, without splitting df1$interactors:
grepl(paste0('\\b', temp, '\\b', collapse = '|'), df1$interactors)
#[1] TRUE FALSE FALSE TRUE

https://stackoverflow.com/questions/62571125
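
For reference, the split-and-match approach can be put together as one self-contained script. This is only a minimal sketch: it rebuilds the example data as plain data.frames rather than data.tables, and `split_interactors` is a hypothetical helper introduced here for readability, not something from the original answer. It also uses `vapply()` in place of `sapply()` so that each row is guaranteed to yield exactly one logical value.

```r
# Rebuild the example data as plain data.frames (the question used data.table).
df1 <- data.frame(
  Gene = c("ACE", "NOS", "P53", "CDON"),
  interactors = c("BRCA, HER2", "NA, NA", NA, "TGBP"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("AGT", "NPKB", "P70", "IK"),
  interactors = c("NOS, HER2", "CDON", "GPC", "TGBP"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not from the original answer):
# split a comma-separated column into one character vector per row.
split_interactors <- function(x) strsplit(x, ",\\s*")

# Every interactor named anywhere in df2, as one flat character vector.
df2_all <- unlist(split_interactors(df2$interactors))

# Column 1: is the gene itself listed as an interactor in df2?
df1$matched_gene_interactor <- df1$Gene %in% df2_all

# Column 2: does any of the row's own interactors appear in df2?
# A missing interactors value stays NA after splitting and never matches.
df1$matched_interactor_interactor <- vapply(
  split_interactors(df1$interactors),
  function(x) any(x %in% df2_all),
  logical(1)
)

print(df1)
```

Note that the literal string "NA, NA" in the NOS row is split into two ordinary strings "NA", which simply fail to match anything in df2, whereas the true missing value in the P53 row is handled by `%in%` returning FALSE.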