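I have two data frames. The first lists genetic variants, their identifiers, and their positions on the chromosome. The second is a list of genes, where the columns in each row give the start and stop positions of the gene on the chromosome.
I want to find which genetic variants fall within a gene 'range', as given by the start_20 and stop_20 columns. A variant may fall within more than one gene range. For example, the SNP "rs1" here maps to both gene A and gene B.
This is what I have tried so far:
Data frame of gene ranges: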
chromosome<-c("1", "1", "2")
start_20<-c("1", "1", "5")
stop_20<-c("4", "4", "6")
gene<-c("A", "B", "C")
genelist=data.frame(chromosome, start_20, stop_20, gene,stringsAsFactors=F )
Data frame of SNPs and their positions:
chromosome<-c("1", "2")
snp<-c("rs1", "rs2")
position<-c("3", "5")
snplist=data.frame(chromosome,snp,position,stringsAsFactors=F)
The objective is to match SNPs to genes by base-pair position (i.e. rs1 has position '3', which means it maps to gene A and gene B).
genelist.bychrome <- vector("list", 2)
Gene list split by chromosome:
for(i in 1:2) genelist.bychrome[[i]] <- genelist[genelist[,"chromosome"]==i,]
Empty container of length nrow(snplist); matched genes go in here (if found):
gene.matched <- rep("",nrow(snplist))
gene.matched<-as.list(gene.matched)
#looping across each observation in snplist
for(i in 1:nrow(snplist)){
# snplist[i,"chromosome"] is the chromosome of interest
# Because of consecutive ordering genelist.bychrome[[3]] gives the genelist for chromosome 3
# Therefore, genelist.bychrome[[ snplist[i,"chromosome"] ]] gives the genelist for the chromosome of interest
# VERY IMPORTANT: get.gene gives the index in genelist.bychrome[[ snplist[i,"chromosome"] ]], NOT genelist
if(snplist[i,"chromosome"] <= 1){
get.gene<- which((genelist.bychrome[[ snplist[i,"chromosome"] ]][,"stop_20"] >= snplist[i,"position"]) &
# get matching list element of genelist.bychrome
# in this element collect indices for rows where stop position is greater than or equal to the position of the snp and
# start position is less than or equal to the position of the snp
# this should collect multiple rows for some snps
# dump the gene for this index in the matching element of gene.matched
# i.e get.gene<- which(genelist.bychrome[[1]] [,"stop_20"] >= snplist[1,3]) & (genelist.bychrome[[1]] [,"start_20"] <= snplist[1,3])
# gene.matched <- genelist.bychrome[[1]][get.gene,"gene"]
( genelist.bychrome[[ snplist[i,"chromosome"] ]][,"start_20"] <= snplist[i,"position"])) # correct
if(length(get.gene)!=0) gene.matched[i]<- genelist.bychrome[[ snplist[i,"chromosome"] ]][get.gene,"gene"]
}
} # end for()
#bind the matched genes to the snplist
snplist.new <- cbind(snplist,gene.matched)
Any suggestions would be much appreciated! Thanks.
Posted on 2016-04-27 02:33:08
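I believe your problem lies in the which statement inside the for loop. In fact, that line works correctly if you add as.numeric(). Try this: genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]]
In general, though, I would suggest defining numeric vectors as a numeric data type unless you have a reason not to. For example, your chromosome vector can be defined as c(1,1,2) rather than c("1","1","2"). When you do have to work with numeric data stored as strings, use as.numeric() for things like list indexing and comparison operations.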
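A small sketch of why string-typed numbers cause trouble in list indexing and comparisons. The variable names here (chr_cmp, by_chr, and so on) are illustrative and not taken from the original post; by_chr is only a hypothetical stand-in for genelist.bychrome.

```r
# Character strings are compared alphabetically, not numerically.
chr_cmp <- "10" < "9"                          # TRUE: "1" sorts before "9"
num_cmp <- as.numeric("10") < as.numeric("9")  # FALSE: 10 is not less than 9

# An unnamed list indexed with a string is searched by name, not by position.
by_chr  <- list("genes on chr 1", "genes on chr 2")  # hypothetical stand-in
by_name <- by_chr[["1"]]                # NULL: no element is named "1"
by_pos  <- by_chr[[as.numeric("1")]]    # "genes on chr 1"
```

With single-digit toy positions the string comparison happens to give the right answer, which is why the indexing problem is the one that shows up first; with real base-pair positions the comparisons would likely go wrong as well.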
Let me know if it works.
Posted on 2016-04-27 23:54:10
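Update: the problem was solved by removing the first if statement, converting the vectors with as.numeric as suggested, and creating an empty list with no predetermined length at the start (which I suspect may not always be a good idea).
Thanks!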
#make data frames
genelist = data.frame(chromosome=c(1,1,2),start_20=c(1, 1, 5), stop_20=c(4, 4, 6), gene=c("A", "B", "C"), stringsAsFactors=F)
snplist=data.frame(chromosome=c(1,2),snp=c("rs1", "rs2"),position=c(3,5),stringsAsFactors=F)
#objective is to get genes per snp
genelist.bychrome <- vector("list", 2)
for(i in 1:2) genelist.bychrome[[i]] <- genelist[genelist[,"chromosome"]==i,]
gene.matched <- list()
#looping across each observation in snplist
for(i in 1:nrow(snplist)){
get.gene<- which((genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]] [,"stop_20"] >= snplist[i,"position"]) &
(genelist.bychrome[[as.numeric(snplist[i,"chromosome"]) ]] [,"start_20"] <= snplist[i,"position"]))
if(length(get.gene)!=0) gene.matched[[i]]<- genelist.bychrome[[ snplist[i,"chromosome"] ]][get.gene,"gene"]
}
names(gene.matched)=snplist$snp
https://stackoverflow.com/questions/36871392
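As a possible alternative to the explicit loop, the same matching can be sketched with base R's merge() followed by a range filter. This is not the approach used in the thread, only an illustration under the assumption that the position columns are numeric; the intermediate names pairs and hits are made up for this example. Because the result is built from a factor with one level per SNP, a SNP with no matching gene still gets an (empty) entry, which sidesteps the concern about growing a list of unknown length.

```r
# Same toy data as in the thread, with numeric columns.
genelist <- data.frame(chromosome = c(1, 1, 2),
                       start_20   = c(1, 1, 5),
                       stop_20    = c(4, 4, 6),
                       gene       = c("A", "B", "C"),
                       stringsAsFactors = FALSE)
snplist <- data.frame(chromosome = c(1, 2),
                      snp        = c("rs1", "rs2"),
                      position   = c(3, 5),
                      stringsAsFactors = FALSE)

# Pair every SNP with every gene on the same chromosome.
pairs <- merge(snplist, genelist, by = "chromosome")

# Keep only pairs where the SNP lies inside the gene range (inclusive).
hits <- pairs[pairs$position >= pairs$start_20 &
              pairs$position <= pairs$stop_20, ]

# One list element per SNP, named by SNP id; unmatched SNPs get character(0).
gene.matched <- split(hits$gene, factor(hits$snp, levels = snplist$snp))
```

Note that merge() materialises every SNP-gene pair per chromosome before filtering, so for genome-scale inputs a dedicated overlap join (for example foverlaps() in data.table or findOverlaps() in GenomicRanges) is probably a better fit.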