我有一个巨大的data.frame,它有大约300万行和100列。在其中一列中,有关于ID的信息。我需要创建一个R脚本,它可以用于基于这个ID生成一个新的data.frame。基本上,这个新的data.frame将只包含找到这个ID的行和来自大data.frame的其余列,再加上一些需要根据ID column中包含的信息创建的额外列。最后一个想法是将其实现到一个闪亮的应用程序中,这样用户就可以输入名称ID,然后新的data.frame就可以可视化了。
下面是我的数据的一个例子。这将是我需要按示例拆分的大型data.frame:
Chr Start End Ref Alt Callers GATK_Illumina.counts GATK_Illumina.samples GATK_SOLiD.counts GATK_SOLiD.samples LIFE_SOLiD.counts LIFE_SOLiD.samples TVC_Ion.counts TVC_Ion.samples Func.refGene
1 14653 14653 C T GATK_Illumina 5.38 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
1 14677 14677 G A GATK_Illumina 2.38 1H321(het)16;6;10;VQSRTrancheSNP99.90to100.00|1K210(het)24;18;6;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_ex
1 14815 14815 C T GATK_Illumina 1.38 1H321(het)14;12;2;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=406
1 14825 14825 G A GATK_Illumina 1.38 1H321(het)13;11;2;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=416
1 14907 14907 A G GATK_Illumina 6.38 17J965(het)57;40;17;VQSRTrancheSNP99.90to100.00|17L166(het)26;15;11;VQSRTrancheSNP99.00to99.90|1H321(het)27;14;13;VQSRTrancheSNP99.90to100.00|1K210(het)42;24;18;VQSRTrancheSNP9
1 14930 14930 A G GATK_Illumina 6.38 17J965(het)82;60;22;VQSRTrancheSNP99.90to100.00|17L166(het)38;23;15;VQSRTrancheSNP99.00to99.90|1H321(het)31;17;14;VQSRTrancheSNP99.00to99.90|1K210(het)52;28;24;VQSRTrancheSNP99
1 14933 14933 G A GATK_Illumina 2.38 17J965(het)88;76;12;VQSRTrancheSNP99.90to100.00|5G540B(het)77;57;20;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_in
1 14948 14948 G A GATK_Illumina 1.38 5G540B(het)75;63;12;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_intronic;downstream WASH7P;DDX11L1 dist=539
1 14976 14976 G A GATK_Illumina 1.38 5G540B(het)62;50;12;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic;downstream WASH7P;DDX11L1 dist=567
1 15903 15903 - C GATK_Illumina 1.38 1K210(hom)2;0;2;VQSRTrancheINDEL99.00to99.90 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_exonic WASH7P ncRNA_ex
1 16495 16495 G C GATK_Illumina 5.38 17L166(het)80;57;23;VQSRTrancheSNP99.90to100.00|1H321(het)48;21;27;VQSRTrancheSNP99.90to100.00|1K210(het)59;38;21;VQSRTrancheSNP99.90to100.00|5G540B(het)95;77;18;VQSRTrancheSNP
1 16497 16497 A G GATK_Illumina 5.38 17J965(het)54;37;17;VQSRTrancheSNP99.90to100.00|17L166(het)74;61;13;VQSRTrancheSNP99.90to100.00|1K210(het)48;39;9;VQSRTrancheSNP99.90to100.00|5G540B(het)86;66;20;VQSRTrancheSNP
1 16534 16534 C T GATK_Illumina 5.38 17J965(het)19;12;7;VQSRTrancheSNP99.90to100.00|17L166(het)10;4;6;VQSRTrancheSNP99.90to100.00|1K210(het)8;4;4;VQSRTrancheSNP99.90to100.00|5G540B(het)14;7;7;VQSRTrancheSNP99.90to
1 16571 16571 G A GATK_Illumina 6.38 17J965(het)55;31;24;VQSRTrancheSNP99.90to100.00|17L166(het)47;16;31;VQSRTrancheSNP99.00to99.90|1H321(het)49;30;19;VQSRTrancheSNP99.90to100.00|1K210(het)52;18;34;VQSRTrancheSNP9
1 16580 16580 C G GATK_Illumina 1.38 6K141(het)43;36;7;VQSRTrancheSNP99.90to100.00 0.125 noSample 0.125 noSample 0.430 noSample ncRNA_intronic;downstream WASH7P;MIR6859-1;MIR6859这是我原来的data.frame df.txt?dl=0的一个示例
当用户选择data.frame ID时,这将是新的17J965
Chr Start End Ref Alt Callers GATK_Illumina.counts GATK_Illumina.Zygosity GATK_Illumina.Depth GATK_Illumina.RefCount GATK_Illumina.AltCount GATK_Illumina.Filter GATK_SOLiD.counts GATK_SOLiD.Zygosity
1 14653 14653 C T GATK_Illumina 5.38 het 23 19 4 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14907 14907 A G GATK_Illumina 6.38 het 57 40 17 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14930 14930 A G GATK_Illumina 6.38 het 82 60 22 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 14933 14933 G A GATK_Illumina 2.38 het 88 76 12 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16497 16497 A G GATK_Illumina 5.38 het 54 37 17 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16534 16534 C T GATK_Illumina 5.38 het 19 12 7 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - -
1 16571 16571 G A GATK_Illumina 6.38 het 55 31 24 VQSRTrancheSNP99 0.125 - - - - - 0.125 - - - - - 0.430 - - 下面是指向结果ID dataframe,https://www.dropbox.com/s/2nfjud7xkb3b6mc/17J965.txt?dl=0的链接
编辑1
我有几个问题我不知道如何解决:
1)如何识别身份?ID总是链接到Callers列,也就是说,如果调用者是GATK_Illumina,那么ID将在GATK_Illumina.samples中,如果Callers列是GATK_Illumina,GATK_SOLID,那么ID可以在GATK_Illumina.samples和GATK_SOLID.samples两列中。这变得更加复杂,从大data.frame中可以看到,因为从字母数字ID中可以看到另一个值:示例的格式总是相同的-- alphanumeric code --然后是bracket、hom,或者het,another bracket,然后是3 values separated by a semicolon,character vector,然后是pipe (如果该行有其他ID信息)。17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99。不同的ID信息由|分隔,格式总是相同的。在本例中,只有一个包含ID信息的列,因为Caller列只有值GATK_Illumina,但这可能会因三个不同的调用方值而变得复杂。
2)标识了属于该ID的行之后,如何将信息放在一起?它只是从大型data.frame获得一个grep,然后再重新绑定?或基于ID的子集
3)在子ID dataframe中,需要根据ID列和Caller列创建一些列,例如:
# ID column for row 1 only for value of `Caller` column `GATK_Illumina`:
17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99如果我想为ID data.frame创建新的17J965,那么要创建的新列(如前面的示例所示)将是:
GATK_Illumina.Zygosity,GATK_Illumina.Depth,GATK_Illumina.RefCount,GATK_Illumina.AltCount,GATK_Illumina.Filter,GATK_SOLiD.Zygosity,GATK_SOLiD.Depth,GATK_SOLiD.RefCount,GATK_SOLiD.AltCount,GATK_SOLiD.Filter
A ID后的值将填充以下列:
GATK_Illumina.Zygosity GATK_Illumina.Depth GATK_Illumina.RefCount GATK_Illumina.AltCount GATK_Illumina.Filter GATK_SOLiD.Zygosity GATK_SOLiD.Depth GATK_SOLiD.RefCount GATK_SOLiD.AltCount GATK_SOLiD.Filter
het 23 19 4 VQSRTrancheSNP99 - - - - -注意,这些列是根据Callers列填充的,在本例中,Callers列是GATK_Illumina,然后只有使用GATK_Illumina重新创建的列才会被填充,对于其余的列,应该使用-或NA值。
到目前为止,我得到的是:
# Let's suppose that I want the ID: 17J965 and the big data.frame is call `big_df.txt`
big_df <- read.delim("big_df.txt")
sample <- grep("17J965", test2a$GATK_Illumina.samples)
df_sample <- big_df[sample,]
# df_sample has all the rows containing 17J965, but now I want to create the new data.frame with the extra columns and only select the correct values if there are more IDs 如何从列ID中提取所有相关信息
# I know that I can get the ID using this command
samples <- sub("\\(.*", "", b)但是如果我的兴趣在第二个管道上呢?
谢谢
发布于 2018-12-26 20:09:16
希望你已经找到了一个解决方案,如果不是这样的话,也许会有所帮助。
library(stringr)
library(microbenchmark)
library(tidyr)
library(dplyr)
df <- read.delim("big_df.txt", header=T,sep="\t")
df$GATK_Illumina.samples <- as.character(df$GATK_Illumina.samples)
# Recommend benchmarking to find fast functions for such large dataset
microbenchmark(df[grep("17J965", df$GATK_Illumina.samples),])
microbenchmark(dplyr::filter(df ,grepl("17J965", GATK_Illumina.samples)))
# Subset data
sample <- df[grep("17J965", df$GATK_Illumina.samples),]
# Split the column on "|" and extract the ID needed
sample$umm <- strsplit(sample$GATK_Illumina.samples,"\\|")
sample$umm <- sapply(sample$umm, function(x){x[grep("17J965",x)]})
# Removing brackets and ID to split to makes things easier to split.
# Probably should remove brackets from main df instead for each subset
sample$umm <- sub("17J965|\\(", "",sample$umm)
sample$umm <- sub(")",";",sample$umm)
# Split into columns, filter on the Caller and fill respective columns
sample <- sample %>% dplyr::filter(Callers == "GATK_Illumina") %>% separate(umm,sep=";",into = c("GATK_Illumina.Zygosity", "GATK_Illumina.Depth", "GATK_Illumina.RefCount", "GATK_Illumina.AltCount", "GATK_Illumina.Filter",fill= "left"))
sample <- sample %>% dplyr::filter(Callers == "GATK_SOLiD") %>% separate(umm,sep=";",into = c("GATK_SOLiD.Zygosity", "GATK_SOLiD.Depth", "GATK_SOLiD.RefCount", "GATK_SOLiD.AltCount", "GATK_SOLiD.Filter",fill= "left"))一些资源:
https://stackoverflow.com/questions/53887490
复制相似问题