文章/答案/技术大牛

发布

社区首页 >问答首页 >根据另一个data.frame中的值创建新的data.frame

问根据另一个data.frame中的值创建新的data.frame
EN

Stack Overflow用户

提问于 2018-12-21 15:44:42

回答 1查看 3K关注 0票数 1

我有一个巨大的data.frame，它有大约300万行和100列。在其中一列中，有关于ID的信息。我需要创建一个R脚本，它可以用于基于这个ID生成一个新的data.frame。基本上，这个新的data.frame将只包含找到这个ID的行和来自大data.frame的其余列，再加上一些需要根据ID column中包含的信息创建的额外列。最后一个想法是将其实现到一个闪亮的应用程序中，这样用户就可以输入名称ID，然后新的data.frame就可以可视化了。

下面是我的数据的一个例子。这将是我需要按示例拆分的大型data.frame：

Chr     Start   End     Ref     Alt     Callers GATK_Illumina.counts    GATK_Illumina.samples   GATK_SOLiD.counts       GATK_SOLiD.samples      LIFE_SOLiD.counts       LIFE_SOLiD.samples      TVC_Ion.counts  TVC_Ion.samples Func.refGene    
1       14653   14653   C       T       GATK_Illumina   5.38    17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99
1       14677   14677   G       A       GATK_Illumina   2.38    1H321(het)16;6;10;VQSRTrancheSNP99.90to100.00|1K210(het)24;18;6;VQSRTrancheSNP99.90to100.00     0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_ex
1       14815   14815   C       T       GATK_Illumina   1.38    1H321(het)14;12;2;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=406        
1       14825   14825   G       A       GATK_Illumina   1.38    1H321(het)13;11;2;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=416        
1       14907   14907   A       G       GATK_Illumina   6.38    17J965(het)57;40;17;VQSRTrancheSNP99.90to100.00|17L166(het)26;15;11;VQSRTrancheSNP99.00to99.90|1H321(het)27;14;13;VQSRTrancheSNP99.90to100.00|1K210(het)42;24;18;VQSRTrancheSNP9
1       14930   14930   A       G       GATK_Illumina   6.38    17J965(het)82;60;22;VQSRTrancheSNP99.90to100.00|17L166(het)38;23;15;VQSRTrancheSNP99.00to99.90|1H321(het)31;17;14;VQSRTrancheSNP99.00to99.90|1K210(het)52;28;24;VQSRTrancheSNP99
1       14933   14933   G       A       GATK_Illumina   2.38    17J965(het)88;76;12;VQSRTrancheSNP99.90to100.00|5G540B(het)77;57;20;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_in
1       14948   14948   G       A       GATK_Illumina   1.38    5G540B(het)75;63;12;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_intronic;downstream       WASH7P;DDX11L1  dist=539
1       14976   14976   G       A       GATK_Illumina   1.38    5G540B(het)62;50;12;VQSRTrancheSNP99.90to100.00 0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic;downstream WASH7P;DDX11L1  dist=567        
1       15903   15903   -       C       GATK_Illumina   1.38    1K210(hom)2;0;2;VQSRTrancheINDEL99.00to99.90    0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_exonic    WASH7P                          ncRNA_ex
1       16495   16495   G       C       GATK_Illumina   5.38    17L166(het)80;57;23;VQSRTrancheSNP99.90to100.00|1H321(het)48;21;27;VQSRTrancheSNP99.90to100.00|1K210(het)59;38;21;VQSRTrancheSNP99.90to100.00|5G540B(het)95;77;18;VQSRTrancheSNP
1       16497   16497   A       G       GATK_Illumina   5.38    17J965(het)54;37;17;VQSRTrancheSNP99.90to100.00|17L166(het)74;61;13;VQSRTrancheSNP99.90to100.00|1K210(het)48;39;9;VQSRTrancheSNP99.90to100.00|5G540B(het)86;66;20;VQSRTrancheSNP
1       16534   16534   C       T       GATK_Illumina   5.38    17J965(het)19;12;7;VQSRTrancheSNP99.90to100.00|17L166(het)10;4;6;VQSRTrancheSNP99.90to100.00|1K210(het)8;4;4;VQSRTrancheSNP99.90to100.00|5G540B(het)14;7;7;VQSRTrancheSNP99.90to
1       16571   16571   G       A       GATK_Illumina   6.38    17J965(het)55;31;24;VQSRTrancheSNP99.90to100.00|17L166(het)47;16;31;VQSRTrancheSNP99.00to99.90|1H321(het)49;30;19;VQSRTrancheSNP99.90to100.00|1K210(het)52;18;34;VQSRTrancheSNP9
1       16580   16580   C       G       GATK_Illumina   1.38    6K141(het)43;36;7;VQSRTrancheSNP99.90to100.00   0.125   noSample        0.125   noSample        0.430   noSample        ncRNA_intronic;downstream       WASH7P;MIR6859-1;MIR6859

这是我原来的data.frame df.txt?dl=0的一个示例

当用户选择data.frame ID时，这将是新的17J965

Chr     Start   End     Ref     Alt     Callers GATK_Illumina.counts    GATK_Illumina.Zygosity  GATK_Illumina.Depth     GATK_Illumina.RefCount  GATK_Illumina.AltCount  GATK_Illumina.Filter    GATK_SOLiD.counts       GATK_SOLiD.Zygosity     
1       14653   14653   C       T       GATK_Illumina   5.38    het     23      19      4       VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14907   14907   A       G       GATK_Illumina   6.38    het     57      40      17      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14930   14930   A       G       GATK_Illumina   6.38    het     82      60      22      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       14933   14933   G       A       GATK_Illumina   2.38    het     88      76      12      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16497   16497   A       G       GATK_Illumina   5.38    het     54      37      17      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16534   16534   C       T       GATK_Illumina   5.38    het     19      12      7       VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -       
1       16571   16571   G       A       GATK_Illumina   6.38    het     55      31      24      VQSRTrancheSNP99        0.125   -       -       -       -       -       0.125   -       -       -       -       -       0.430   -       -

下面是指向结果ID dataframe，https://www.dropbox.com/s/2nfjud7xkb3b6mc/17J965.txt?dl=0的链接

编辑1

我有几个问题我不知道如何解决：

1)如何识别身份？ID总是链接到Callers列，也就是说，如果调用者是GATK_Illumina，那么ID将在GATK_Illumina.samples中，如果Callers列是GATK_Illumina,GATK_SOLID，那么ID可以在GATK_Illumina.samples和GATK_SOLID.samples两列中。这变得更加复杂，从大data.frame中可以看到，因为从字母数字ID中可以看到另一个值:示例的格式总是相同的-- alphanumeric code --然后是bracket、hom，或者het，another bracket，然后是3 values separated by a semicolon，character vector，然后是pipe (如果该行有其他ID信息)。17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99。不同的ID信息由|分隔，格式总是相同的。在本例中，只有一个包含ID信息的列，因为Caller列只有值GATK_Illumina，但这可能会因三个不同的调用方值而变得复杂。

2)标识了属于该ID的行之后，如何将信息放在一起？它只是从大型data.frame获得一个grep，然后再重新绑定？或基于ID的子集

3)在子ID dataframe中，需要根据ID列和Caller列创建一些列，例如：

 # ID column for row 1 only for value of `Caller` column `GATK_Illumina`:
 17J965(het)23;19;4;VQSRTrancheSNP99.90to100.00|17L166(het)10;5;5;VQSRTrancheSNP99.00to99.90|1H321(het)7;4;3;VQSRTrancheSNP99.90to100.00|1K210(het)15;9;6;VQSRTrancheSNP99.00to99

如果我想为ID data.frame创建新的17J965，那么要创建的新列(如前面的示例所示)将是：

GATK_Illumina.Zygosity，GATK_Illumina.Depth，GATK_Illumina.RefCount，GATK_Illumina.AltCount，GATK_Illumina.Filter，GATK_SOLiD.Zygosity，GATK_SOLiD.Depth，GATK_SOLiD.RefCount，GATK_SOLiD.AltCount，GATK_SOLiD.Filter

A ID后的值将填充以下列：

GATK_Illumina.Zygosity  GATK_Illumina.Depth     GATK_Illumina.RefCount  GATK_Illumina.AltCount  GATK_Illumina.Filter     GATK_SOLiD.Zygosity GATK_SOLiD.Depth     GATK_SOLiD.RefCount    GATK_SOLiD.AltCount     GATK_SOLiD.Filter 
het     23      19      4       VQSRTrancheSNP99     -     -     -     -     -

注意，这些列是根据Callers列填充的，在本例中，Callers列是GATK_Illumina，然后只有使用GATK_Illumina重新创建的列才会被填充，对于其余的列，应该使用-或NA值。

到目前为止，我得到的是：

# Let's suppose that I want the ID: 17J965 and the big data.frame is call `big_df.txt`

 big_df <- read.delim("big_df.txt")
 sample <- grep("17J965", test2a$GATK_Illumina.samples)
 df_sample <- big_df[sample,]

 # df_sample has all the rows containing 17J965, but now I want to create the new data.frame with the extra columns and only select the correct values if there are more IDs

如何从列ID中提取所有相关信息

# I know that I can get the ID using this command 
samples <- sub("\\(.*", "", b)

但是如果我的兴趣在第二个管道上呢？

谢谢

dataframe

回答 1

Stack Overflow用户

发布于 2018-12-26 20:09:16

希望你已经找到了一个解决方案，如果不是这样的话，也许会有所帮助。

library(stringr)
library(microbenchmark)
library(tidyr)
library(dplyr)


df <- read.delim("big_df.txt", header=T,sep="\t")
df$GATK_Illumina.samples <- as.character(df$GATK_Illumina.samples)

# Recommend benchmarking to find fast functions for such large dataset
microbenchmark(df[grep("17J965", df$GATK_Illumina.samples),])
microbenchmark(dplyr::filter(df ,grepl("17J965", GATK_Illumina.samples)))

# Subset data
sample <- df[grep("17J965", df$GATK_Illumina.samples),] 

# Split the column on "|" and extract the ID needed
sample$umm <- strsplit(sample$GATK_Illumina.samples,"\\|")
sample$umm <- sapply(sample$umm, function(x){x[grep("17J965",x)]})

# Removing brackets and ID to split to makes things easier to split. 
# Probably should remove brackets from main df instead for each subset 
sample$umm <- sub("17J965|\\(", "",sample$umm)
sample$umm <- sub(")",";",sample$umm)

# Split into columns, filter on the Caller and fill respective columns
sample <- sample %>% dplyr::filter(Callers == "GATK_Illumina") %>% separate(umm,sep=";",into = c("GATK_Illumina.Zygosity", "GATK_Illumina.Depth", "GATK_Illumina.RefCount", "GATK_Illumina.AltCount", "GATK_Illumina.Filter",fill= "left"))

sample <- sample %>% dplyr::filter(Callers == "GATK_SOLiD") %>% separate(umm,sep=";",into = c("GATK_SOLiD.Zygosity", "GATK_SOLiD.Depth", "GATK_SOLiD.RefCount",  "GATK_SOLiD.AltCount", "GATK_SOLiD.Filter",fill= "left"))

一些资源：

在R中更快地对数据帧行进行子集？

单独()

使用dplyr筛选包含特定字符串的行。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/53887490

复制

相似问题

问根据另一个data.frame中的值创建新的data.frame
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问根据另一个data.frame中的值创建新的data.frameEN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问根据另一个data.frame中的值创建新的data.frame
EN