I have the following data:
> sampledput
V1 V2 V3
1 GSM1010983 adipose Bisulfite-Seq
2 GSM1120330 adipose Bisulfite-Seq
3 GSM1120331 adipose Bisulfite-Seq
4 GSM1282348 adipose Bisulfite-Seq
5 GSM1282357 adipose Bisulfite-Seq
6 GSM906416 adipose ChIP-Seq input
7 GSM906394 adipose H3K27ac
8 GSM1010958 adipose mRNA-Seq
9 GSM1120304 adipose mRNA-Seq
10 GSM1120305 adipose mRNA-Seq
11 GSM621443 adipose derived mesenchymal stem cells ChIP-Seq input
12 GSM621420 adipose derived mesenchymal stem cells H3K27me3
13 GSM621446 adipose derived mesenchymal stem cells H3K36me3
14 GSM621418 adipose derived mesenchymal stem cells H3K4me1
15 GSM621458 adipose derived mesenchymal stem cells H3K4me3
16 GSM670020 adipose derived mesenchymal stem cells H3K9ac
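17 GSM621398 adipose derived mesenchymal stem cells H3K9me3
I want to keep the rows where the value in column V2 stays the same (for example, adipose) and the values in column V3 include Bisulfite-Seq, H3K27ac, ChIP-Seq input, and mRNA-Seq. If V3 contains duplicate values, take only one of those rows; as you can see, I picked just one row each for mRNA-Seq and Bisulfite-Seq. So in this example I would get the following output: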
5 GSM1282357 adipose Bisulfite-Seq
6 GSM906416 adipose ChIP-Seq input
7 GSM906394 adipose H3K27ac
8 GSM1010958 adipose mRNA-Seq
Here is the dput:
structure(list(V1 = structure(c(2L, 5L, 6L, 7L, 8L, 17L, 16L,
1L, 3L, 4L, 12L, 11L, 13L, 10L, 14L, 15L, 9L), .Label = c("GSM1010958",
"GSM1010983", "GSM1120304", "GSM1120305", "GSM1120330", "GSM1120331",
"GSM1282348", "GSM1282357", "GSM621398", "GSM621418", "GSM621420",
"GSM621443", "GSM621446", "GSM621458", "GSM670020", "GSM906394",
"GSM906416"), class = "factor"), V2 = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("adipose",
"adipose derived mesenchymal stem cells"), class = "factor"),
V3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 10L, 10L, 10L,
2L, 4L, 5L, 6L, 7L, 8L, 9L), .Label = c("Bisulfite-Seq",
"ChIP-Seq input", "H3K27ac", "H3K27me3", "H3K36me3", "H3K4me1",
"H3K4me3", "H3K9ac", "H3K9me3", "mRNA-Seq"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -17L))
Posted on 2016-06-16 15:42:17
Edit: a "better" solution
I actually prefer this code, because I think its logic is easier to follow:
library(dplyr)
sampledput %>% group_by(V2) %>%
filter(all(c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq") %in% V3)) %>%
distinct(V2, V3, .keep_all = TRUE)  # .keep_all is needed in dplyr >= 0.5.0 to retain V1
Source: local data frame [4 x 3]
Groups: V2 [1]
V1 V2 V3
(fctr) (fctr) (fctr)
1 GSM1010983 adipose Bisulfite-Seq
2 GSM906416 adipose ChIP-Seq input
3 GSM906394 adipose H3K27ac
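4 GSM1010958 adipose mRNA-Seq
This tests whether all of the required V3 values are present within each value of V2. It then still filters out any duplicates.
Original solution
A dplyr solution: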
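To make the group-level test concrete, here is a minimal base R sketch of the same idea. It is not the code above, just an illustration: the data frame is retyped by hand from the question, `required`, `has_all` and `result` are names I made up, and unlike the dplyr version it also restricts V3 to the required assays before dropping duplicates.

```r
# Data retyped from the question (character columns instead of factors).
sampledput <- data.frame(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305",
         "GSM621443", "GSM621420", "GSM621446", "GSM621418", "GSM621458",
         "GSM670020", "GSM621398"),
  V2 = rep(c("adipose", "adipose derived mesenchymal stem cells"),
           times = c(10, 7)),
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3), "ChIP-Seq input", "H3K27me3", "H3K36me3",
         "H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3"),
  stringsAsFactors = FALSE
)

required <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")

# One TRUE/FALSE per V2 group: does the group contain every required assay?
has_all <- vapply(split(sampledput$V3, sampledput$V2),
                  function(v) all(required %in% v),
                  logical(1))

# Keep rows from qualifying groups, limited to the required assays.
keep <- sampledput$V2 %in% names(has_all)[has_all] &
  sampledput$V3 %in% required
kept <- sampledput[keep, ]

# Drop repeated (V2, V3) pairs; the first occurrence survives.
result <- kept[!duplicated(kept[c("V2", "V3")]), ]
print(result)
```

On this data only the adipose group passes the test, since the mesenchymal stem cell group has ChIP-Seq input but none of the other three assays.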
library(dplyr)
sampledput %>% group_by(V2) %>%
filter(V3 %in% c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq")) %>%
distinct(V2, V3, .keep_all = TRUE) %>%  # .keep_all retains V1 in dplyr >= 0.5.0
filter(length(unique(V3)) == 4)
Source: local data frame [4 x 3]
Groups: V2 [2]
V1 V2 V3
(fctr) (fctr) (fctr)
1 GSM1010983 adipose Bisulfite-Seq
2 GSM906416 adipose ChIP-Seq input
3 GSM906394 adipose H3K27ac
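4 GSM1010958 adipose mRNA-Seq
Note, however, that distinct(V2,V3) keeps the first occurrence of each duplicate. Your desired output lists GSM1282357, whereas my solution returns GSM1010983. I don't know whether that matters to you.
You will have to test whether this generalizes to your whole dataset, but it does produce the desired output.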
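If the choice between first and last occurrence matters, base R's `duplicated()` has a `fromLast` argument that flips it. A small sketch, assuming only the ten adipose rows retyped from the question; `adipose`, `first_rows` and `last_rows` are illustrative names:

```r
# The ten adipose rows from the question, retyped by hand.
adipose <- data.frame(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305"),
  V2 = "adipose",
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3)),
  stringsAsFactors = FALSE
)

key <- adipose[c("V2", "V3")]

# Default: the first row of each (V2, V3) pair is kept.
first_rows <- adipose[!duplicated(key), ]

# fromLast = TRUE scans from the bottom, so the last row of each pair is kept.
last_rows <- adipose[!duplicated(key, fromLast = TRUE), ]

print(first_rows$V1)
print(last_rows$V1)
```

Neither variant reproduces the desired output in the question exactly: that output takes the last Bisulfite-Seq row (GSM1282357) but the first mRNA-Seq row (GSM1010958), so it seems any single row per assay is acceptable.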
Posted on 2016-06-16 15:50:49
Maybe a bit too simple, but...
library(dplyr)
result <- sampledput %>% group_by(V2, V3) %>% summarise(V1 = V1[length(V1)])
This returns the last GSM of each group, as in your ideal output.
Posted on 2016-06-16 17:18:46
We can also use data.table:
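For comparison, a rough base R counterpart of "last V1 per (V2, V3) group" using `aggregate()`. This is only a sketch on data retyped from the question, and the names `last_per_group`, `required` and `wanted` are mine. It also shows that this kind of grouping returns every group, so the adipose rows still have to be picked out afterwards.

```r
# Data retyped from the question (character columns instead of factors).
sampledput <- data.frame(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305",
         "GSM621443", "GSM621420", "GSM621446", "GSM621418", "GSM621458",
         "GSM670020", "GSM621398"),
  V2 = rep(c("adipose", "adipose derived mesenchymal stem cells"),
           times = c(10, 7)),
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3), "ChIP-Seq input", "H3K27me3", "H3K36me3",
         "H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3"),
  stringsAsFactors = FALSE
)

# Last V1 within every (V2, V3) combination, for all tissues.
last_per_group <- aggregate(sampledput["V1"],
                            by = sampledput[c("V2", "V3")],
                            FUN = function(x) x[length(x)])

# Narrow down to the tissue and assays asked about in the question.
required <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")
wanted <- last_per_group[last_per_group$V2 == "adipose" &
                           last_per_group$V3 %in% required, ]
print(wanted)
```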
library(data.table)
setDT(sampledput)[, .(V1 = last(V1)), .(V2, V3)]
https://stackoverflow.com/questions/37863131