I have a data.frame that consists of a single column, as shown below:
E1|
E3|SAMD11
E3|SAMD11
E2|SAMD11
E10|SAMD11
E10|SAMD11
E10|SAMD11
E10|SAMD11
E10|SAMD11
E1|
E2|
E3|
E3|PERM1
E9|AL645608.7;HES4;ISG15
E3|EGFR;HES4;PIK3CA
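E* ranges from 1 to 10. For each gene, I want to count how many entries there are per E value, dropping or ignoring the cases where E*| is followed by nothing. The expected output would be: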
SAMD11: E3: 2
SAMD11: E2: 1
SAMD11: E10: 5
PERM1: E3: 1
HES4: E9: 1
HES4: E3: 1
AL645608.7: E9: 1
ISG15: E9: 1
EGFR: E3: 1
PIK3CA: E3: 1
Can anyone help me?
Posted on 2019-12-15 11:17:20
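library(dplyr)
library(tidyr)
# df is the input data.frame with a single character column named id
# split on | then separate on ;
df %>%
  extract(id, into = c('id', 'gene'), regex = "(.*)\\|(.*)?") %>%
  separate_rows(gene, sep = '\\;') %>%
  filter(gene != "") %>%  # drop entries with nothing after the |
  count(gene, id)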
# A tibble: 10 x 3
   gene       id        n
   <chr>      <chr> <int>
 1 AL645608.7 E9        1
 2 EGFR       E3        1
 3 HES4       E3        1
 4 HES4       E9        1
 5 ISG15      E9        1
 6 PERM1      E3        1
 7 PIK3CA     E3        1
 8 SAMD11     E10       5
 9 SAMD11     E2        1
10 SAMD11     E3        2
Posted on 2019-12-15 11:18:42
You could try data.table:
library(data.table)
setDT(my_data) # convert to data.table
# add two columns, key and val, from the input column col
my_data[ , c('key', 'val') := tstrsplit(col, '|', fixed = TRUE)]
# drop rows with nothing on the RHS of | (tstrsplit fills these with NA)
my_data = my_data[!is.na(val)]
# unnest the ;-separated values, repeating each key once per gene
my_data = my_data[ , {
  l = strsplit(val, ';', fixed = TRUE)
  .(E = rep(key, lengths(l)), val = unlist(l))
}]
# count
my_data[ , .N, keyby = .(E, val)]
Posted on 2019-12-15 16:33:59
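Using base R, read the column into a two-column data.frame with read.table, split the second column with strsplit, turn the result into a two-column data.frame with stack, get the frequencies with table, and convert that to a data.frame.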
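To make those steps easier to follow, here is an unpacked sketch of the same base R pipeline with each intermediate result kept in its own variable. The names parts, genes, long and counts are illustrative and do not come from the original answer, and the sample data is repeated only so that the block runs on its own.

```r
# Sample data, repeated from the question so this block is self-contained
df1 <- data.frame(
  id = c("E1|", "E3|SAMD11", "E3|SAMD11", "E2|SAMD11",
         "E10|SAMD11", "E10|SAMD11", "E10|SAMD11", "E10|SAMD11", "E10|SAMD11",
         "E1|", "E2|", "E3|", "E3|PERM1", "E9|AL645608.7;HES4;ISG15",
         "E3|EGFR;HES4;PIK3CA"),
  stringsAsFactors = FALSE
)

# Step 1: split each entry at "|" into an id column (V1) and a gene string (V2)
parts <- read.table(text = df1$id, header = FALSE, sep = "|",
                    stringsAsFactors = FALSE)

# Step 2: split the gene strings at ";" and label each list element with its id
genes <- strsplit(parts$V2, ";", fixed = TRUE)
names(genes) <- parts$V1

# Step 3: stack the named list into long format, one row per (gene, id) pair;
# entries with an empty gene string contribute no rows
long <- stack(genes)  # columns: values (gene) and ind (id)

# Step 4: cross-tabulate, convert to a data.frame, and keep non-zero counts
counts <- as.data.frame(table(id = long$ind, gene = long$values),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
names(counts)[3] <- "n"
row.names(counts) <- NULL
counts
```

The compact version that follows chains the same calls into a single expression.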
d1 <- read.table(text = df1$id, header = FALSE, sep="|", stringsAsFactors = FALSE)
out <- subset(as.data.frame(table(stack(setNames(strsplit(d1$V2, ";"),
                                                 d1$V1))[2:1])), Freq > 0)
names(out) <- c("id", "gene", "n")
row.names(out) <- NULL
out
#    id       gene n
#1   E9 AL645608.7 1
#2   E3       EGFR 1
#3   E3       HES4 1
#4   E9       HES4 1
#5   E9      ISG15 1
#6   E3      PERM1 1
#7   E3     PIK3CA 1
#8   E3     SAMD11 2
#9   E2     SAMD11 1
#10 E10     SAMD11 5
Data
df1 <- structure(list(id = c("E1|", "E3|SAMD11", "E3|SAMD11", "E2|SAMD11",
"E10|SAMD11", "E10|SAMD11", "E10|SAMD11", "E10|SAMD11", "E10|SAMD11",
"E1|", "E2|", "E3|", "E3|PERM1", "E9|AL645608.7;HES4;ISG15",
"E3|EGFR;HES4;PIK3CA")), class = "data.frame", row.names = c(NA,
-15L))
https://stackoverflow.com/questions/59343305