df1 <-
Gene            GeneLocus
CPA1|1357       chr7:130020290-130027948:+
GUCY2D|3000     chr17:7905988-7923658:+
UBC|7316        chr12:125396194-125399577:-
C11orf95|65998  chr11:63527365-63536113:-
ANKMY2|57037    chr7:16639413-16685398:-

Expected output:
df2 <-
Gene.1    Gene.2  chr  start      end
CPA1      1357    7    130020290  130027948
GUCY2D    3000    17   7905988    7923658
UBC       7316    12   125396194  125399577
C11orf95  65998   11   63527365   63536113
ANKMY2    57037   7    16639413   16685398

I tried this:
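install.packages("splitstackshape")
library(splitstackshape)
# Split "Gene" on "|", then "GeneLocus" on ":", then the start-end range on "-"
df1 <- cSplit(df1, "Gene", sep = "|", direction = "wide", fixed = TRUE)
df1 <- cSplit(df1, "GeneLocus", sep = ":", direction = "wide", fixed = TRUE)
df1 <- cSplit(df1, "GeneLocus_2", sep = "-", direction = "wide", fixed = TRUE)
df1 <- data.frame(df1)
# Strip the "chr" prefix from the chromosome column
df1$GeneLocus_1 <- gsub("chr", "", df1$GeneLocus_1)

I would like to know whether there is a simpler way.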
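The question does not show how df1 was created, so for anyone who wants to reproduce the thread, here is a minimal sketch of the input. It assumes both columns are plain character columns of a data.frame; the alias mydf is added only because one of the answers refers to the data by that name.

```r
# Assumed construction of the question's input (not shown in the original post).
# stringsAsFactors = FALSE keeps both columns as character vectors.
df1 <- data.frame(
  Gene = c("CPA1|1357", "GUCY2D|3000", "UBC|7316",
           "C11orf95|65998", "ANKMY2|57037"),
  GeneLocus = c("chr7:130020290-130027948:+",
                "chr17:7905988-7923658:+",
                "chr12:125396194-125399577:-",
                "chr11:63527365-63536113:-",
                "chr7:16639413-16685398:-"),
  stringsAsFactors = FALSE
)

# Assumption: the name mydf refers to the same data.
mydf <- df1

str(df1)
```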
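Answered 2015-09-22 13:49:51

Here you go. Just ignore the warning, which does not affect the output; the call does, however, have the side effect of dropping the strand information (:+ or :-).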
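Why the strand disappears can be seen without tidyr. As far as I know, separate() splits by default on runs of non-alphanumeric characters (sep = "[^[:alnum:]]+"), so ":+" and ":-" are consumed as separators rather than kept as values. The base R sketch below imitates that split with strsplit() and then recovers the strand with a separate regular expression; the variable names locus, pieces and strand are made up for illustration.

```r
locus <- c("chr7:130020290-130027948:+", "chr12:125396194-125399577:-")

# Splitting on runs of non-alphanumeric characters treats ":", "-" and the
# trailing ":+" / ":-" as separators, so no piece contains the strand symbol.
pieces <- strsplit(locus, "[^[:alnum:]]+")
pieces[[1]]
# "chr7" "130020290" "130027948"

# Capturing the final character explicitly keeps the strand.
strand <- sub("^.*:([+-])$", "\\1", locus)
strand
# "+" "-"
```

If the strand matters, it can be extracted this way before the locus column is split.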
library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2")) %>%
  separate(GeneLocus, c("chr", "start", "end")) %>%
  mutate(chr = sub("chr", "", chr))

Output:
    Gene.1 Gene.2 chr     start       end
1     CPA1   1357   7 130020290 130027948
2   GUCY2D   3000  17   7905988   7923658
3      UBC   7316  12 125396194 125399577
4 C11orf95  65998  11  63527365  63536113
5   ANKMY2  57037   7  16639413  16685398

Answered 2016-03-18 16:24:53
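I would suggest the following approach.

cSplit "balances" the columns being split. The first column yields only 2 columns when split, while the second yields 4, so columns 3 and 4 need to be dropped from the result.

library(splitstackshape)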
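library(data.table)  # provides as.data.table()
# mydf is assumed to hold the question's data (df1 there)
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
# Rewrite GeneLocus as "chr|start|end|strand" so both columns split on "|",
# then drop the two empty padding columns (positions 3 and 4)
cSplit(as.data.table(mydf)[, GeneLocus := gsub(
  GLPat, "\\1|\\2|\\3|\\4", GeneLocus)], names(mydf), "|")[
    , (3:4) := NULL][]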
#      Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1:     CPA1   1357           7   130020290   130027948           +
# 2:   GUCY2D   3000          17     7905988     7923658           +
# 3:      UBC   7316          12   125396194   125399577           -
# 4: C11orf95  65998          11    63527365    63536113           -
# 5:   ANKMY2  57037           7    16639413    16685398           -

Alternatively, you can try col_flatten from my "SOfun" package, with which you can do:
library(SOfun)
Pat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
Fun <- function(invec) strsplit(gsub(Pat, "\\1|\\2|\\3|\\4", invec), "|", TRUE)
col_flatten(as.data.table(mydf)[, lapply(.SD, Fun)], names(mydf), drop = TRUE)
#      Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1:     CPA1   1357           7   130020290   130027948           +
# 2:   GUCY2D   3000          17     7905988     7923658           +
# 3:      UBC   7316          12   125396194   125399577           -
# 4: C11orf95  65998          11    63527365    63536113           -
# 5:   ANKMY2  57037           7    16639413    16685398           -

SOfun is only on GitHub, so you can install it with:
source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/SOfun")

Source: https://stackoverflow.com/questions/32718419