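I have a txt file containing target prediction information that I want to parse into a data frame in R. The information in the file is already laid out in the simplest possible form. In the resulting data frame, each line of the file should become one row with only 4 columns, like this:
MicroRNA Transcript Type Energy
miR-981|LQNS02278082.1_33127_3p TRINITY_GG_20135_c0_g1_i5.mrna1 7_A1 -0.70
However, what I have tried in R does not work.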
library(readr)
a <- read_lines("results")
> head(a)
[1] "MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -0.70 Kcal/mol"
[2] "MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -5.77 Kcal/mol"
[3] "MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -1.77 Kcal/mol"
[4] "MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -5.20 Kcal/mol"
[5] "MicroRNA = LQNS02278075.1_32377_3p\t\tTranscript = TRINITY_GG_143691_c0_g1_i3.mrna1 Dir=sense TAG=Acidic phospholipase A2 PA4\t\tType = 7_A1\t\tEnergy = -3.30 Kcal/mol"
[6] "MicroRNA = miR-317|LQNS02000228.1_2413_3p\t\tTranscript = TRINITY_GG_4592_c2_g1_i10.mrna1 Dir=sense TAG=Serine/threonine-protein phosphatase 2A regulatory subunit B'' subunit gamma\t\tType = 7_m8\t\tEnergy = -6.35 Kcal/mol"
dput(head(a,4))
c("MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -0.70 Kcal/mol",
"MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -5.77 Kcal/mol",
"MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -1.77 Kcal/mol",
"MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -5.20 Kcal/mol"
)
library(rex)
re <- rex(
capture(name = "MicroRNA", alpha),
"[",
spaces,
capture(name = "Transcript", alpha),
"[",
spaces,
capture(name = "Type", alpha),
"[",
spaces,
capture(name = "Energy", digits),
"]:")
re_matches(a, re)
MicroRNA Transcript Type Energy
1 <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA>
3 <NA> <NA> <NA> <NA>
Any idea how to do this in R or in the shell? Thanks!
Posted on 2020-12-16 19:12:16
You can try using a regex.
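As a compact illustration of the regex idea, the sketch below pulls all four fields out with a single pattern and `utils::strcapture()`, which ships with recent versions of base R. It assumes every line follows exactly the layout shown in the question (fields in the same order, separated by tabs); the names `lines`, `pattern`, `proto` and `parsed` are illustrative only.

```r
# Two sample lines copied from the question (tabs written as \t).
lines <- c(
  "MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -0.70 Kcal/mol",
  "MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -1.77 Kcal/mol"
)

# One capture group per wanted field. \\S+ stops at the first whitespace,
# so the Dir=/TAG= annotations and the Kcal/mol unit are left out.
pattern <- "^MicroRNA = (\\S+)\\s+Transcript = (\\S+).*\\tType = (\\S+)\\s+Energy = (\\S+)"

# The prototype fixes the column names and types (Energy becomes numeric).
proto <- data.frame(
  MicroRNA = character(), Transcript = character(),
  Type = character(), Energy = numeric(),
  stringsAsFactors = FALSE
)

# Lines that do not match the pattern would come back as a row of NA.
parsed <- utils::strcapture(pattern, lines, proto, perl = TRUE)
print(parsed)
```

With a real file, `lines` would be replaced by the vector returned by `readLines()` or `read_lines()`. Any row of `NA` in `parsed` points to a line that deviates from the assumed layout.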
library(stringr)
#Example data
data <- c("MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -0.70 Kcal/mol",
"MicroRNA = miR-981|LQNS02278082.1_33127_3p\t\tTranscript = TRINITY_GG_20135_c0_g1_i5.mrna1 Dir=antisense TAG=Neuronal acetylcholine receptor subunit alpha-9\t\tType = 7_A1\t\tEnergy = -5.77 Kcal/mol",
"MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -1.77 Kcal/mol",
"MicroRNA = LQNS02278125.1_38470_3p\t\tTranscript = TRINITY_GG_22182_c1_g1_i2.mrna1 Dir=antisense TAG=Uncharacterized protein\t\tType = 7_A1\t\tEnergy = -5.20 Kcal/mol"
)
#Split the data
lines_split <- strsplit(data, split="\t\t", fixed=TRUE)
#No of columns
cols=1:4
#Rbind rows
df <- as.data.frame(do.call("rbind", lapply(lines_split, "[", cols)))
#Extract any info after =
df[,-2] <- lapply(df[,-2],function(x) trimws(sub('.*=', '', x)))
#This field contains several '=' signs, so extract the text between the
#first '=' and 'Dir', as in your expected output
df$V2 <- str_match(df$V2, "=\\s*(.*?)\\s*Dir")[,2]
#Removing Kcal/mol
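df$V4 <- as.numeric(str_replace(df$V4,"Kcal/mol",""))
#Name the columns as in your expected output
names(df) <- c("MicroRNA", "Transcript", "Type", "Energy")
Posted on 2020-12-16 18:55:37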
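Use read.table.
# a: the four lines from dput(head(a, 4)) in the question
# header=FALSE: the file has no header row, so header=TRUE would consume the first record
# quote="": keeps apostrophes in the TAG text (e.g. B'') from being read as quote characters
r <- read.table(text=a, sep="\t", colClasses=c(NA, "NULL"), header=FALSE, quote="")
nn <- unname(sapply(r, function(x) trimws(unique(sapply(strsplit(x, "="), `[`, 1)))))
res <- setNames(as.data.frame(sapply(r, function(x) sapply(strsplit(x, "="), `[`, 2))), nn)
Result:
res
# MicroRNA Transcript Type Energy
# 1 miR-981|LQNS02278082.1_33127_3p TRINITY_GG_20135_c0_g1_i5.mrna1 Dir 7_A1 -0.70 Kcal/mol
# 2 miR-981|LQNS02278082.1_33127_3p TRINITY_GG_20135_c0_g1_i5.mrna1 Dir 7_A1 -5.77 Kcal/mol
# 3 LQNS02278125.1_38470_3p TRINITY_GG_22182_c1_g1_i2.mrna1 Dir 7_A1 -1.77 Kcal/mol
# 4 LQNS02278125.1_38470_3p TRINITY_GG_22182_c1_g1_i2.mrna1 Dir 7_A1 -5.20 Kcal/mol
https://stackoverflow.com/questions/65328456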
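The `res` data frame shown above still carries a trailing `Dir` in `Transcript` and the `Kcal/mol` unit in `Energy`, so it does not quite match the four columns requested in the question. A minimal clean-up sketch follows. It rebuilds a small stand-in for `res` from two of the rows above so that it runs on its own; the leading spaces in the stand-in are an assumption, based on the fact that splitting on `=` leaves the space that follows it.

```r
# Stand-in for `res` (two rows copied from the result above; the leading
# spaces are assumed, since strsplit(x, "=") keeps the space after "=").
res <- data.frame(
  MicroRNA   = c(" miR-981|LQNS02278082.1_33127_3p", " LQNS02278125.1_38470_3p"),
  Transcript = c(" TRINITY_GG_20135_c0_g1_i5.mrna1 Dir",
                 " TRINITY_GG_22182_c1_g1_i2.mrna1 Dir"),
  Type       = c(" 7_A1", " 7_A1"),
  Energy     = c(" -5.77 Kcal/mol", " -1.77 Kcal/mol"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace from every column.
res[] <- lapply(res, trimws)

# Drop the "Dir" left over from splitting "... Dir=antisense" on "=".
res$Transcript <- sub("\\s+Dir$", "", res$Transcript)

# Remove the unit and convert the energy to a number.
res$Energy <- as.numeric(sub("\\s*Kcal/mol$", "", res$Energy))

str(res)
```

The same three clean-up statements can be applied directly to the `res` produced by `read.table`.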