This is part of my data frame.
> df
Group Direction cytoband q value residual q value wide peak boundaries
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503
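V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554

I want to extract the characters or digits that follow "chr" in the "wide peak boundaries" column. I tried the code below, but the second row comes back as NA.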
library(tidyr)
df <- extract(df, 'wide peak boundaries', into = c('chr', 'start', 'end'),
'(\\d+)+:(\\d+)+-(\\d+)', remove = F, convert = T)
df
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 NA NA NA
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554

Data
structure(list(Group = c("All", "All", "All", "All", "All"),
Direction = c("DEL", "DEL", "DEL", "DEL", "DEL"), cytoband = c("11q25",
"Xp22.11", "10q23.31", "22q12.3", "11p15.4"), `q value` = c("7.78E-43",
"3.01E-38", "3.61E-31", "4.03E-25", "6.59E-25"), `residual q value` = c("2.22E-39",
"1.91E-35", "3.61E-31", "3.96E-25", "6.59E-25"), `wide peak boundaries` = c("chr11:130906630-135086622",
"chrX:23277186-26139553", "chr10:87745632-87859602", "chr22:33050952-34766503",
"chr11:3230287-3799554"), chr = c(11L, NA, 10L, 22L, 11L),
start = c(130906630L, NA, 87745632L, 33050952L, 3230287L),
end = c(135086622L, NA, 87859602L, 34766503L, 3799554L)), class = "data.frame", row.names = c("V29",
"V30", "V31", "V32", "V33"))

Answered 2021-11-09 08:58:10
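Just change \\d to \\w in the first capture group (\\d matches digits only, whereas \\w matches letters, digits and the underscore). That is why the "chrX" row returned NA: "X" is not a digit, so the whole pattern failed to match on that row.
Edit: (?<=chr) is a positive lookbehind; it ensures that \\w+ only starts matching immediately after the string chr.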
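To see the difference between the two character classes in isolation, here is a minimal base R sketch. It is not part of the original answer, and the names x, pattern, m and parts are purely illustrative. The two strings are copied from the question's data.

```r
# Two values copied from the question's `wide peak boundaries` column
x <- c("chr11:130906630-135086622", "chrX:23277186-26139553")

# \\d+ in front of the colon needs digits, so "chrX" cannot match
grepl("(\\d+):(\\d+)-(\\d+)", x, perl = TRUE)
# [1]  TRUE FALSE

# \\w+ after the lookbehind accepts letters as well as digits
pattern <- "(?<=chr)(\\w+):(\\d+)-(\\d+)"
grepl(pattern, x, perl = TRUE)
# [1] TRUE TRUE

# regexec() returns the full match followed by each capture group;
# drop the full match and bind the groups into a character matrix
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
parts <- do.call(rbind, lapply(m, function(v) v[-1]))
colnames(parts) <- c("chr", "start", "end")
parts
```

The tidyr call that follows applies the same pattern to the whole column in one step.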
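library(tidyr)  # provides extract() and re-exports %>%

# Capture the chromosome label after "chr", then the start and end positions
df %>%
  extract(col = 'wide peak boundaries',
          into = c('chr', 'start', 'end'),
          regex = '((?<=chr)\\w+):(\\d+)-(\\d+)',
          remove = FALSE, convert = TRUE)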
Group Direction cytoband q value residual q value wide peak boundaries chr start end
V29 All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 11 130906630 135086622
V30 All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 X 23277186 26139553
V31 All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 10 87745632 87859602
V32 All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 22 33050952 34766503
V33 All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 11 3230287 3799554

Answered 2021-11-09 09:23:22
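library(data.table)
# mydata is the data frame from the question (called df there).
# Split on ":" or "-"; note the chr column keeps its "chr" prefix.
setDT(mydata)[, c("chr", "start", "end") := tstrsplit(`wide peak boundaries`, "[:-]", perl = TRUE)]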
Group Direction cytoband q value residual q value wide peak boundaries chr start end
1: All DEL 11q25 7.78E-43 2.22E-39 chr11:130906630-135086622 chr11 130906630 135086622
2: All DEL Xp22.11 3.01E-38 1.91E-35 chrX:23277186-26139553 chrX 23277186 26139553
3: All DEL 10q23.31 3.61E-31 3.61E-31 chr10:87745632-87859602 chr10 87745632 87859602
4: All DEL 22q12.3 4.03E-25 3.96E-25 chr22:33050952-34766503 chr22 33050952 34766503
5: All DEL 11p15.4 6.59E-25 6.59E-25 chr11:3230287-3799554 chr11 3230287 3799554

https://stackoverflow.com/questions/69895156