Here is my sample data:
dput(aa)
structure(list(V4 = structure(1:22, .Label = c("Peak228404",
"Peak228411", "Peak228413", "Peak228423", "Peak228424", "Peak228439",
"Peak228461", "Peak228476", "Peak228479", "Peak228495", "Peak228528",
"Peak228553", "Peak228603", "Peak228612", "Peak228629", "Peak228630",
"Peak228642", "Peak228651", "Peak228691", "Peak228740", "Peak4983",
"Peak5261"), class = "factor"), annotation = structure(c(1L,
4L, 5L, 1L, 1L, 1L, 6L, 8L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L,
8L, 7L, 8L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)",
"Downstream (2-3kb)", "Exon (ENST00000370460.6/2334, exon 16 of 21)",
"Exon (ENST00000370460.6/2334, exon 21 of 21)", "Exon (ENST00000616857.4/84548, exon 3 of 3)",
"Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)", "Promoter"
), class = "factor"), Output_required = structure(c(1L, 5L, 5L,
1L, 1L, 1L, 5L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L, 4L,
6L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)",
"Downstream (2-3kb)", "Exon", "Exon ", "Promoter"), class = "factor")), class = "data.frame", row.names = c(NA,
-22L))
V4 annotation Output_required
1 Peak228404 Distal Intergenic Distal Intergenic
2 Peak228411 Exon (ENST00000370460.6/2334, exon 16 of 21) Exon
3 Peak228413 Exon (ENST00000370460.6/2334, exon 21 of 21) Exon
4 Peak228423 Distal Intergenic Distal Intergenic
5 Peak228424 Distal Intergenic Distal Intergenic
6 Peak228439 Distal Intergenic Distal Intergenic
7 Peak228461 Exon (ENST00000616857.4/84548, exon 3 of 3) Exon
8 Peak228476 Promoter Promoter
9 Peak228479 Distal Intergenic Distal Intergenic
10 Peak228495 Distal Intergenic Distal Intergenic
11 Peak228528 Distal Intergenic Distal Intergenic
12 Peak228553 Distal Intergenic Distal Intergenic
13 Peak228603 Distal Intergenic Distal Intergenic
14 Peak228612 Distal Intergenic Distal Intergenic
15 Peak228629 Promoter Promoter
16 Peak228630 Promoter Promoter
17 Peak228642 Promoter Promoter
18 Peak228651 Promoter Promoter
19 Peak228691 Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4) Exon
20 Peak228740 Promoter Promoter
21 Peak4983 Downstream (1-2kb) Downstream (1-2kb)
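22 Peak5261 Downstream (2-3kb) Downstream (2-3kb)
In this data frame, some rows of the annotation column contain the string Exon followed by text in parentheses. I don't want the parenthesized text; I want those rows to read consistently as just Exon. I added another column, Output_required, which shows the final output I want.
Any suggestions or help would be greatly appreciated.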
Answered 2021-09-24 11:19:51
Removing everything after 'Exon' can be done with the help of a lookbehind.
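As a small illustration of what the lookbehind does, the sketch below applies the same pattern to a made-up three-element vector. The vector `x` and the name `res` are assumptions for illustration, not part of the question's data.

```r
# Hypothetical toy vector modeled on the annotation column
x <- c("Exon (ENST00000370460.6/2334, exon 16 of 21)",
       "Promoter",
       "Downstream (1-2kb)")

# (?<=Exon) is a zero-width lookbehind: it matches a position that is
# preceded by "Exon" without consuming "Exon" itself, so only the text
# after it is removed. perl = TRUE is needed because R's default regex
# engine does not support lookbehind.
res <- sub("(?<=Exon).*", "", x, perl = TRUE)
print(res)
```

Strings that do not contain Exon, including the parenthesized Downstream labels, are left unchanged because the pattern never matches them. Applied to the question's data, the call is: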
sub('(?<=Exon).*', '', aa$annotation, perl = TRUE)
# [1] "Distal Intergenic" "Exon" "Exon" "Distal Intergenic"
# [5] "Distal Intergenic" "Distal Intergenic" "Exon" "Promoter"
# [9] "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic"
#[13] "Distal Intergenic" "Distal Intergenic" "Promoter" "Promoter"
#[17] "Promoter" "Promoter" "Exon" "Promoter"
#[21] "Downstream (1-2kb)" "Downstream (2-3kb)"
Similarly, you can also use stringr::str_remove.
stringr::str_remove(aa$annotation, '(?<=Exon).*')
Answered 2021-09-24 12:34:57
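Another way to achieve this is with a backreference:
sub("(Exon)(.*)", "\\1", aa$annotation)
Here we split the string into two capture groups:
(Exon): this group captures Exon literally
(.*): this group captures everything else
\\1: the backreference used in sub's replacement argument; it "recalls" the first capture group but not the second, which effectively removes the second.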
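A minimal sketch of the same idea, writing the result into a new column and checking it against the expected one. The data frame `toy` and the column name `cleaned` are hypothetical; the three rows are modeled on the question's data. In the dput output above, Output_required has both an "Exon" and an "Exon " (trailing space) level, so the comparison trims whitespace first.

```r
# Hypothetical three-row miniature of the question's data frame
toy <- data.frame(
  annotation = c("Distal Intergenic",
                 "Exon (ENST00000370460.6/2334, exon 16 of 21)",
                 "Promoter"),
  Output_required = c("Distal Intergenic", "Exon ", "Promoter"),
  stringsAsFactors = TRUE
)

# Only the first group needs capturing: "\\1" keeps what (Exon) matched
# and drops the rest. sub() accepts a factor and returns character.
toy$cleaned <- sub("(Exon).*", "\\1", toy$annotation)

# Trim the stray trailing space in the expected column before comparing
matches <- toy$cleaned == trimws(as.character(toy$Output_required))
print(all(matches))
```

This variant needs no perl = TRUE, since capture groups and backreferences are supported by R's default regex engine.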
Answered 2021-09-24 18:14:48
We can use trimws from base R:
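As a rough sketch of why this works: the `whitespace` argument of trimws (available in R 3.6.0 and later, as far as I know) is treated as a Perl-style regular expression for what to strip from the ends of each string, so a pattern meaning "everything after Exon" gets removed as if it were trailing whitespace. The vector `x` below is a hypothetical example, and the explicit sub call is my assumption of a roughly equivalent form.

```r
# Hypothetical toy vector modeled on the annotation column
x <- c("Exon (ENST00000616857.4/84548, exon 3 of 3)", "Promoter")

# which = "right" strips only a trailing run matching `whitespace`;
# here that "whitespace" is everything following "Exon"
res_trimws <- trimws(x, which = "right", whitespace = "(?<=Exon).*")

# A roughly equivalent explicit call, anchored at the end of the string
res_sub <- sub("(?<=Exon).*$", "", x, perl = TRUE)

print(identical(res_trimws, res_sub))
```

On the question's data, the call with the default which = "both" is: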
trimws(aa$annotation, whitespace = "(?<=Exon).*")
[1] "Distal Intergenic" "Exon" "Exon" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Exon"
[8] "Promoter" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic"
[15] "Promoter" "Promoter" "Promoter" "Promoter" "Exon" "Promoter" "Downstream (1-2kb)"
[22] "Downstream (2-3kb)"
https://stackoverflow.com/questions/69314123