I have a file whose lines look like this:
EF457507|S000834932 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4
EF457374|S000834799 Root;Bacteria;"Acidobacteria";Acidobacteria_Gp14;Gp14
AJ133184|S000323093 Root;Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family I;GpI
DQ490004|S000686022 Root;Bacteria;"Armatimonadetes";Armatimonadetes_gp7
AF268998|S000340459 Root;Bacteria;TM7;TM7_genera_incertae_sedis
I want to remove everything between the first tab and the last semicolon, like this:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
I tried using a regular expression, but it did not work. Is there a way to do this with Linux tools, awk, or Perl?
Posted on 2012-12-20 22:22:10
You can use sed:
sed 's/\t.*;/\t/' file
## This matches a tab character '\t', followed by any character '.' any number
## of times '*', followed by a semicolon, and replaces all of it with a tab
## character '\t'. Because '.*' is greedy, the match runs to the last semicolon.
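To see why this command stops at the last semicolon rather than the first, here is a minimal sketch that can be pasted into a shell. The two sample lines come from the question; the `sample`, `tab`, and `result` variable names are assumptions made only for this illustration. It passes a literal tab through a shell variable because the `\t` escape is a GNU sed extension and other sed implementations may not understand it.

```shell
# Two sample records from the question, with a real tab between ID and lineage.
sample=$(printf '%s\t%s\n' \
  'EF457507|S000834932' 'Root;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4' \
  'AF268998|S000340459' 'Root;Bacteria;TM7;TM7_genera_incertae_sedis')

# Keep a literal tab in a variable so the command does not rely on sed's '\t'.
tab=$(printf '\t')

# '.*' is greedy: it consumes as much as possible, so the ';' in the pattern
# ends up being the LAST semicolon on each line.
result=$(printf '%s\n' "$sample" | sed "s/${tab}.*;/${tab}/")

printf '%s\n' "$result"
```

Each output line should contain only the ID, a tab, and the text after the last semicolon.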
sed 's/[^\t]*;//' file
## Characters inside square brackets form a character class. For example,
## '[0-9]' matches any digit from zero to nine. When the first character in
## the class is a '^', the class is negated. So '[^\t]*;' matches any run of
## non-tab characters followed by a semicolon.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/.*;/,"",$2) }1' file
awk '{ sub(/[^\t]*;/,"") }1' file
Results:
EF457507|S000834932 Gp4
EF457374|S000834799 Gp14
AJ133184|S000323093 GpI
DQ490004|S000686022 Armatimonadetes_gp7
AF268998|S000340459 TM7_genera_incertae_sedis
Following the comment below asking to "remove everything after the last semicolon", use sed:
sed 's/[^;]*$//' file
## '[^;]*$' matches any run of non-semicolon characters anchored to the end of
## the line.
Or awk:
awk 'BEGIN { FS=OFS="\t" } { sub(/[^;]*$/,"",$2) }1' file
awk '{ sub(/[^;]*$/,"") }1' file
https://stackoverflow.com/questions/13974083
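As a quick check of the "remove everything after the last semicolon" commands, the following sketch runs the sed and awk versions on two sample lines from the question and compares them. The variable names `sample`, `with_sed`, and `with_awk` are assumptions for this illustration, not part of the original answer.

```shell
# Two sample records from the question, with a real tab between ID and lineage.
sample=$(printf '%s\t%s\n' \
  'EF457507|S000834932' 'Root;Bacteria;"Acidobacteria";Acidobacteria_Gp4;Gp4' \
  'AF268998|S000340459' 'Root;Bacteria;TM7;TM7_genera_incertae_sedis')

# sed: delete the trailing run of non-semicolon characters on each line.
with_sed=$(printf '%s\n' "$sample" | sed 's/[^;]*$//')

# awk: same substitution, applied to the second tab-separated field only.
with_awk=$(printf '%s\n' "$sample" |
  awk 'BEGIN { FS=OFS="\t" } { sub(/[^;]*$/, "", $2) } 1')

printf '%s\n' "$with_sed"

# Report whether the two approaches produced identical output.
if [ "$with_sed" = "$with_awk" ]; then
  echo "sed and awk agree"
else
  echo "sed and awk differ"
fi
```

Each line keeps its ID and the lineage up to and including the final semicolon; only the last taxon name is dropped.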