我有像这样的1.文件:
>YP_008856774.1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP_008856775.1
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP_008856776.1
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP_008856777.1
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR我想将每个标记(eg.>YP_008856776.1)重命名如下:
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR首先,我使用sed -i "s/\_//g" 1.file删除了\_。或者我应该删除标题的最后四个字符,然后添加_和“订单号”?简而言之,我想在>之后重命名标签;第一步是替换_;然后删除每个标签的最后四个字符,然后在每个标签后面添加_,最后在每个标签之后添加顺序编号。(eg.>YP_008856774.1 to >YP008856774.1 to > to 008856774.1 to > to 008856 to >YP008856_ to >YP008856_1)。我现在还不能用我的能力来实现它。你能帮我解决这些麻烦吗?谢谢。
发布于 2022-07-28 06:33:06
使用GNU awk
$ awk -F_ 'BEGIN {c=1} /^>/{match($2,/(.{6}).*/,a); $2=a[1] FS c++}1' OFS="" input_file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR发布于 2022-07-28 12:10:12
在每个Unix框上使用任何shell中的任何awk:
$ awk '/>/{$0=substr($0,1,3) substr($0,5,6) "_" (++c)} 1' file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR发布于 2022-07-28 06:39:40
$ awk '/^>/ { tag = substr($0,1,3) substr($0,5,6); $0 = sprintf("%s_%d", tag, ++count[tag]) }; 1' file
>YP008856_1
MHGTRTSAGWSTQPGKFDVLNLRMTFESSSAYQIPDLQPTEFIPTSLAAWNMPRHREYAAVSGGALHFFLDDYRFETVWS
>YP008856_2
MGGRGGGGGPGPGTGAKNKKAGGGSAGGLGGGGGSGGSSGGGGKGTGTTGTGGVQNGSGGGGNGAGGGSSNTTKPVEQYE
>YP008856_3
MQPPIEPVDPPTGDVSPYPNDLLILGGNRWLTITGRILHTPFGDQVELKPNTVKFWEAAAMRGQGKTLSELIV
>YP008856_4
MTWAGSRRRDELPPDWELKYRLPVLSAANWLCEVNGPGCVRAATDVDHKKRGNDHSRSNLQAICRVCHGRKSAAEGVARR上面的awk命令将重写每个标题行,方法是使用原始标题行的特定部分(字符1到3,从5到10,跳过位置4的_ )作为标记。为每个唯一标记维护一个计数器。
这假设原始标识符总是在表单XX_NNNNNN上,然后是任何其他文本(被忽略)。
你也可以用
awk '/^>/ { sub(/_/, ""); sub(/...\..*/, ""); tag = $0; $0 = sprintf("%s_%d", tag, ++count[tag]) }; 1' file这将稍微更具动态性,因为它在删除下划线和(包括)一组三个字符和一个点之后,从原始标识符的剩馀部分创建标记。
https://unix.stackexchange.com/questions/711490
复制相似问题