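I have several large files full of biological sequences that I want to split into chunks of a specific number of sequences. Before that, however, some reformatting is needed.
Each record has a line starting with ">k...". This is the sequence header. The next one or more lines are the biological sequence. Every sequence has a header, but some sequences span two or more lines. I want to combine the sequence lines under each header, converting multi-line sequences into one long sequence.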
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAE
VRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGW
TCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATT
GNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPE
KDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTF
RVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVH
QRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTAT
RLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPS
SNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNI
I
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIE
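ERFADEIEALYAPLVRS*
I am currently using a while loop that iterates over every line of the text file. Inside an if statement, I use awk and pattern matching to check whether two consecutive lines both contain sequence. However, my pattern matching isn't working and I don't know why.
I tried grep, but grep reads the entire file at once. To remove the \n character from specific lines I tried sed, which didn't work, and I also tried tr.
I would appreciate any help, including a completely different approach to this problem.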
file=protein_test.fa
i=1 # for line counter
prot_reg="^[A-Z]{10,}" # regex for biological sequence
while read -r line; do
    # Read in 2 lines at the same time
    awk1=$(awk -v i=$i 'NR==i' < $file)
    awk2=$(awk -v i=$i 'NR==i+1' < $file)
    if [[ ${awk1}=~$prot_reg && ${awk2}=~$prot_reg ]]
    then
        echo $awk1$awk2
    else
        echo $awk1
        echo $awk2
    fi
    let i=i+1 # til all lines read, add 1 to i
done < $file
Here is what I want:
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*
Posted on 2019-11-22 00:46:55
Is this what you're trying to do?
$ awk '{printf "%s", (/^>/ ? s $0 ORS : $0); s=ORS} END{print ""}' file
>k141_0_1 # 86 # 388 # -1 # ID=1_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.703
PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAEVRDAVWARAFFEGLEPGEHAPVLPKELAEKPRLGGDRHRE*
>k141_964934_1 # 3 # 341 # -1 # ID=2_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.699
AVSTCVYVYSLGYMDRDVEDRADPVRAPNVRRFFCLLDFFLFSMAILVLAGNIAVLLIGWTCVGLSSFLLISYWTGKPGTLSAGLQALAANAIGDAALLVALVLVPAGCGDLL
>k141_1688630_1 # 1 # 150 # 1 # ID=3_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.707
ALLLVLVLRVVHAYHSERLSDVADEEAELNARLEREEAPQHAEEEAAAL*
>k141_1688630_2 # 147 # 416 # 1 # ID=3_2;partial=01;start_type=GTG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.748
MTALSIALLAPWAAGIVLVALDGRRRLIGWLAIGALFANLAGLTILAVSVLSDDPEVATTGNWPTGVGITLRADALGVLFALLSSPRAAR
>k141_361851_1 # 2 # 388 # 1 # ID=4_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.721
PATLTAAGGVFASGLSSRGRLVSGAAPKFYGNPLVAWTPAPGASAYEVQWSKTRYPFRPEKDPQNGNAFGRLTLGTSAVLPLRPGVWYYRVRGFSFALPTGAQQLSWSDPARIVVAKPTFRVVRRKHK*
>k141_241234_1 # 224 # 373 # -1 # ID=5_1;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.713
MVSLIGGLLTFTLGTGLVTWGAAVRGAMEHDGTLRGAGRLPQGASQEAS*
>k141_1206166_1 # 179 # 322 # -1 # ID=6_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
YQYAWTDLLGPTLVWDQVARGVLWSLAYSLVLYAAAWWHFLRKDVLS*
>k141_482468_1 # 123 # 314 # -1 # ID=7_1;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.615
AQFALRWILMNEAVSVVIPGARNPEQAIANTQASELPALSVNQMEAANAIFDRLIRPHVHQRW*
>k141_1447399_1 # 3 # 317 # -1 # ID=8_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.711
RTRPGSSPRGWFGPHLEALWTYLHEHHHISYARLEAIGRDLWHLAVSQGALANALRRTATRLRPEAGAIREQVRASPTIGSDETSARVNGRTHWQWVFQTPTASY
>k141_1_1 # 2 # 364 # 1 # ID=9_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.653
SYMFVTGPDVVKTVTHETVTQEELGGAVTHTTRSGVADLAFENDVEALLQLRRFMDFLPSSNREKPPVRPTWDSPDREEASLDTLIPANPNKPYDMKELILKVVDEGDFFEIQPTYARNII
>k141_964935_1 # 2 # 235 # 1 # ID=10_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.726
LADLSRDPDLRAELQRAVDRVNDGAAHHARIRRFSVVPRPFSADAGEITPTLKLKRRVIEERFADEIEALYAPLVRS*
Please read the following to understand the main problems with the code you posted:
https://stackoverflow.com/questions/58985810
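One concrete problem in the posted loop, as far as I can tell from reading it (this is my own observation, not a quote from the link), is the missing whitespace around `=~`. Without spaces, bash treats `${awk1}=~$prot_reg` as a single non-empty word, so the test is always true and the regex is never applied. The sketch below contrasts the two forms; the function names `is_seq_broken` and `is_seq_fixed` and the shortened header are made up for illustration.

```shell
#!/usr/bin/env bash
# Regex from the question: ten or more capital letters at the start of the line.
prot_reg='^[A-Z]{10,}'

# Sample lines (header shortened for the example).
header='>k141_0_1 # 86 # 388 # -1 # ID=1_1'
seq='PSSSAVGGTVTGDTQGCWRVDELRLRGGDDAEWARVIETHSAIIESVLRRRVGDASMRAE'

# No spaces around =~ : bash sees one non-empty word, so this always succeeds.
is_seq_broken() { [[ ${1}=~$prot_reg ]]; }

# Spaces around =~ and an unquoted pattern variable: a real regex match.
is_seq_fixed() { [[ $1 =~ $prot_reg ]]; }

is_seq_broken "$header" && echo "broken test: header wrongly treated as sequence"
is_seq_fixed "$header" || echo "fixed test: header rejected"
is_seq_fixed "$seq" && echo "fixed test: sequence accepted"
```

Even with the spacing fixed, the loop still re-reads the whole file with awk twice for every input line, and the `{10,}` length requirement would miss short trailing lines such as `QRW*`, so a single awk pass like the one above is likely the better route.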
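For readers who find the awk one-liner in this answer dense, here is an unrolled sketch of the same idea: header lines are printed on their own line, sequence lines are appended without a newline, and a final newline is written at the end. The wrapper name `unwrap_fasta`, the variable `sep`, and the tiny sample file are assumptions made for the example, and the `NR > 0` guard is a small addition so that an empty file produces no output.

```shell
# unwrap_fasta FILE (hypothetical helper): join wrapped sequence lines per header.
unwrap_fasta() {
    awk '
        # Header: close the previous sequence (sep is empty on line 1), then print it.
        /^>/  { printf "%s%s\n", sep, $0 }
        # Sequence line: print without a trailing newline so the next one is appended.
        !/^>/ { printf "%s", $0 }
        # After the first line, every later header must first end the open sequence.
        { sep = "\n" }
        # Terminate the last sequence, but only if there was any input.
        END { if (NR > 0) print "" }
    ' "$1"
}

# Small demonstration on a throwaway file.
tmp=$(mktemp)
printf '%s\n' '>seq1' 'AAAA' 'CCCC' '>seq2' 'GGGG' > "$tmp"
unwrap_fasta "$tmp"
rm -f "$tmp"
```

Because awk streams the file once, this stays linear in the file size, unlike a shell loop that calls awk for every line.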