我在一个目录中有多个fasta文件。我需要为每个单独的fasta文件创建一个特定格式的文本文件。生成的文本文件将用作下游另一个程序的输入文件。我是python的新手,我确信我下面的脚本和它们一样笨拙。我相信在biopython中有更好的方法来完成这项任务。
import os
import re
for FILE in os.listdir():
if FILE.endswith(".fasta"):
OUTPUT = open(FILE+".lft",'w')
with open(FILE, 'r') as FIN:
for LINE in FIN:
if LINE.startswith('>'):
HEADER = re.sub('>','',LINE)
HEADER2 = re.sub('\n','',HEADER)
PART1_HEADER = HEADER2.split(":")
CONTIG = str(PART1_HEADER[0])
PART2_HEADER = PART1_HEADER[1]
SPLIT_PART2 = PART2_HEADER.split("-")
START = int(SPLIT_PART2[0])
END = int(SPLIT_PART2[1])
LENGTH = END-START
OUTPUT.write(str(START) + '\t' + str(HEADER2) + '\t' + str(LENGTH) + '\t' + str(CONTIG) + '\t' + str(END) + '\n')下面是每个fasta文件中的标头示例:
>Spp-0:0-500
>Spp-1:0-3538
>Spp-2:0-1421
>Spp-3:0-500其中:
“Spp”=物种名称
“-0”= fasta中的序列ID
“:0-500”=序列的开始和结束位置(并非全部从零开始)。
我希望生成一个如下所示的文本文件:
0 aVan-0:0-500 500 aVan-0 500
0 aVan-1:0-3538 3538 aVan-1 3538
0 aVan-2:0-1421 1421 aVan-2 1421
0 aVan-3:0-500 500 aVan-3 5001st column: start position
2nd column: original header
3rd column: end position
4th column: everything before the ":"
5th column: the length between start & stop positions我的代码运行得相当好,但是如果我在一个目录中有25个fasta序列,我的代码将只处理前24个序列。它还会在屏幕上输出一串数字(通常是41或44……不知道为什么?)我也想摆脱它。
发布于 2019-11-07 01:20:08
只是使用awk
awk -F ':' '/^>/ {split($2,a,/\-/);printf("%s\t%s\t%s\t%s\t%s\n",a[1],substr($0,2),a[2],substr($1,2),int(a[2])-int(a[1]));}' in.fasta
0 Spp-0:0-500 500 Spp-0 500
0 Spp-1:0-3538 3538 Spp-1 3538
0 Spp-2:0-1421 1421 Spp-2 1421
0 Spp-3:0-500 500 Spp-3 500发布于 2019-11-24 01:54:23
这里有一个建议,让你的代码更具“pythonic风格”:
#!/usr/bin/env python3
import os
def main():
# Don't use capital letters for variable names.
for filename in os.listdir():
if filename.endswith(".fasta"):
# Use context manager for both input and output.
with open(filename + ".lft", 'w') as output, \
open(filename, 'r') as fasta_in:
for line in fasta_in:
if line.startswith('>'):
# No need to use re.sub to just skip the first
# character (which you know is a '>').
# .strip() will remove "blanks" at ends.
header = line[1:].strip()
# You can "unpack" a list to directly store the
# parts in variables
[contig, coords] = header.split(":")
[start, end] = coords.split("-")
length = int(end) - int(start)
output.write("\t".join([
start, header, str(length), contig, end]))
output.write("\n")
main()https://stackoverflow.com/questions/58733410
复制相似问题