I have an input file like this:
@sample1
ATGGTTCCAAGGCCTTGGTTAATTGGGGGGTTTTTTTTTTTTTTTTTTT
@sample2
TTGGAACCTTGGCCAATTAAGGGGGGGGGTTTTTTTCCCCCCCCCCCCC
@sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
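AATTTTTTTTTTTTTGG
I want to print the records whose sequence has at least a given minimum length. For example, if the minimum length I want is 66, the output would be: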
@sample3
GGTTGGTTGGGAATTTGGTTAACCTTTTTAAATTTTTTTTTTTGGGGGG
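AATTTTTTTTTTTTTGG
because only the sequence of sample3 reaches the minimum length of 66 (its two lines together are 66 characters long).
So far, my code is as follows: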
fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = []
            continue
        sequence = line
        fastfile[sequencenumber].append(sequence)
output = []
for key, value in fastfile.items():
    if len(value) >= sys.argv[2]:
        output.append(value)
print (output)
sys.argv[1] is the path to the input file and sys.argv[2] is the minimum length.
Answered 2019-11-23 11:10:50
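You want the values of the fastfile dictionary to be strings rather than lists, so concatenate each successive sequence line onto a running string instead of appending it to a running list. Also note that sys.argv[2] is a string, so convert it with int() before comparing it to a length:
import sys

fastfile = {}
with open(sys.argv[1]) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if line[0] == "@":
            # header line: start a new record keyed by the sample name
            sequencenumber = line[1:]
            if sequencenumber not in fastfile:
                fastfile[sequencenumber] = ""
            continue
        # sequence line: extend the running string of the current record
        fastfile[sequencenumber] += line
min_length = int(sys.argv[2])  # command-line arguments are strings
output = []
for key, value in fastfile.items():
    if len(value) >= min_length:
        output.append(value)
print(output)
Alternatively, if you need to keep the strings in a list as you originally did, use "".join(value) to join all of them together, like this: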
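min_length = int(sys.argv[2])  # command-line arguments are strings
output = []
for key, value in fastfile.items():
    sequence = "".join(value)  # join the record's lines into one string
    if len(sequence) >= min_length:
        output.append(sequence)
print(output)
Answered 2019-11-23 11:33:56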
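This looks much simpler:
from sys import argv

with open(argv[1]) as fin:
    text = fin.read()
min_length = int(argv[2])
# split into records and drop the empty piece before the first '@'
parts = [p for p in text.split('@') if p.strip()]
# keep only the records whose sequence (all lines after the header, joined) reaches min_length
parts = [p for p in parts
         if len(''.join(line.strip() for line in p.split('\n')[1:])) >= min_length]
# put back the '@' that split() removed from each header
output = ''.join('@' + p for p in parts)
print(output, end='')
https://stackoverflow.com/questions/59004153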
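Pulling the two answers together, here is a minimal sketch that reads the file one record at a time and keeps the header next to its sequence, so the result can be printed in the format the question asks for. The function names read_records and filter_by_length are made up for illustration, and the sketch assumes that every header line starts with "@" and that no sequence line does.

```python
import sys


def read_records(lines):
    """Yield (header, sequence) pairs; the sequence lines of a record are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # emit the last record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(lines, min_length):
    """Return the records whose joined sequence has at least min_length characters."""
    return [(h, s) for h, s in read_records(lines) if len(s) >= min_length]


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f:
        for header, sequence in filter_by_length(f, int(sys.argv[2])):
            print(header)
            print(sequence)
```

Run as a script with the file path and the minimum length as its two arguments, this prints each matching header followed by its sequence on a single line; the original line wrapping of the sequence is not preserved.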