文章/答案/技术大牛

发布

社区首页 >问答首页 >Python: Rosalind共识和简介

问Python: Rosalind共识和简介
EN

Stack Overflow用户

提问于 2020-04-17 18:43:14

回答 1查看 727关注 0票数 2

我正在努力解决罗莎琳德面临的“共识和形象”挑战。质疑指示如下：

给定:以FASTA格式最多收集10条相同长度(最多为1 kbp)的DNA字符串。

返回:集合的一致字符串和配置文件矩阵。(如果存在多个可能的协商一致字符串，则可以返回其中的任何一个。)

我的代码如下(我从这个网站上的另一个用户那里得到了大部分代码)。我唯一的问题是，一些DNA链被分解成多个不同的行，因此它们被作为单独的字符串附加到“allstring”列表中。我试图找出如何将每一行不包含">“的连续行写成一个字符串。

import numpy as np

seq = []
allstrings = []
temp_seq = []
matrix = []
C = []
G = []
T = []
A = []
P = []
consensus = []
position = 1

file = open("C:/Users/knigh/Documents/rosalind_cons (3).txt", "r")
conout = open("C:/Users/knigh/Documents/consensus.txt", "w")

# Right now, this is reading and writing each as an individual line. Thus, it
#  is splitting each sequence into multiple small sequences. You need to figure
#  out how to read this in FASTA format to prevent this from occurring
desc = file.readlines()

for line in desc:
    allstrings.append(line)

for string in range(1, len(allstrings)):
    if ">" not in allstrings[string]:
        temp_seq.append(allstrings[string])
    else:
        seq.insert(position, temp_seq[0])
        temp_seq = []
        position += 1

# This last insertion into the sequence must be performed after the loop to empty
#  out the last remaining string from temp_seq
seq.insert(position, temp_seq[0])

for base in seq:
    matrix.append([pos for pos in base])

M = np.array(matrix).reshape(len(seq), len(seq[0]))

for base in range(len(seq[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for pos in M[:, base]:
        if pos == "A":
            A_count += 1
        elif pos == "C":
            C_count += 1
        elif pos == "G":
            G_count += 1
        elif pos == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)

profile_matrix = {"A": A, "C": C, "G": G, "T": T}

P.append(A)
P.append(C)
P.append(G)
P.append(T)

profile = np.array(P).reshape(4, len(A))

for pos in range(len(A)):
    if max(profile[:, pos]) == profile[0, pos]:
        consensus.append("A")
    elif max(profile[:, pos]) == profile[1, pos]:
        consensus.append("C")
    elif max(profile[:, pos]) == profile[2, pos]:
        consensus.append("G")
    elif max(profile[:, pos]) == profile[3, pos]:
        consensus.append("T")

conout.write("".join(consensus) + "\n")

for k, v in profile_matrix.items():
    conout.write(k + ": " + " ".join(str(x) for x in v) + "\n")

conout.close()

python

string

bioinformatics

rosalind

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-04-17 20:37:12

有几种方法可以将FASTA文件作为记录来迭代。您可以使用预先构建的库或编写自己的库。

用于处理序列数据的一个广泛使用的库是biopython。此代码段将创建一个字符串列表。

from Bio import SeqIO


file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        sequences.append(record.seq)

或者，您可以编写自己的FASTA解析器。像这样的事情应该有效：

def read_fasta(fh):
    # Iterate to get first FASTA header        
    for line in fh:
        if line.startswith(">"):
            name = line[1:].strip()
            break

    # This list will hold the sequence lines
    fa_lines = []

    # Now iterate to find the get multiline fasta
    for line in fh:
        if line.startswith(">"):
            # When in this block we have reached 
            #  the next FASTA record

            # yield the previous record's name and
            #  sequence as tuple that we can unpack
            yield name, "".join(fa_lines)

            # Reset the sequence lines and save the
            #  name of the next record
            fa_lines = []
            name = line[1:].strip()

            # skip to next line
            continue

        fa_lines.append(line.strip())

    yield name, "".join(fa_lines)

您可以这样使用这个函数：

file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for name, seq in read_fasta(file_handle):
        sequences.append(seq)

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/61278530

复制

相似问题

问Python: Rosalind共识和简介
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python: Rosalind共识和简介EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问Python: Rosalind共识和简介
EN