
Python: Rosalind Consensus and Profile

Stack Overflow user
Asked 2020-04-17 18:43:14
Answers: 1 · Views: 727 · Followers: 0 · Score: 2

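I am trying to solve the "Consensus and Profile" challenge on Rosalind. The problem statement is as follows:

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, you may return any one of them.)

My code is below (I got most of it from another user on this site). My only problem is that some of the DNA strands are broken across multiple lines, so each piece is appended to the "allstrings" list as a separate string. I am trying to figure out how to join each run of consecutive lines that do not contain ">" into a single string.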

import numpy as np

seq = []
allstrings = []
temp_seq = []
matrix = []
C = []
G = []
T = []
A = []
P = []
consensus = []
position = 1

file = open("C:/Users/knigh/Documents/rosalind_cons (3).txt", "r")
conout = open("C:/Users/knigh/Documents/consensus.txt", "w")

# Right now, this is reading and writing each as an individual line. Thus, it
#  is splitting each sequence into multiple small sequences. You need to figure
#  out how to read this in FASTA format to prevent this from occurring
desc = file.readlines()

for line in desc:
    allstrings.append(line)

for string in range(1, len(allstrings)):
    if ">" not in allstrings[string]:
        temp_seq.append(allstrings[string])
    else:
        seq.insert(position, temp_seq[0])
        temp_seq = []
        position += 1

# This last insertion into the sequence must be performed after the loop to empty
#  out the last remaining string from temp_seq
seq.insert(position, temp_seq[0])

for base in seq:
    matrix.append([pos for pos in base])

M = np.array(matrix).reshape(len(seq), len(seq[0]))

for base in range(len(seq[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for pos in M[:, base]:
        if pos == "A":
            A_count += 1
        elif pos == "C":
            C_count += 1
        elif pos == "G":
            G_count += 1
        elif pos == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)

profile_matrix = {"A": A, "C": C, "G": G, "T": T}

P.append(A)
P.append(C)
P.append(G)
P.append(T)

profile = np.array(P).reshape(4, len(A))

for pos in range(len(A)):
    if max(profile[:, pos]) == profile[0, pos]:
        consensus.append("A")
    elif max(profile[:, pos]) == profile[1, pos]:
        consensus.append("C")
    elif max(profile[:, pos]) == profile[2, pos]:
        consensus.append("G")
    elif max(profile[:, pos]) == profile[3, pos]:
        consensus.append("T")

conout.write("".join(consensus) + "\n")

for k, v in profile_matrix.items():
    conout.write(k + ": " + " ".join(str(x) for x in v) + "\n")

conout.close()

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-04-17 20:37:12

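There are a few ways to iterate over a FASTA file as records. You can use a prebuilt library or write your own parser.

A widely used library for working with sequence data is Biopython. This snippet builds a list of the sequences in the file.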

from Bio import SeqIO


file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        # record.seq is a Bio.Seq.Seq object; str() gives a plain string
        sequences.append(str(record.seq))

Alternatively, you can write your own FASTA parser. Something like this should work:

def read_fasta(fh):
    # Iterate to get first FASTA header
    for line in fh:
        if line.startswith(">"):
            name = line[1:].strip()
            break
    else:
        # No header found (empty or non-FASTA input): yield nothing
        return

    # This list will hold the sequence lines
    fa_lines = []

    # Now iterate over the remaining lines to collect multiline records
    for line in fh:
        if line.startswith(">"):
            # When in this block we have reached 
            #  the next FASTA record

            # yield the previous record's name and
            #  sequence as tuple that we can unpack
            yield name, "".join(fa_lines)

            # Reset the sequence lines and save the
            #  name of the next record
            fa_lines = []
            name = line[1:].strip()

            # skip to next line
            continue

        fa_lines.append(line.strip())

    yield name, "".join(fa_lines)

You can use this function like this:

file = "path/to/your/file.fa"
sequences = []

with open(file, "r") as file_handle:
    for name, seq in read_fasta(file_handle):
        sequences.append(seq)
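
Once each record has been joined into a single string, the profile matrix and consensus can be computed column by column. The following is only a minimal sketch, not the asker's original approach: the function name `profile_and_consensus` and the three sample sequences are hypothetical, and ties are broken in A, C, G, T order (the problem statement accepts any valid consensus).

```python
from collections import Counter

BASES = "ACGT"


def profile_and_consensus(sequences):
    """Return (profile, consensus) for a list of equal-length DNA strings."""
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    profile = {base: [] for base in BASES}
    consensus = []

    # zip(*sequences) yields one tuple per column across all sequences
    for column in zip(*sequences):
        counts = Counter(column)
        for base in BASES:
            profile[base].append(counts[base])
        # max() returns the first maximal base, so ties resolve as A, C, G, T
        consensus.append(max(BASES, key=lambda b: counts[b]))

    return profile, "".join(consensus)


# Hypothetical sample data, standing in for the list built by the parser
sequences = ["ATGC", "ATGG", "TTGC"]
profile, consensus = profile_and_consensus(sequences)

print(consensus)
for base in BASES:
    print(f"{base}: {' '.join(str(n) for n in profile[base])}")
```

This avoids NumPy and the per-base counter variables; the length check also surfaces a parsing problem early, since a record that was split across lines would show up as sequences of unequal length.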
Score: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/61278530
