Q: Determining polypurine tract length

Stack Overflow user

Asked 2014-08-09 04:41:30

2 answers · 211 views · 0 followers · score 2

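How can I determine/find the longest polypurine tract in any genome (consecutive As and Gs with no interspersed C or T, or vice versa)? This needs to be done on the E. coli genome. Is the idea to find all the polypurine tracts and then pick the longest one? Or does it involve splicing the introns and exons out of the DNA? Since the E. coli genome is 4.6 million bp long, I need some help breaking this down.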

python

skbio

2 Answers

Stack Overflow user

Answered 2014-08-12 02:04:08

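I agree that the methodological side of this question (i.e., whether introns/exons should be removed, etc.) is better suited to https://biology.stackexchange.com/, but briefly: that depends entirely on the biological question you are trying to answer. If you care whether these stretches span intron/exon boundaries, then you should not split the sequence first. However, I'm not sure this is relevant for E. coli sequences, because (as far as I know) introns and exons are specific to eukaryotes.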

To address the technical side of the question, here is some code showing how to do this with scikit-bio. (I have also posted this as a recipe in the scikit-bio cookbook here.)

from __future__ import print_function
import itertools
from skbio import parse_fasta, NucleotideSequence

# Define our character sets of interest. We'll define the set of purines and pyrimidines here. 

purines = set('AG')
pyrimidines = set('CTU')


# Obtain a single sequence from a fasta file. 

id_, seq = list(parse_fasta(open('data/single_sequence1.fasta')))[0]
n = NucleotideSequence(seq, id=id_)
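
If `parse_fasta` is not available in the scikit-bio version you have installed, the same step can be approximated with the standard library alone. The following is only a sketch under that assumption: `read_first_fasta_record` is a hypothetical helper (not part of scikit-bio), and the in-memory `example` stream stands in for a real genome file. It joins the wrapped sequence lines of the first record into one uppercase string, which is all the run-finding code needs.

```python
import io


def read_first_fasta_record(handle):
    """Return (identifier, sequence) for the first record in a FASTA stream.

    Hypothetical stand-in for a FASTA parser; not a scikit-bio function.
    """
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                break  # a second record starts here; stop after the first
            # the identifier is the first whitespace-delimited token of the header
            identifier = line[1:].split()[0] if len(line) > 1 else ""
        elif identifier is not None:
            chunks.append(line.upper())
    if identifier is None:
        raise ValueError("no FASTA header line found")
    return identifier, "".join(chunks)


# Toy data standing in for a real file such as open('genome.fasta')
example = io.StringIO(">toy_genome example record\nctagga\nCTTAGA\n>second\nAAAA\n")
seq_id, seq = read_first_fasta_record(example)
print(seq_id, len(seq))
```

A 4.6 Mbp genome read this way is a string of a few megabytes, so it should fit comfortably in memory on an ordinary machine without any chunking.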


# Define a ``longest_stretch`` function that takes a ``BiologicalSequence`` object and the characters of interest, and returns the length of the longest contiguous stretch of the characters of interest, as well as the start position of that stretch of characters. (And of course you could compute the end position of that stretch by summing those two values, if you were interested in getting the span.)

def longest_stretch(sequence, characters_of_interest):
    # initialize some values
    current_stretch_length = 0
    max_stretch_length = 0
    current_stretch_start_position = 0
    max_stretch_start_position = -1

    # this recipe was developed while reviewing this SO answer:
    # http://stackoverflow.com/a/1066838/3424666
    for is_stretch_of_interest, group in itertools.groupby(sequence, 
                                                           key=lambda x: x in characters_of_interest):
        current_stretch_length = len(list(group))
        if is_stretch_of_interest:
            if current_stretch_length > max_stretch_length:
                max_stretch_length = current_stretch_length
                # record the start before advancing past this group
                max_stretch_start_position = current_stretch_start_position
        # advance to the start of the next group
        current_stretch_start_position += current_stretch_length
    return max_stretch_length, max_stretch_start_position


# We can apply this to find the longest stretch of purines...

longest_stretch(n, purines)


# We can apply this to find the longest stretch of pyrimidines...

longest_stretch(n, pyrimidines)


# Or the longest stretch of some other character or characters.

longest_stretch(n, set('N'))


# In this case, we try to find a stretch of a character that doesn't exist in the sequence.

longest_stretch(n, set('X'))
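
The calls above depend on a FASTA file that is not included here, so their output cannot be shown. As a quick sanity check, the same `itertools.groupby` idea can be tried on a plain Python string. This is a minimal sketch, not the recipe itself: `longest_stretch_str` and the `toy` sequence are made up for illustration, and the reported start is the 0-based index of the first character of the run.

```python
import itertools

PURINES = set("AG")


def longest_stretch_str(sequence, characters_of_interest):
    """Return (length, start) of the longest run of the given characters.

    Returns (0, -1) when none of the characters occur in the sequence.
    """
    best_length, best_start = 0, -1
    position = 0  # 0-based index where the current group begins
    for is_match, group in itertools.groupby(
            sequence, key=lambda c: c in characters_of_interest):
        run_length = sum(1 for _ in group)
        if is_match and run_length > best_length:
            best_length, best_start = run_length, position
        position += run_length  # move to the start of the next group
    return best_length, best_start


toy = "CTAGGACTTAGAGGAAC"
print(longest_stretch_str(toy, PURINES))  # (7, 9): the run AGAGGAA
```

Regarding the "or vice versa" part of the question: a polypyrimidine run on one strand pairs with a polypurine run on the complementary strand, so calling the function once with purines and once with pyrimidines on the same strand covers both cases.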

Score 3

Stack Overflow user

Answered 2014-10-09 01:33:17

There is now a method for this on the BiologicalSequence class (and its subclasses) in the development version of scikit-bio, called find_features. For example:

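from skbio.sequence import DNASequence

# some_long_string is a placeholder for your genome sequence as a str
my_seq = DNASequence(some_long_string)
for run in my_seq.find_features('purine_run', min_length=10):
    print(run)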

or

my_seq = DNASequence(some_long_string)
all_runs = list(my_seq.find_features('purine_run', min_length=10))
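
If the development version of scikit-bio is not an option, a similar enumeration of runs can be sketched with the standard `re` module. This is an assumption-laden stand-in, not how `find_features` is implemented: `purine_runs` is a hypothetical helper, it yields plain `(start, end, text)` tuples rather than whatever objects scikit-bio returns, and it requires `min_length` to be at least 1.

```python
import re


def purine_runs(sequence, min_length=10):
    """Yield (start, end, run) for every purine run of at least min_length.

    Positions are 0-based and end is exclusive, as in Python slicing.
    Hypothetical helper; not part of scikit-bio.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    pattern = re.compile(r"[AG]{%d,}" % min_length)
    for match in pattern.finditer(sequence.upper()):
        yield match.start(), match.end(), match.group()


demo = "ccAGGAtcAGAGGAAGGAGAct"
runs = list(purine_runs(demo, min_length=4))
longest = max(runs, key=lambda r: r[1] - r[0])
print(runs)
print(longest)
```

Taking `max` over the runs by length gives the longest tract directly; swapping the character class for `[CT]` would do the same for pyrimidine runs.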

Score 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/25211905
