文章/答案/技术大牛

发布

社区首页 >问答首页 >生物同步比对的无重叠索引

问生物同步比对的无重叠索引
EN

Stack Overflow用户

提问于 2017-10-02 23:50:49

回答 3查看 441关注 0票数 2

我第一次使用生物电影。如果这是一个基本问题，请原谅我。

我想输入序列，然后对齐它们，并能够引用原始序列的索引位置(未执行)和对齐序列(间隙)。

我的实际例子是enolase (Uniprot、P37869和P0A6P9)。底物结合赖氨酸在大肠杆菌中为392，在枯草杆菌中为389。如果对两者进行肌肉排列，那么这个赖氨酸在排列中的指数是394。我想要能够很容易地转换之间的庸俗和未被起诉。

示例1:大肠杆菌残留物#392的对齐指数是多少？(按对齐顺序回答394 )。

示例2我在394的对齐中发现了一个保守的残基。在最初的(无间隙)序列中那是在哪里？在大肠杆菌中回答392，在枯草杆菌中回答389 )。

>>>cline = MuscleCommandline(input="enolase.txt", out="enolase.aln")
>>>cline()

>>> alignment = AlignIO.read("enolase.aln","fasta")
>>> print(alignment[:,390:])
SingleLetterAlphabet() alignment with 2 rows and 45 columns
AGQIKTGAPSRTDRVAKYNQLLRIEDQLAETAQYHGINSFYNLNK sp|P37869|ENO_BACSU
AGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYNGRKEIKGQA- sp|P0A6P9|ENO_ECOLI
>>> print(alignment[:,394])
KK

sequence-alignment

python

bioinformatics

biopython

回答 3

Stack Overflow用户

发布于 2017-10-03 11:45:32

有趣的问题！据我所知，BioPython中并没有内置的东西。这就是我要解决的问题。

让我们从您的示例2开始。如果您以FASTA格式分别使用原始无重叠序列和对齐序列的两个文件enolase.txt和enolase.aln，那么我们可以循环遍历经过压缩的记录，计算对齐序列中的间隙数，并计算无重叠序列中残差的索引：

from Bio import SeqIO, AlignIO

def find_in_original(index, original_path, alignment_path):
    alignment = AlignIO.read(alignment_path, 'fasta')
    original = SeqIO.parse(original_path, 'fasta')

    for original_record, alignment_record in zip(original, alignment):
        alignment_seq = str(alignment_record.seq)
        original_seq = str(original_record.seq)

        gaps = alignment_seq[:index].count('-')
        original_index = len(alignment_seq[:index]) - gaps
        assert alignment_seq[index] ==  original_seq[original_index]
        yield ("The conserved residue {} at location {} in the alignment can be"
               " found at location {} in {}.".format(alignment_seq[index],
               index, original_index, original_record.id.split('|')[-1]))

其结果是：

>>> for result in  find_in_original(394, 'enolase.txt', 'enolase.aln'):
...     print result
The conserved residue K at location 394 in the alignment can be found at location 392 in ENO_ECOLI.
The conserved residue K at location 394 in the alignment can be found at location 389 in ENO_BACSU.

对于反向操作，我们查看对齐中所有可能的索引，并查看如果减去间隙，哪一个等于无间隔序列：

def find_in_alignment(index, organism, original_path, alignment_path):
    alignment = AlignIO.read(alignment_path, 'fasta')
    original = SeqIO.parse(original_path, 'fasta')

    for original_record, alignment_record in zip(original, alignment):
        if organism in original_record.id:
            alignment_seq = str(alignment_record.seq)
            original_seq = str(original_record.seq)

            residue = original_seq[index]
            for i in range(index, len(alignment_seq)):
                if alignment_seq[i] == residue and \
                   alignment_seq[:i].replace('-', '') == original_seq[:index]:
                    return ("The residue {} at location {} in {} is at location"
                            " {} in the alignment.".format(residue, index,
                            organism, i))

其结果是：

>>> print find_in_alignment(392, 'ENO_ECOLI', 'enolase.txt', 'enolase.aln')
The residue K at location 392 in ENO_ECOLI is at location 394 in the alignment.
>>> print find_in_alignment(389, 'ENO_BACSU', ungapped_path, alignment_path)
The residue K at location 389 in ENO_BACSU is at location 394 in the alignment.

票数 2

Stack Overflow用户

发布于 2020-12-03 01:50:19

def original_index_to_aln_index(aln_seq, index):
    ''' given aln seq and original 0-based index, return new index'''
    pos = index 
    n_char_before = pos - aln_seq[:pos].count('-') # how many character before position
    while n_char_before < index: # while it has not reach the number of char before
        missed_char = index - n_char_before
        pos = pos +  missed_char# if it has not reach, shift at least lacked char
        n_char_before = pos - aln_seq[:pos].count('-')
        
    
    while aln_seq[pos] == '-': # if the last character is gap, shift 1 until meet char
        pos += 1
    return pos
    ```

票数 0

Stack Overflow用户

发布于 2022-04-18 20:48:44

与其循环遍历，您可以简单地通过逻辑索引计算差距来生成索引。在这里，空白用“-”来表示。

>>> x = np.asarray(list('AGQ--IKTGA-LAETAQYHG-INS-FYNL-NK'))  # an aligned sequence
>>> x
array(['A', 'G', 'Q', '-', '-', 'I', 'K', 'T', 'G', 'A', '-', 'L', 'A',
       'E', 'T', 'A', 'Q', 'Y', 'H', 'G', '-', 'I', 'N', 'S', '-', 'F',
       'Y', 'N', 'L', '-', 'N', 'K'], dtype='<U1')

索引后对齐与对齐序列的长度只是0：

>>> idx_post = np.arange(x.size)
>>> idx_post 
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

对于预对准，我们需要跳过空隙。这里，我生成一个与对齐序列大小相同的nan矩阵。然后将原始序列=/=间隙字符的值替换为它的索引。

>>> idx_pre = np.nan * np.ones(x.size)
>>> idx_pre
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan])
>>> idx_pre[x != '-'] = np.arange(np.sum(x != '-'))
>>> idx_pre
array([ 0.,  1.,  2., nan, nan,  3.,  4.,  5.,  6.,  7., nan,  8.,  9.,
       10., 11., 12., 13., 14., 15., 16., nan, 17., 18., 19., nan, 20.,
       21., 22., 23., nan, 24., 25.])

这就给您留下了两个与对齐矩阵大小相同的索引矩阵，使您可以在对齐之前轻松地找到前一个索引。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/46535260

复制

相似问题

问生物同步比对的无重叠索引
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问生物同步比对的无重叠索引EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问生物同步比对的无重叠索引
EN