Find all repeated 4-mers in a DNA sequence - Perl

Stack Overflow user
Asked 2017-06-28 07:59:44
Answers 1 · Views 525 · Followers 0 · Votes 4

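Hello,

I am trying to write a program that reads a FASTA-format file containing several DNA sequences, identifies all repeated 4-mers within a sequence (that is, every 4-mer that occurs more than once), and prints each repeated 4-mer together with the header of the sequence in which it was found. A k-mer is simply a sequence of k nucleotides (for example, "aaca", "gacg" and "tttt" are 4-mers).

Here is my code: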

use strict;
use warnings;

my $count = -1;
my $file = "sequences.fa";
my $seq = '';
my @header = ();
my @sequences = ();
my $line = '';
open (READ, $file) || die "Cannot open $file: $!.\n";

while ($line = <READ>){
    chomp $line;
    if ($line =~ /^>/){
        push @header, $line;
        $count++;
        unless ($seq eq ''){
            push @sequences, $seq;
            $seq = '';
        }
    } else {
        $seq .= $line;
    }
}   push @sequences, $line;

for (my $i = 0; $i <= $#sequences+1; $i++){
    if ($sequences[$i] =~ /(....)(.)*\g{1}+/g){
        print $header[$i], "\n", $&, "\n";
    }
}

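I have two requests. First, I don't know how to design the regex pattern to get the desired output. Second, and more importantly, I'm sure my code is very inefficient, so if there is a way to shorten it, please let me know.

Thanks in advance!

Here is a sample of the FASTA file (note that there is an extra blank line between the sequences, which is not the case in the original FASTA file):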

NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome
NC_001501.1 Enterobacteria phage phiX184 sensu ... complete genome

1 Answer

Stack Overflow user

Posted 2017-06-28 09:05:50

I would probably approach your problem more like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

#set paragraph mode. Iterate on blank lines. 
local $/ = ''; 

#read from STDIN or a file specified on command line, 
#e.g. cat filename_here | myscript.pl
#or myscript.pl filename_here
while ( <> ) {
   #capture the header line, and then remove it from our data block
   my ($header) = m/\>(.*)/;
   #/m lets ^ and $ match at line boundaries, so only the header line is removed
   s/^>.*$//m;

   #remove linefeeds and whitespace. 
   s/\s*\n\s*//g;
   #use lookahead pattern, so the data isn't 'consumed' by the regex. 
   my @sequences = m/(?=([atcg]{4}))/gi;

   #increment a count for each sequence found. 
   my %count_of;
   $count_of{$_}++ for @sequences;

   #print output. (Modify according to specific needs.)
   print $header,"\n";

   print "Found sequences:\n";
   print Dumper \@sequences;
   print "Count:\n";
   print Dumper \%count_of;

   #note - ordered, but includes duplicates. 
   #you could just use keys  %count_of, but that would be unordered. 
   foreach my $sequence ( grep { $count_of{$_} > 1 } @sequences ) {
      print $sequence, " => ", $count_of{$sequence},"\n";
   }
   print "\n";
}

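We iterate record by record, capture and remove the "header" line, and then join the rest of the record together. We then capture every (overlapping) sequence of 4 characters and count them.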
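To illustrate why the lookahead matters for the "overlapping" part, here is a minimal sketch that compares a plain global match with the lookahead form. The sample string `$dna` is a made-up value, not data from the question's FASTA file.

```perl
use strict;
use warnings;

# Hypothetical sample string, not taken from the question's FASTA file.
my $dna = 'AAAAAC';

# Plain global match: each match consumes four characters,
# so the next match can only start after the previous one ends.
my @non_overlapping = $dna =~ m/([acgt]{4})/gi;

# Zero-width lookahead: nothing is consumed, so the regex engine
# advances one position at a time and captures every 4-character window.
my @overlapping = $dna =~ m/(?=([acgt]{4}))/gi;

print "non-overlapping: @non_overlapping\n";
print "overlapping:     @overlapping\n";
```

With the plain match only one 4-mer is found in this string, whereas the lookahead form finds three, including `AAAA` twice, which is what makes the repeat detectable.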

For your sample data, this gives (first section only, for brevity):

NC_001422.1 Enterobacteria phage phiX174 sensu lato, complete genome 
Found sequences:
    GAGT => 2
    AGTT => 2
    TTAT => 2
    CATG => 2
    ATGA => 3
    TGAC => 2
    CGCA => 2
    AGTT => 2
    ACTT => 2
    tttt => 3
    tttt => 3
    tttt => 3
    GGAT => 2
    GATA => 2
    ATAT => 2
    TATT => 2
    ATGA => 3
    TGAG => 2
    GAGT => 2
    AAAA => 2
    AAAA => 2
    ACTT => 2
    TGAG => 2
    GGAT => 2
    GATA => 2
    tata => 2
    tata => 2
    TTAT => 2
    TATG => 2
    ATAT => 2
    TATT => 2
    GCCG => 2
    TATG => 2
    GCCG => 2
    CGCA => 2
    CATG => 2
    ATGA => 3
    TGAC => 2

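Note: because this loop walks the original list of captured sequences, the output follows the order of the data, and you will see TGAC in there twice, because it occurs in the data twice.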
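If the data order should be kept but each repeated 4-mer printed only once, one possible approach is a `%seen` hash in the `grep` filter. This is a sketch under assumptions: the `@sequences` list here is a hypothetical stand-in for the captures from one record, and `%seen` and `@repeated` are names introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical input: overlapping 4-mers as captured from one record.
my @sequences = qw(TGAC ATGA GGCC TGAC ATGA ATGA);

my %count_of;
$count_of{$_}++ for @sequences;

# $seen{$_}++ is false only the first time a 4-mer is encountered,
# so each repeated 4-mer passes the filter once,
# at the position of its first occurrence.
my %seen;
my @repeated = grep { $count_of{$_} > 1 && !$seen{$_}++ } @sequences;

print "$_ => $count_of{$_}\n" for @repeated;
```

Here `GGCC` is dropped because it occurs only once, and `TGAC` and `ATGA` are each printed a single time in the order they first appear.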

However, you could instead use:

   foreach my $sequence ( sort { $count_of{$b} <=> $count_of{$a} }
                          grep { $count_of{$_} > 1 } 
                                 keys %count_of ) {
      print $sequence, " => ", $count_of{$sequence},"\n";
   }
   print "\n";

This discards any 4-mer with fewer than 2 matches and sorts the remaining ones by frequency.
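The question mentions that the original FASTA file has no blank lines between sequences, in which case paragraph mode would read the whole file as a single record. A possible alternative, sketched here under assumptions, is to set the input record separator to `"\n>"` so that each record ends where the next header begins. The helper names `read_fasta` and `repeated_4mers` and the in-memory sample data are hypothetical and not part of the answer above.

```perl
use strict;
use warnings;

# Split FASTA text into [header, sequence] pairs without relying on blank lines.
sub read_fasta {
    my ($fh) = @_;
    local $/ = "\n>";              # a record ends where the next header begins
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;             # removes the trailing "\n>" separator, if any
        $record =~ s/^>//;         # only the first record still has its '>'
        my ( $header, @lines ) = split /\n/, $record;
        next unless defined $header;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;          # drop stray whitespace such as "\r"
        push @records, [ $header, $seq ];
    }
    return @records;
}

# Return the 4-mers that occur more than once, sorted alphabetically.
sub repeated_4mers {
    my ($seq) = @_;
    my %count_of;
    $count_of{$_}++ for $seq =~ m/(?=([acgt]{4}))/gi;
    return sort grep { $count_of{$_} > 1 } keys %count_of;
}

# Hypothetical sample data, read through an in-memory filehandle.
my $fasta = ">seq1 hypothetical\nACGTAC\nGTTT\n>seq2 hypothetical\nAAAAA\nCCGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

for my $record ( read_fasta($fh) ) {
    my ( $header, $seq ) = @$record;
    my @repeated = repeated_4mers($seq);
    print "$header: @repeated\n";
}
close $fh;
```

To run this against a real file, the in-memory `open` could be replaced with `open my $fh, '<', $filename`. This sketch assumes that `>` appears only at the start of header lines.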

Votes 5
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/44796788
