下面的Perl代码可以工作,但是即使有相当多的计算机资源,它也不能很好地伸缩。我希望有人能帮助我找到更有效的代码,例如用迭代代替递归,如果这是问题所在。
我的数据结构是这样的:我的%REV_ALIGN;
$REV_ALIGN{$dna}{$rna} = ();
任何dna密钥都可以有多个RNA子密钥。相同的RNA子密钥可以与多个不同的dna密钥一起出现。其目的是基于共享的dna序列元素对rna (转录本)进行分组。例如,如果dnaA有RNA1、RNA8、RNA9和RNA4,而dnaB有RNA11、RNA4和RNA99,那么我们将所有这些转录本( RNA1、RNA9、RNA4、RNA11、RNA99 )分组在一起,并继续尝试通过选择其他dna来添加到组中。我对这个问题的递归解决方案有效,但当使用来自整个基因组的数据进行转录组比对时,规模不是很好。
所以我的问题是:对于这个问题,什么是更有效的解决方案?非常感谢
my @groups;
while ( my $x =()= keys %REV_ALIGN )
{
my @DNA = keys %REV_ALIGN;
my $dna = shift @DNA;
# the corresponding list of rna
my @RNA = keys %{$REV_ALIGN{$dna}};
delete $REV_ALIGN{$dna};
if ( $x == 1 )
{
push @groups, \@RNA;
last;
}
my $ref = group_transcripts ( \@RNA, \%REV_ALIGN );
push @groups, $ref;
}
sub group_transcripts
{
my $tran_ref = shift;
my $align_ref = shift;
my @RNA_A = @$tran_ref;
my %RNA;
# create a null hash with seed list of transcripts
@RNA{@RNA_A} = ();
# get a list of all remaining dna sequences in the alignment
my @DNA = keys %{$align_ref};
my %count;
# select a different list of transcripts
for my $dna ( @DNA )
{
next unless exists $align_ref->{$dna};
my @RNA_B = keys %{$align_ref->{$dna}};
# check to see two list share and transcripts
for my $element ( @RNA_A, @RNA_B )
{
$count{$element}++;
}
for my $rna_a ( keys %count )
{
# if they do, add any new transcripts to the current group
if ( $count{$rna_a} == 2 )
{
for my $rna_b ( @RNA_B )
{
push @RNA_A, $rna_b if $count{$rna_b} == 1;
}
delete $align_ref->{$dna};
delete $count{$_} foreach keys %count;
# recurse to try and continue adding to list
@_ = ( \@RNA_A, $align_ref );
goto &group_transcripts;
}
}
delete $count{$_} foreach keys %count;
}
# if no more transcripts can be added, return a reference to the group
return \@RNA_A;
}发布于 2015-10-07 03:33:29
您有一个嵌套了四个深度的循环。我敢打赌,这就是你的代码伸缩性差的原因。
如果我正确地理解了您想要完成的任务,那么输入
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" }, # \ Linked by RNA1 \
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" }, # / \ Linked by RNA3 > Group
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" }, # / /
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" }, # \ Linked by RNA5 \ Group
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" }, # / /
"DNA6" => { map { $_ => undef } "RNA8" }, # > Group
);应该会导致
my @groups = (
[
dna => [ "DNA1", "DNA2", "DNA3" ],
rna => [ "RNA1", "RNA2", "RNA3", "RNA4" ],
],
[
dna => [ "DNA4", "DNA5" ],
rna => [ "RNA5", "RNA6", "RNA7" ],
],
[
dna => [ "DNA6" ],
rna => [ "RNA8" ],
],
);如果是这样,您可以使用以下命令:
use strict;
use warnings;
use Graph::Undirected qw( );
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" },
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" },
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" },
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" },
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" },
"DNA6" => { map { $_ => undef } "RNA8" },
);
my $g = Graph::Undirected->new();
for my $dna (keys(%REV_ALIGN)) {
for my $rna (keys(%{ $REV_ALIGN{$dna} })) {
$g->add_edge("dna:$dna", "rna:$rna");
}
}
my @groups;
for my $raw_group ($g->connected_components()) {
my %group = ( dna => [], rna => [] );
for (@$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push @{ $group{$type} }, $val;
}
push @groups, \%group;
}
use Data::Dumper qw( Dumper );
print(Dumper(\@groups));如果您只需要RNA,最后一节将简化为以下内容:
my @groups;
for my $raw_group ($g->connected_components()) {
my @group;
for (@$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push @group, $val if $type eq 'rna';
}
push @groups, \@group;
}https://stackoverflow.com/questions/32975805
复制相似问题