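I am fairly new to Perl and BioPerl, and I am trying to write a script that identifies instances of identical sequences. To do this, I envision a script that takes two files: the first is a multiple alignment in FASTA format, and the second is an accessory file that links the FASTA IDs to other relevant information. My approach is to read the multiple alignment with Bio::SeqIO and put the file contents into a hash where the sequence is the key and the ID is the value or, when a sequence is shared, an array of IDs is the value.
I imagine it would look something like this:
"AATTTGTTGTTGTACC" => ('Seq1', 'Seq13'),
"TTTCTCTTTCCCAAAG" => 'Seq2',
At the moment I think I am stuck because my attempt to push the second ID onto the array when a sequence is shared is wrong ('Seq13' in the example above).
Here is the test multiple alignment I am using: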
>Seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>Seq2
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>Seq13
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the code I have written so far:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Data::Dumper;
my $seqs = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
my $info = shift @ARGV or die "please provide a multiple alignment file and an accesory information file: $!\n";
#open(INFO, '<', $info);
my $inseq = Bio::SeqIO->new(
    -file   => $seqs,
    -format => "fasta",
);
my %hts;
while (my $seq = $inseq->next_seq) {
    # print $seq->seq(), "\t", $seq->id, "\n";
    if (defined $hts{$seq->seq()}) {
        print "Sequence already in hash:\t$seq->id\n";
        push @{$hts{$seq->seq()}}, ${$seq->id};
    }
    else {
        $hts{$seq->seq()} = $seq->id;
    }
    print Dumper \%hts
}

Here is what I would like some help with:
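1) I get an error that I do not fully understand but believe is related to the push statement: Can't use string ("Seq1") as an ARRAY ref while "strict refs" in use at ht_sharing.pl line 24, line 3.
2) When the print statement outside the if block is active, it prints the ID as I expect (i.e. Seq1), but in the print statement inside the if block the same call, $seq->id, produces a reference (i.e. Bio::Seq=HASH(0x19e7210)->id). Why is that? I do not understand why printing $seq->id gives different output within the same while loop.
If anyone can clarify this, I would be very grateful. And since I am quite new to this, comments on best practices or better ways to solve the problem would also be welcome.
Cheers, Ana
Posted on 2013-11-11 23:56:02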
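Your code is very close, but there are a few small problems. First, to check whether a key is present, use the syntax if (exists $hash{$key}) { ... }; defined only tells you whether the value stored under that key is defined. Second, you are dereferencing the value returned by $seq->id for no reason.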
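To make the difference between exists and defined concrete, here is a minimal sketch in core Perl. The hash contents and the key_state helper are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# A key whose value is undef exists in the hash but is not defined.
my %hts = ( AATT => 'Seq1', GGCC => undef );

# Hypothetical helper: reports both tests for one key.
sub key_state {
    my ($hash, $key) = @_;
    my $exists  = exists($hash->{$key})  ? 'exists'  : 'missing';
    my $defined = defined($hash->{$key}) ? 'defined' : 'undef';
    return "$exists,$defined";
}

# Print the state of a stored key, an undef-valued key, and an absent key.
print "$_: ", key_state(\%hts, $_), "\n" for qw(AATT GGCC TTAA);
```

In the script from the question every stored value is defined, so both tests behave the same there; exists is simply the test that matches the intent of asking whether a sequence has been seen.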
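When you call the next_seq method on a Bio::SeqIO object, it returns a Bio::Seq object. Calling the id method on that Bio::Seq object returns the ID as expected, so nothing further needs to be done to it. Also, there is no need to import Bio::Seq explicitly (this is just a remark, not a problem).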
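Putting these points together, one way the loop could be written is sketched below. It also touches both numbered questions. The ARRAY ref error arises because the first ID is stored as a plain string, which push cannot later treat as an array; pushing every ID, including the first, avoids that. The odd output in the inner print arises because Perl interpolates only $seq inside double quotes and leaves "->id" as literal text. Demo::Seq is a hypothetical stand-in for Bio::Seq so the sketch runs without BioPerl, and group_ids_by_sequence is an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Bio::Seq; it offers only the id and seq
# accessors used in the question.
package Demo::Seq {
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub id  { $_[0]{id} }
    sub seq { $_[0]{seq} }
}

my @records = (
    Demo::Seq->new(id => 'Seq1',  seq => 'AATTTGTTGTTGTACC'),
    Demo::Seq->new(id => 'Seq2',  seq => 'TTTCTCTTTCCCAAAG'),
    Demo::Seq->new(id => 'Seq13', seq => 'AATTTGTTGTTGTACC'),
);

sub group_ids_by_sequence {
    my @seqs = @_;
    my %hts;
    for my $seq (@seqs) {
        if (exists $hts{ $seq->seq }) {
            # Call the method outside the quotes: "$seq->id" would only
            # interpolate $seq and append the literal text "->id".
            print "Sequence already in hash:\t", $seq->id, "\n";
        }
        # push autovivifies the array reference on first use, so every
        # value is an array ref and no else branch is needed.
        push @{ $hts{ $seq->seq } }, $seq->id;
    }
    return \%hts;
}

my $hts = group_ids_by_sequence(@records);
for my $sequence (sort keys %$hts) {
    print "$sequence => @{ $hts->{$sequence} }\n";
}
```

With real data, the for loop would presumably become the while (my $seq = $inseq->next_seq) loop from the question, with the same body.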
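Other comments:
- Move the print Dumper \%hts; call to after the while (my $seq ...) loop, that is, once all the seq objects have been processed. Dumping the hash while you are still going through the file is not very informative.
- Alternatively, you could do $hts{$seq->seq}++ and then look at the sorted values to see whether any sequence is repeated. That would be faster.

https://stackoverflow.com/questions/19895789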
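The counting idea from the last comment can be sketched as follows. The [id, sequence] pairs are hypothetical stand-ins for records read through Bio::SeqIO, and count_sequences is an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical [id, sequence] pairs standing in for Bio::Seq records.
my @records = (
    [ Seq1  => 'AATTTGTTGTTGTACC' ],
    [ Seq2  => 'TTTCTCTTTCCCAAAG' ],
    [ Seq13 => 'AATTTGTTGTTGTACC' ],
);

# Count how many times each sequence string occurs.
sub count_sequences {
    my @pairs = @_;
    my %count;
    $count{ $_->[1] }++ for @pairs;
    return \%count;
}

my $count = count_sequences(@records);

# List the most frequent sequences first; ties are broken alphabetically.
for my $sequence (sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count) {
    my $note = $count->{$sequence} > 1 ? 'duplicated' : 'unique';
    printf "%s\t%d\t%s\n", $sequence, $count->{$sequence}, $note;
}
```

This only reveals which sequences are repeated, not which IDs share them, so it suits a quick duplicate check rather than the ID lookup described in the question.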