我有一个包含blast输出的巨大文件,我需要选择查询ID、主题gi和框架(基本上是整行),而最小的e值忽略了重复行(省略了其他较高e值的所有其他行)。文件的外观如下:
# BLASTX 2.2.28+
# 0 hits found
# BLASTX 2.2.28+
# Query: Tx6_c1_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 24 hits found
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1 9 173 224 282 1e-06 65 32.20 3/0
Tx6_c1_seq1 577332 RecName: Full=Putative E3 ubiquitin-protein ligase SINAT1; AltName: Full=Seven in absentia homolog 1 305 1 9 179 111 171 3e-05 67 32.79 3/0
Tx6_c1_seq1 3548505 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 419 2 9 173 209 267 8e-05 65 32.20 3/0
Tx6_c1_seq1 577547 RecName: Full=E3 ubiquitin-protein ligase siah2; AltName: Full=Seven in absentia homolog 2; AltName: Full=Xsiah-2 313 1 15 173 125 181 2e-04 62 29.82 3/0
Tx6_c1_seq1 577417 RecName: Full=E3 ubiquitin-protein ligase Siah1; AltName: Full=Seven in absentia homolog 1; Short=Siah-1 282 1 15 173 96 152 3e-04 62 29.82 3/0
Tx6_c1_seq1 577554 RecName: Full=E3 ubiquitin-protein ligase SINAT2; AltName: Full=Seven in absentia homolog 2 308 1 9 179 114 174 4e-04 67 31.15 3/0
# BLASTX 2.2.28+
# Query: Tx_11_c0_seq1
# Database: /mnt/swissprot
# Fields: query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames
# 1 hits found
Tx_11_c0_seq1 977285 RecName: Full=120.7 kDa protein in NOF-FB transposable element 1056 15 957 28 147 455 8e-13 79 27.81 -2/0
# BLASTX 2.2.28+
# Query: Tx_11_c1_seq1本例中的预期输出应该仅为这两行,因为它们具有最小的e_value:
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1 9 173 224 282 1e-06 65 32.20 3/0
Tx_11_c0_seq1 977285 RecName: Full=120.7 kDa protein in NOF-FB transposable element 1056我已经编写了我的代码,但似乎不起作用。你们能帮我解决这个问题吗。我非常感谢你的时间和帮助。到目前为止,这就是我所拥有的:
#!/usr/bin/perl -w
# Author:
# 01/07/2014
# This script removes duplicate records from a "short" format BLAST output file, and keeps only the "best" records (sorts by smallest e-value and then biggest percent identity)
# Usage: bestblast.pl <input file> <output file>
#-----------------------------------------------------------------------------------------------------------------------------------------
#Deal with passed parameters
#-----------------------------------------------------------------------------------------------------------------------------------------
#If no arguments are passed, show usage message and exit program.
if ($#ARGV == -1) {
usage("BLAST BEST 1.0 2014");
exit;
}
#get the names of the input file (first argument passed) and output file (second argument passed)
$in_file = $ARGV[0];
$out_file = $ARGV[1];
#Open the input file for reading, open the output file for writing.
#If either are unsuccessful, print an error message and exit program.
unless ( open(IN, "$in_file") ) {
usage("Got a bad input file: $in_file");
exit;
}
unless ( open(OUT, ">$out_file") ) {
usage("Got a bad output file: $out_file");
exit;
}
#Everything looks good. Print the parameters we've found.
print "Parameters:\ninput file = $in_file\noutput file = $out_file\n\n";
#-----------------------------------------------------------------------------------------------------------------------------------------
#The main event
#-----------------------------------------------------------------------------------------------------------------------------------------
$counter = 0;
$total_counter = 0;
print "De-duplicating File...\n";
@in = <IN>;
#Do stuff for each line of text in the input file.
foreach $line (@in) {
#if the line starts with a pound symbol, it is not real data, so skip this line.
if ( $line =~ /^#/ ) {
next;
}
#Count the total number of data lines in the file.
$total_counter++;
#The chomp commands removes any new line (and carriage return) characters from the end of the line.
chomp($line);
#Split up the tab delimited line, naming only the variables we are interested in (i.e. query id, subject gi, subject title, subject length, gap opens, q. start, q. end, s. start, s. end, evalue, % subject coverage, % identity, query/sbjct frames)
($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $evalue, $subject_coverage, $identity, $query_sbjct_frames) = split(/\t/, $line);
#check to see if the id label is already in the list of ids (called dedupe)
#if its not there, add it.
if ( $dedupe{$query_id} ) {
#if it is, look at the old line to see if it is still "better" than the new one.
($query_id, $subject_gi, $subject_title, $subject_length, $gap_opens, $q_start, $q_end, $s_start, $s_end, $list_evalue, $subject_coverage, $list_identity, $query_sbjct_frames) = split(/\t/,$dedupe{$query_id});
#if the new evalue is better than the old one, change the value of this id to the new line.
#otherwise, if the new evalue is the same, and the percent_identity is better, change the value of this id to the new line.
#otherwise, don't do anything (keep the old line).
if ( $evalue < $list_evalue ) {
$dedupe{$query_id} = $line;
}
elsif ( $evalue == $list_evalue ) {
if ( $identity > $list_identity ) {
$dedupe{$query_id} = $line;
}
}
}
else {
$dedupe{$query_id} = $line;
#count the number of non-duplicated lines we have.
$counter++;
}
}
print "Total # records = $total_counter\nBest only # records = $counter\n";
print "Writing to output file...\n";
#Print the final "dedupe" list to the new file (adding the new line back on the end).
foreach $query_id (sort keys %dedupe) {
print OUT "$dedupe{$query_id}\n";
}
#Close the files.
close(IN);
close(OUT);
print "Done.\n";
#-----------------------------------------------------------------------------------------------------------------------------------------
#Subroutines
#-----------------------------------------------------------------------------------------------------------------------------------------
sub usage {
my($message) = @_;
print "\n$message\n";
print "\nThis script removes duplicate records from a \"short\" format BLAST output file, and keeps only the \"best\" records.\nIt sorts by smallest e-value and then biggest percent identity.\n";
print "Usage: bestbenter code herelast.pl <input file> <output file>\n";
print "\n Author \n";
print "01/07/2014\n";
}发布于 2014-01-08 19:14:03
试着在顶部添加“使用严格”一词--它可能会帮助定位其他的东西。
尝试将"if($dedupe{$query_id})“替换为”if(定义($dedupe{$query_id}))“
请记住,大多数这样的人不是生物学家/基因组学家(!)也不知道你在说什么,我们看到的只是数字和文字对我们来说毫无意义,所以如果你能解释得更好,我们也许能提供更多的帮助。
顺便说一句,以下是更多的习语。
next if $line =~ /^#/;您的代码总是从第64行到第81行。它从来没有进入二次测试-它从来没有找到重复的。尝试在调试器中运行它,如下所示:
perl -d yourprog INFILE OUTFILE然后重复执行"n“表示”下一行“。您可以使用"p变量名“打印变量值。
如果我将拆分()的分隔符从选项卡更改为空格,则得到以下内容--至少有正确的输出记录数:
De-duplicating File...
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "AltName:" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "in" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "Full=Seven" isn't numeric in numeric lt (<) at ./go line 71, <IN> line 21.
Argument "homolog" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Argument "absentia" isn't numeric in numeric gt (>) at ./go line 75, <IN> line 21.
Total # records = 7
Best only # records = 2
Writing to output file...
Done.
iMac:~/tmp: more out
Tx6_c1_seq1 6439823 RecName: Full=E3 ubiquitin-protein ligase siah-1; AltName: Full=Seven in absentia homolog 1 434 1 9 173 224 282 1e-06 65 32.20 3/0
Tx_11_c0_seq1 977285 RecName: Full=120.7 kDa protein in NOF-FB transposable element 1056 15 957 28 147 455 8e-13 79 27.81 -2/0发布于 2014-01-08 19:24:39
你为什么要写你自己的爆炸解析器。使用BioPerl
problems
我不再过多地使用Perl了,但是下面是关于该做什么的粗略想法
while (my $result = $report->next_result) {
print "Query: ".$result->query_name."\n";
while (my $hit = $result->next_hit) {
while ($hsp = $hit->next_hsp) {
my evalue = $hsp->evalue;
#convert to decimal notation
$decimal_notation = sprintf("%.10g", $scientific_notation);
##... i'll leave the rest up to you
}
}
}发布于 2014-01-08 19:19:37
evalue采用科学的表示法,perl在进行比比较少的操作时会把它当作字符串对待。
我想我也会做些不同的事.
https://stackoverflow.com/questions/21004024
复制相似问题