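I'd like to create a moving window of sums from a tab-delimited array of data organized by its fourth column. For simplicity, I've replaced the irrelevant fields with X and added the headers seen in the first row: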
ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
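X-1 [X] [X] 40 [X] [X] 5 [X...]
The first column contains a numeric ID joined by a hyphen to the count for that ID. The fourth column contains the start sites, which I want to use to group the data. The seventh column contains the number of locations, which I need to normalize the counts.
The value I want to sum for each row is the count from the ID column divided by the number of locations + 1 (e.g., 2500 for row 1, 13 for row 2, 34 for row 3). I then want to sum these count/(locations+1) values over every row whose column 4 value falls within a 20-unit window, starting with values 0-19, then 1-20, 2-21, and so on. For example, window 0 (column 4 values 0-19) would sum rows 1-3, window 1 would sum rows 2-4, window 2 would sum only row 4, and so on.
My ideal output would be two columns: the first with the start of each 20-unit window (0, 1, 2, ...) and the second with the sum for that window (2547, 47.3, etc. for the data above).
I wrote a Perl script that filters and organizes the data into this format, and I'd like to add code for the summation over 20-unit windows. As a Perl newbie, I'd greatly appreciate any help and explanation. I'm familiar with splitting and with arithmetic across columns, but I have no idea how to perform these operations across a moving window in an array. Thank you.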
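As a concrete illustration of the per-row value described in the question, here is a minimal sketch. It assumes the real rows are separated by literal tabs and that the count is the run of digits after the last hyphen in column 1; the sub name normalized_count is hypothetical and not part of the asker's script.

```perl
use strict;
use warnings;

# Hypothetical helper: takes one tab-delimited row and returns
# ( start, count / (locations + 1) ).
sub normalized_count {
    my ($row) = @_;
    chomp $row;
    my @fields = split /\t/, $row;

    # Column 1 is "ID-count"; take the digits after the last hyphen.
    my ($count) = $fields[0] =~ /-(\d+)\z/
        or die "no count found in '$fields[0]'";

    # Column 4 is the start site, column 7 the number of locations.
    my ( $start, $locations ) = @fields[ 3, 6 ];
    return ( $start, $count / ( $locations + 1 ) );
}

# Row 1 of the sample data: count 5000, start 0, 1 location.
my ( $start, $value ) = normalized_count("X-5000\tX\tX\t0\tX\tX\t1\tX");
print "$start\t$value\n";    # start 0, value 5000 / (1 + 1) = 2500
```

Summing these values per start site gives the numbers that each 20-unit window then adds together.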
Answered 2012-11-21 10:45:54
I hope I understood your question correctly. What do you think of these implementations?
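Solution 1: write to the output file each time a full window (20 entries) has been collected.
# Assuming that you have an array of sums (@sums) and the name of the output file ($filename)
# Assumption: the file helpers come from File::Slurp; append_file is used here
# because write_file would overwrite the file on every call
use File::Slurp qw(append_file);
my $window_no = 20;
my $window_sum = 0;
my @window_nos = ();
for (my $i = 1; $i <= @sums; $i++) {
    push(@window_nos, $i);
    $window_sum += $sums[$i - 1];    # add the current entry to the window total
    if ($i % $window_no == 0) {
        append_file($filename, join(',', @window_nos) . "\t" . $window_sum . "\n");
        $window_sum = 0;
        @window_nos = ();
    }
}
# Flush the last, partially filled window
if (@window_nos) {
    append_file($filename, join(',', @window_nos) . "\t" . $window_sum . "\n");
}
Solution 2: append the values to a scalar variable and use that variable to write to the output file once.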
# Assuming that you have an array of sums (@sums) and the name of the output file ($filename)
# Assumption: write_file comes from File::Slurp
use File::Slurp qw(write_file);
my $window_no = 20;
my $window_sum = 0;
my @window_nos = ();
my $file_contents = '';
for (my $i = 1; $i <= @sums; $i++) {
    push(@window_nos, $i);
    $window_sum += $sums[$i - 1];    # add the current entry to the window total
    if ($i % $window_no == 0) {
        $file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
        $window_sum = 0;
        @window_nos = ();
    }
}
# Flush the last, partially filled window
if (@window_nos) {
    $file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
}
write_file($filename, $file_contents);
Answered 2012-11-22 08:30:30
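Take a look at the code below and see whether it does what you want. It could probably be optimized, but I basically did a brute-force search over all starts that fall within the 20-unit window beginning at the current start.
Ken
Output: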
0-19: 2547.000000
1-20: 47.300000
20-39: 200.300000
30-49: 200.166667
40-59: 0.166667
Code:
use strict;
use warnings;
# Hash indexed by Start
# Each value contains the sum of all Counts / ( Locations + 1 ) for
# this Start value
my %sum;
while (<DATA>)
{
    # ignore comments
    next if /^\s*#/;
    my ( $id_count,undef,undef,$start,undef,undef,$numLocations ) =
        split ' ';
    my ($id,$count) = split '-',$id_count;
    $sum{$start} += $count / ( $numLocations + 1 );
}
# Sort numerically so that, for example, 100 does not come before 20
foreach my $start ( sort { $a <=> $b } keys %sum )
{
    my $totalSum = 0;
    # Could probably be optimized.
    foreach my $start2 ( $start .. $start+19 )
    {
        $totalSum += $sum{$start2} if defined($sum{$start2});
    }
    printf "%d-%d: %f\n", $start, $start+19, $totalSum;
}
__DATA__
#ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
X-1 [X] [X] 40 [X] [X] 5 [X...]
Answered 2012-11-23 00:25:28
How about this?
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my %sum_for;
while ( my $line = <DATA> ) {
    if ( $line !~ m{\A [#] }xms ) {
        $line =~ s{\A \s* ( [^-]+ ) - }{$1 }xms;    # separate the ID
        my @columns = split /\s+/, $line;    # assumes no space in values
        my $count = $columns[1];
        my $start = $columns[4];
        my $locat = $columns[7] + 1;
        $sum_for{$start} += $count / $locat;
    }
}
print Dumper( \%sum_for );
my @start_ranges;
{
    my ($max_start) = sort { $b <=> $a } keys %sum_for;
    # max => range count
    # 10 => 1
    # 20 => 2
    # 30 => 2
    # 40 => 3
    # 50 => 3
    # ...
    my $range_count = $max_start / 20;
    push @start_ranges, [ 0, 19 ];
    for ( 1 .. $range_count ) {
        push @start_ranges, [ map { $_ + 20 } @{ $start_ranges[-1] } ];
    }
}
my %total_for;
for my $range_ra (@start_ranges) {
    my $range_key = sprintf '%d-%d', @{$range_ra};
    for my $start ( $range_ra->[0] .. $range_ra->[1] ) {
        if ( exists $sum_for{$start} ) {
            $total_for{$range_key} += $sum_for{$start};
        }
    }
}
print Dumper( \%total_for );
__DATA__
#ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
X-1 [X] [X] 40 [X] [X] 5 [X...]
The output is as follows:
$VAR1 = {
'1' => 47,
'40' => '0.166666666666667',
'0' => 2500,
'30' => 200,
'20' => '0.3'
};
$VAR1 = {
'40-59' => '0.166666666666667',
'20-39' => '200.3',
'0-19' => 2547
};
The part about calculating the start ranges takes some thought. Thanks for the interesting question.
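For the overlapping windows the question describes (0-19, 1-20, 2-21, ...), the range bookkeeping can be sidestepped by sliding a running total one unit at a time: subtract the start that drops out of the window and add the one that enters. What follows is only a sketch under stated assumptions, not a drop-in replacement: it assumes start values are non-negative integers, that per-start sums have already been collected in a hash like %sum_for, and that Perl 5.10 or later is available for the // operator. The sub name sliding_window_sums and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [window_start, window_sum] pairs for every
# window start from 0 up to the largest Start value present in the data.
sub sliding_window_sums {
    my ( $sum_for, $width ) = @_;
    my ($max_start) = sort { $b <=> $a } keys %{$sum_for};
    return if !defined $max_start;    # no data, no windows

    # Total for the first window, starts 0 .. $width - 1.
    my $running = 0;
    $running += $sum_for->{$_} // 0 for 0 .. $width - 1;

    my @windows = ( [ 0, $running ] );
    for my $first ( 1 .. $max_start ) {
        $running -= $sum_for->{ $first - 1 } // 0;             # start leaving the window
        $running += $sum_for->{ $first + $width - 1 } // 0;    # start entering the window
        push @windows, [ $first, $running ];
    }
    return @windows;
}

# Per-start sums for the sample rows in the question.
my %sum_for = ( 0 => 2500, 1 => 47, 20 => 0.3, 30 => 200, 40 => 1 / 6 );

# Two columns: window start, window sum.
for my $window ( sliding_window_sums( \%sum_for, 20 ) ) {
    printf "%d\t%.6f\n", @{$window};
}
```

With the sample data the first two lines are 0 with 2547.000000 and 1 with 47.300000, matching the figures in the question. Because the total is updated by repeated addition and subtraction, floating-point rounding can accumulate over very long inputs; recomputing each window from scratch, as in the brute-force answer, avoids that at the cost of more work per window.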
https://stackoverflow.com/questions/13484336