首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >由于大数组而导致脚本失败

由于大数组而导致脚本失败
EN

Stack Overflow用户
提问于 2014-10-10 16:28:55
回答 2查看 139关注 0票数 0

我的当前脚本通过PDB中的数据运行,并将它们存储到数组中,然后将这些数组用于脚本的其余部分。这个脚本在一个小PDB文件上运行得很好,但是当我使用一个真正的PDB文件时,我只使用一个PDB文件上的所有计算机内存。我有2000份PDB文件,我需要计算。

这是我目前的完整剧本,有几个笔记。

完整脚本:

代码语言:javascript
复制
#!/usr/bin/perl

use warnings;
use strict;

#my $inputfile = $ARGV[0];
#my $inputfile = '8ns_emb_alt_test.pdb';
my $inputfile = '8ns_emb_alt_101.pdb';

open( INPUTFILE, "<", $inputfile ) or die $!;

my @array = <INPUTFILE>;

### Protein
my $protein = 'PROT';
my @protx;
my @proty;
my @protz;

for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$protein\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @protx, $splitline[5];    #This has 2083 x-coordinates
        push @proty, $splitline[6];    #this has 2083 y-coordinates
        push @protz, $splitline[7];    #this has 2083 z-coordinates
    }
}

### Lipid
my $lipid1 = 'POPS';
my $lipid2 = 'POPC';
my @lipidx;
my @lipidy;
my @lipidz;

for ( my $line = 0; $line <= $#array; ++$line ) {
    if ( ( $array[$line] =~ m/\s+$lipid1\s+/ ) || ( $array[$line] =~ m/\s+$lipid2\s+/ ) ) {
        chomp $array[$line];
        my @splitline = ( split /\s+/, $array[$line] );
        push @lipidx, $splitline[5];    #this has approximately 35,000 x coordinates
        push @lipidy, $splitline[6];    #same as above for y
        push @lipidz, $splitline[7];    #same as above for z
    }
}

### Calculation
my @deltaX = map {
    my $diff = $_;
    map { $diff - $_ } @lipidx
} @protx;    #so this has 2083*35000 data x-coordinates
my @squared_deltaX = map { $_ * $_ } @deltaX;    #this has all the x-coordinates squared from @deltaX

my @deltaY = map {
    my $diff = $_;
    map { $diff - $_ } @lipidy
} @proty;
my @squared_deltaY = map { $_ * $_ } @deltaY;

my @deltaZ = map {
    my $diff = $_;
    map { $diff - $_ } @lipidz
} @protz;
my @squared_deltaZ = map { $_ * $_ } @deltaZ;

my @distance;
for ( my $ticker = 0; $ticker <= $#array; ++$ticker ) {
    my $distance_calc = sqrt( ( $squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker] ) );
    push @distance, $distance_calc;
}    #this runs the final calculation and computes all the distances between the atoms

### The Hunt
my $limit = 5;
my @DistU50;
my @resid_tagger;

for ( my $tracker = 0; $tracker <= $#array; ++$tracker ) {
    my $dist = $distance[$tracker];
    if ( ( $dist < $limit ) && ( $array[$tracker] =~ m/\s+$protein\s+/ ) ) {
        my @splitline = ( split /\s+/, $array[$tracker] );
        my $LT50 = $dist;
        push @resid_tagger, $splitline[4];    #selects stores a selected index number
        push @DistU50,      $LT50;            #stores the values within the $limit
    }
} #this 'for' loop search for all the elements in the '@distance' and pushes them to the final arrays and also makes an array of certain index numbers in to another array.

### Le'Finali
print "@resid_tagger = resid \n";
print "5 > @DistU50 \n";

close INPUTFILE;

我的一位实验室朋友说,我可以将一些数据存储在文件中,这样就可以减少内存。我认为这是个好主意,但我不知道哪里是最有效率的地方,我要做多少次。我这样做是因为这是我知道如何做到这一点的最好方法。

如果有人能给我举一个例子,说明我可以把数组转换成文件,然后再次使用该文件中的数据,那将是非常有帮助的。否则,如果有人有一个想法,我可以查,事情可以尝试或只是建议,至少会让我开始在某个地方。

EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2014-10-10 17:48:05

编辑.

你们两个..。我们应该先检查一下。您可能需要查看已经存在的用于处理和操作PDB数据的perl模块。

  • http://search.cpan.org/~rulix/Bio-PDB-Structure-0.02/lib/Bio/PDB/Structure.pm
  • http://www.iu.a.u-tokyo.ac.jp/~tterada/softwares/pdb.html
  • http://www.perlmol.org/pod/Chemistry/File/PDB.html
  • http://comp.chem.nottingham.ac.uk/parsepdb/
  • http://www.perl.com/pub/2001/11/16/perlbio2.html
  • https://www.biostars.org/p/89300/ (论坛帖子,而不是图书馆)

好吧..。所以perl不是我的第一语言,我也不知道你的数据是什么样子的。

编辑:--我的测试数据中有奇怪的行。这里有两套代码..。一个在空格上拆分,另一个使用预期/已知的列位置和长度来确定值。

代码语言:javascript
复制
#!/usr/bin/perl
use strict;
use warnings;

use DBI; 

my $db   = 'test'; 
my $host = 'localhost'; 
my $user = 'root';
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
            or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\folder\with\datums';

my @filenames = ('glucagon');  # i am using this as my table name, too 
my $colnum    = 12;  # number of columns in data, I assumed this was fixed

my @placehoders; 
for (1..$colnum) { push @placehoders, '?'; } 
my $placeholders = join(',',@placehoders);   # builds a string like:  ?, ?, ?, ?, ...
                                             # for our query that uses binding 
foreach my $file (@filenames) {
   my $filename = "$localpath\\$file.txt"; 

   if (open(my $fh => $filename)) {  

      # the null at the start of the insert is because my first column is an 
      # auto_incrememnt primary key that will be generated on insert by the db 
      my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");

      while(my $line = <$fh>) {
         $line =~ s/\s+$//; # trim whitespace

         if ($line ne q{}) {  # if not totally blank
            my @row = split(/ +/,$line); # split on whitespace

            for my $index (1..$colnum) { 
               $stmt->bind_param($index, $row[$index]);
               $index++;
            }
            $stmt->execute(); 
         }
      }
      close $fh; 
   }
   else {  print "$file not opened\n";  }
}  
代码语言:javascript
复制
-- i didn't know appropriate names for any of it 
create table glucagon (
   row_id   int unsigned auto_increment primary key,
   name    varchar(10),
   seq     int,
   code1   varchar(5),
   code2   varchar(5),
   code3   varchar(5),
   code4   int,
   val1    decimal(10,2),
   val2    decimal(10,2),
   val3    decimal(10,2),
   val4    decimal(10,2),
   val5    decimal(10,2),
   code5   varchar(5)
)

C:\path\to\folder\with\datums\glucagon.txt中可以找到以下内容

代码语言:javascript
复制
ATOM   1058  N   ARG A 141      -6.466  12.036 -10.348  7.00 19.11           N
ATOM   1059  CA  ARG A 141      -7.922  12.248 -10.253  6.00 26.80           C
ATOM   1060  C   ARG A 141      -8.119  13.499  -9.393  6.00 28.93           C
ATOM   1061  O   ARG A 141      -7.112  13.967  -8.853  8.00 28.68           O
ATOM   1062  CB  ARG A 141      -8.639  11.005  -9.687  6.00 24.11           C
ATOM   1063  CG  ARG A 141      -8.153  10.551  -8.308  6.00 19.20           C
ATOM   1064  CD  ARG A 141      -8.914   9.319  -7.796  6.00 21.53           C
ATOM   1065  NE  ARG A 141      -8.517   9.076  -6.403  7.00 20.93           N
ATOM   1066  CZ  ARG A 141      -9.142   8.234  -5.593  6.00 23.56           C
ATOM   1067  NH1 ARG A 141     -10.150   7.487  -6.019  7.00 19.04           N
ATOM   1068  NH2 ARG A 141      -8.725   8.129  -4.343  7.00 25.11           N
ATOM   1069  OXT ARG A 141      -9.233  14.024  -9.296  8.00 40.35           O
TER    1070      ARG A 141
HETATM 1071 FE   HEM A   1       8.128   7.371 -15.022 24.00 16.74          FE
HETATM 1072  CHA HEM A   1       8.617   7.879 -18.361  6.00 17.74           C
HETATM 1073  CHB HEM A   1      10.356  10.005 -14.319  6.00 18.92           C
HETATM 1074  CHC HEM A   1       8.307   6.456 -11.669  6.00 11.00           C
HETATM 1075  CHD HEM A   1       6.928   4.145 -15.725  6.00 13.25           C

最终结果..。

代码语言:javascript
复制
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name   | seq  | code1 | code2 | code3 | code4 | val1   | val2  | val3   | val4  | val5  | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
|      1 | ATOM   | 1058 | N     | ARG   | A     |   141 |  -6.47 | 12.04 | -10.35 |  7.00 | 19.11 | N     |
|      2 | ATOM   | 1059 | CA    | ARG   | A     |   141 |  -7.92 | 12.25 | -10.25 |  6.00 | 26.80 | C     |
|      3 | ATOM   | 1060 | C     | ARG   | A     |   141 |  -8.12 | 13.50 |  -9.39 |  6.00 | 28.93 | C     |
|      4 | ATOM   | 1061 | O     | ARG   | A     |   141 |  -7.11 | 13.97 |  -8.85 |  8.00 | 28.68 | O     |
|      5 | ATOM   | 1062 | CB    | ARG   | A     |   141 |  -8.64 | 11.01 |  -9.69 |  6.00 | 24.11 | C     |
|      6 | ATOM   | 1063 | CG    | ARG   | A     |   141 |  -8.15 | 10.55 |  -8.31 |  6.00 | 19.20 | C     |
|      7 | ATOM   | 1064 | CD    | ARG   | A     |   141 |  -8.91 |  9.32 |  -7.80 |  6.00 | 21.53 | C     |
|      8 | ATOM   | 1065 | NE    | ARG   | A     |   141 |  -8.52 |  9.08 |  -6.40 |  7.00 | 20.93 | N     |
|      9 | ATOM   | 1066 | CZ    | ARG   | A     |   141 |  -9.14 |  8.23 |  -5.59 |  6.00 | 23.56 | C     |
|     10 | ATOM   | 1067 | NH1   | ARG   | A     |   141 | -10.15 |  7.49 |  -6.02 |  7.00 | 19.04 | N     |
|     11 | ATOM   | 1068 | NH2   | ARG   | A     |   141 |  -8.73 |  8.13 |  -4.34 |  7.00 | 25.11 | N     |
|     12 | ATOM   | 1069 | OXT   | ARG   | A     |   141 |  -9.23 | 14.02 |  -9.30 |  8.00 | 40.35 | O     |
|     13 | TER    | 1070 | ARG   | A     | 141   |  NULL |   NULL |  NULL |   NULL |  NULL |  NULL | NULL  |
|     14 | HETATM | 1071 | FE    | HEM   | A     |     1 |   8.13 |  7.37 | -15.02 | 24.00 | 16.74 | FE    |
|     15 | HETATM | 1072 | CHA   | HEM   | A     |     1 |   8.62 |  7.88 | -18.36 |  6.00 | 17.74 | C     |
|     16 | HETATM | 1073 | CHB   | HEM   | A     |     1 |  10.36 | 10.01 | -14.32 |  6.00 | 18.92 | C     |
|     17 | HETATM | 1074 | CHC   | HEM   | A     |     1 |   8.31 |  6.46 | -11.67 |  6.00 | 11.00 | C     |
|     18 | HETATM | 1075 | CHD   | HEM   | A     |     1 |   6.93 |  4.15 | -15.73 |  6.00 | 13.25 | C     |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)

噢..。听着..。这一排使它肮脏..。TER 1070 ARG A 141。如果您使用我的路线,我可以很容易地修复这个问题,但是如果您使用了另一个答案/方法,我就不会费心去更新它了。

好吧..。为了愚蠢的争吵。我检查并统计了测试数据集中每个值的起始位置和长度。我不知道当你加载不同的文件或者什么的时候你的信息是否有变化.因此,我以一种可以为您使用的每个文件设置它的方式。

代码语言:javascript
复制
#!/usr/bin/perl
use strict;
use warnings;

use DBI;  

my $db   = 'test'; 
my $host = 'localhost'; 
my $user = 'root';
my $pass = ''; 

my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass) 
            or die "Connection Error: $DBI::errstr\n";

my $localpath = 'C:\path\to\datums'; 

                                # first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [[0,6],   #  'name'
                                [7,4],   #  'seq'
                                [12,4],  #  'code1'
                                [17,3],  #  'code2'
                                [21,1],  #  'code3'
                                [23,3],  #  'code4'
                                [27,12], #  'val1'
                                [39,7],  #  'val2'
                                [47,7],  #  'val3'
                                [55,5],  #  'val4'
                                [61,5],  #  'val5'
                                [69,10]  #  'code5'
                                ] 
               #  'second_file' => [ [0,5], # col1 
               #                     [6,5], # col2                     
               #                   ]  
               };  # i am using this as my table name, too  

foreach my $file (keys %$fileinfo) {
   my $filename = "$localpath\\$file.txt";  

   if (open(my $fh => $filename)) {   
      my $colnum = scalar @{$fileinfo->{$file}}; 

      my @placehoders; 
      for (1..$colnum) { push @placehoders, '?'; } 
      my $placeholders = join(',',@placehoders);   # builds a string like:  ?, ?, ?, ?, ...
                                                   # for our query that uses binding 

      # the null at the start of the insert is because my first column is an 
      # auto_incrememnt primary key that will be generated on insert by the db 
      my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");

      while(my $line = <$fh>) {
         $line =~ s/\s+$//;   # trim trailing whitespace 
         if ($line ne q{}) {  # if not totally blank  
            my @row;
            my $index = 1; 
            # foreach value column position & length 
            foreach my $col (@{$fileinfo->{$file}}) {   
               my $value;
               if ($col->[0] <= length($line)) {
                  $value = substr($line,$col->[0],$col->[1]); 
                  $value =~ s/^\s+|\s+$//g;  # trim trailing & leading whitespace
                  if ($value eq q{}) { undef $value; } # i like null values vs blank
               }
               $row[$index] = $value;
               $index++; 
            } 

            for my $index (1..$colnum) { 
               $stmt->bind_param($index, $row[$index]); 
            }
            $stmt->execute(); 
         }
      }
      close $fh; 
   }
   else {  print "$file not opened\n";  }
}  

新数据:

代码语言:javascript
复制
mysql> select * from  glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name   | seq  | code1 | code2 | code3 | code4 | val1   | val2  | val3   | val4  | val5  | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
|      1 | ATOM   | 1058 | N     | ARG   | A     |   141 |  -6.47 | 12.04 | -10.35 |  7.00 | 19.11 | N     |
|      2 | ATOM   | 1059 | CA    | ARG   | A     |   141 |  -7.92 | 12.25 | -10.25 |  6.00 | 26.80 | C     |
|      3 | ATOM   | 1060 | C     | ARG   | A     |   141 |  -8.12 | 13.50 |  -9.39 |  6.00 | 28.93 | C     |
|      4 | ATOM   | 1061 | O     | ARG   | A     |   141 |  -7.11 | 13.97 |  -8.85 |  8.00 | 28.68 | O     |
|      5 | ATOM   | 1062 | CB    | ARG   | A     |   141 |  -8.64 | 11.01 |  -9.69 |  6.00 | 24.11 | C     |
|      6 | ATOM   | 1063 | CG    | ARG   | A     |   141 |  -8.15 | 10.55 |  -8.31 |  6.00 | 19.20 | C     |
|      7 | ATOM   | 1064 | CD    | ARG   | A     |   141 |  -8.91 |  9.32 |  -7.80 |  6.00 | 21.53 | C     |
|      8 | ATOM   | 1065 | NE    | ARG   | A     |   141 |  -8.52 |  9.08 |  -6.40 |  7.00 | 20.93 | N     |
|      9 | ATOM   | 1066 | CZ    | ARG   | A     |   141 |  -9.14 |  8.23 |  -5.59 |  6.00 | 23.56 | C     |
|     10 | ATOM   | 1067 | NH1   | ARG   | A     |   141 | -10.15 |  7.49 |  -6.02 |  7.00 | 19.04 | N     |
|     11 | ATOM   | 1068 | NH2   | ARG   | A     |   141 |  -8.73 |  8.13 |  -4.34 |  7.00 | 25.11 | N     |
|     12 | ATOM   | 1069 | OXT   | ARG   | A     |   141 |  -9.23 | 14.02 |  -9.30 |  8.00 | 40.35 | O     |
|     13 | TER    | 1070 | NULL  | ARG   | A     |   141 |   NULL |  NULL |   NULL |  NULL |  NULL | NULL  |
|     14 | HETATM | 1071 | FE    | HEM   | A     |     1 |   8.13 |  7.37 | -15.02 | 24.00 | 16.74 | FE    |
|     15 | HETATM | 1072 | CHA   | HEM   | A     |     1 |   8.62 |  7.88 | -18.36 |  6.00 | 17.74 | C     |
|     16 | HETATM | 1073 | CHB   | HEM   | A     |     1 |  10.36 | 10.01 | -14.32 |  6.00 | 18.92 | C     |
|     17 | HETATM | 1074 | CHC   | HEM   | A     |     1 |   8.31 |  6.46 | -11.67 |  6.00 | 11.00 | C     |
|     18 | HETATM | 1075 | CHD   | HEM   | A     |     1 |   6.93 |  4.15 | -15.73 |  6.00 | 13.25 | C     |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
票数 4
EN

Stack Overflow用户

发布于 2014-10-10 17:44:21

您正在尝试将6600万个结果存储在一个数组中,正如您已经注意到的那样,这个数组速度慢,内存密集。Perl数组对于这样的大规模计算不是很好,但是PDL是。

问题的核心是计算多个三维坐标之间的距离。让我们首先对一个简化的数据集这样做,以证明我们可以这样做:

代码语言:javascript
复制
  Start         End
---------    ---------
(0, 0, 0)    (1, 2, 3)
(1, 1, 1)    (1, 1, 1)
(4, 5, 6)    (7, 8, 9)

我们可以像这样在PDL中表示这个数据集:

代码语言:javascript
复制
use PDL;
                  # x        # y        # z    
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

我们现在有两组3D坐标。为了计算距离,首先从结束坐标中减去起始坐标:

代码语言:javascript
复制
my $diff = $end - $start;
print $diff;

这输出

代码语言:javascript
复制
[
 [1 0 3]
 [2 0 3]
 [3 0 3]
]

X坐标的差异在第一行,y坐标的差异在第二行,z坐标的差异在第三行。

接下来,我们必须消除分歧:

代码语言:javascript
复制
my $squared = $diff**2;
print $squared;

这给了我们

代码语言:javascript
复制
[
 [1 0 9]
 [4 0 9]
 [9 0 9]
]

最后,我们需要对每一对点求和差异的平方,并取平方根:

代码语言:javascript
复制
foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

(也许有更好的方法来做到这一点,但我并没有经常使用PDL。)

这个打印出来

代码语言:javascript
复制
3.74165738677394
0
5.19615242270663

这是我们的距离。

把这一切结合在一起:

代码语言:javascript
复制
use strict;
use warnings;
use 5.010;

use PDL;
use PDL::NiceSlice;

my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end   = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];

my $diff = $end - $start;
my $squared = $diff**2;

foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

在我的桌面上计算100万对坐标之间的距离并将结果写入文件需要35秒。当我尝试一千万对的时候,我的内存就用完了,所以你可能不得不把你的数据分成几个部分。

从文件中读取数据

下面是一个示例,它使用前面问题中包含的示例输入,从两个文件读取数据:

代码语言:javascript
复制
use strict;
use warnings;
use 5.010;

use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;

my $start_file = 'start.txt';
my $end_file   = 'end.txt';

my $start = rcols $start_file, [ 5..7 ];
my $end   = rcols $end_file, [ 5..7 ];

my $diff = $end - $start;
my $squared = $diff**2;
   
foreach my $i (0 .. $squared->dim(0) - 1) {
    say sqrt sum $squared($i,:);
}

start.txt

代码语言:javascript
复制
ATOM      1  N   GLU    1     -19.992  -2.816  36.359  0.00  0.00      PROT
ATOM      2  HT1 GLU    1     -19.781  -1.880  35.958  0.00  0.00      PROT
ATOM      3  HT2 GLU    1     -19.713  -2.740  37.358  0.00  0.00      PROT
ATOM      4  HT3 GLU    1     -21.027  -2.910  36.393  0.00  0.00      PROT
ATOM      5  CA  GLU    1     -19.344  -3.944  35.652  0.00  0.00      PROT
ATOM      6  HA  GLU    1     -19.817  -4.852  35.998  0.00  0.00      PROT
ATOM      7  CB  GLU    1     -19.501  -3.795  34.119  0.00  0.00      PROT

end.txt

代码语言:javascript
复制
ATOM   2084  N   POPC   1     -44.763  27.962  20.983  0.00  0.00      MEM1  
ATOM   2085  C12 POPC   1     -46.144  27.379  20.551  0.00  0.00      MEM1  
ATOM   2086  C13 POPC   1     -44.923  28.611  22.367  0.00  0.00      MEM1  
ATOM   2087  C14 POPC   1     -43.713  26.889  21.099  0.00  0.00      MEM1  
ATOM   2088  C15 POPC   1     -44.302  29.004  20.059  0.00  0.00      MEM1  
ATOM   2089 H12A POPC   1     -46.939  28.110  20.555  0.00  0.00      MEM1  
ATOM   2090 H12B POPC   1     -46.486  26.769  21.374  0.00  0.00      MEM1  

输出

代码语言:javascript
复制
42.3946824613654
42.2903357636233
42.9320321205507
40.4541893133455
44.1770768272415
45.3936402704167
42.7174829080553

rcols函数来自PDL::IO::Misc,可用于将文件中的特定列读入PDL对象(在本例中,列5至7,零索引)。

票数 5
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/26303965

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档