我的当前脚本通过PDB中的数据运行,并将它们存储到数组中,然后将这些数组用于脚本的其余部分。这个脚本在一个小PDB文件上运行得很好,但是当我使用一个真正的PDB文件时,我只使用一个PDB文件上的所有计算机内存。我有2000份PDB文件,我需要计算。
这是我目前的完整剧本,有几个笔记。
完整脚本:
#!/usr/bin/perl
use warnings;
use strict;
#my $inputfile = $ARGV[0];
#my $inputfile = '8ns_emb_alt_test.pdb';
my $inputfile = '8ns_emb_alt_101.pdb';
open( INPUTFILE, "<", $inputfile ) or die $!;
my @array = <INPUTFILE>;
### Protein
my $protein = 'PROT';
my @protx;
my @proty;
my @protz;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( ( $array[$line] =~ m/\s+$protein\s+/ ) ) {
chomp $array[$line];
my @splitline = ( split /\s+/, $array[$line] );
push @protx, $splitline[5]; #This has 2083 x-coordinates
push @proty, $splitline[6]; #this has 2083 y-coordinates
push @protz, $splitline[7]; #this has 2083 z-coordinates
}
}
### Lipid
my $lipid1 = 'POPS';
my $lipid2 = 'POPC';
my @lipidx;
my @lipidy;
my @lipidz;
for ( my $line = 0; $line <= $#array; ++$line ) {
if ( ( $array[$line] =~ m/\s+$lipid1\s+/ ) || ( $array[$line] =~ m/\s+$lipid2\s+/ ) ) {
chomp $array[$line];
my @splitline = ( split /\s+/, $array[$line] );
push @lipidx, $splitline[5]; #this has approximately 35,000 x coordinates
push @lipidy, $splitline[6]; #same as above for y
push @lipidz, $splitline[7]; #same as above for z
}
}
### Calculation
my @deltaX = map {
my $diff = $_;
map { $diff - $_ } @lipidx
} @protx; #so this has 2083*35000 data x-coordinates
my @squared_deltaX = map { $_ * $_ } @deltaX; #this has all the x-coordinates squared from @deltaX
my @deltaY = map {
my $diff = $_;
map { $diff - $_ } @lipidy
} @proty;
my @squared_deltaY = map { $_ * $_ } @deltaY;
my @deltaZ = map {
my $diff = $_;
map { $diff - $_ } @lipidz
} @protz;
my @squared_deltaZ = map { $_ * $_ } @deltaZ;
my @distance;
for ( my $ticker = 0; $ticker <= $#array; ++$ticker ) {
my $distance_calc = sqrt( ( $squared_deltaX[$ticker] + $squared_deltaY[$ticker] + $squared_deltaZ[$ticker] ) );
push @distance, $distance_calc;
} #this runs the final calculation and computes all the distances between the atoms
### The Hunt
my $limit = 5;
my @DistU50;
my @resid_tagger;
for ( my $tracker = 0; $tracker <= $#array; ++$tracker ) {
my $dist = $distance[$tracker];
if ( ( $dist < $limit ) && ( $array[$tracker] =~ m/\s+$protein\s+/ ) ) {
my @splitline = ( split /\s+/, $array[$tracker] );
my $LT50 = $dist;
push @resid_tagger, $splitline[4]; #selects stores a selected index number
push @DistU50, $LT50; #stores the values within the $limit
}
} #this 'for' loop search for all the elements in the '@distance' and pushes them to the final arrays and also makes an array of certain index numbers in to another array.
### Le'Finali
print "@resid_tagger = resid \n";
print "5 > @DistU50 \n";
close INPUTFILE;我的一位实验室朋友说,我可以将一些数据存储在文件中,这样就可以减少内存。我认为这是个好主意,但我不知道哪里是最有效率的地方,我要做多少次。我这样做是因为这是我知道如何做到这一点的最好方法。
如果有人能给我举一个例子,说明我可以把数组转换成文件,然后再次使用该文件中的数据,那将是非常有帮助的。否则,如果有人有一个想法,我可以查,事情可以尝试或只是建议,至少会让我开始在某个地方。
发布于 2014-10-10 17:48:05
编辑.
你们两个..。我们应该先检查一下。您可能需要查看已经存在的用于处理和操作PDB数据的perl模块。
好吧..。所以perl不是我的第一语言,我也不知道你的数据是什么样子的。
编辑:--我的测试数据中有奇怪的行。这里有两套代码..。一个在空格上拆分,另一个使用预期/已知的列位置和长度来确定值。
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\folder\with\datums';
my @filenames = ('glucagon'); # i am using this as my table name, too
my $colnum = 12; # number of columns in data, I assumed this was fixed
my @placehoders;
for (1..$colnum) { push @placehoders, '?'; }
my $placeholders = join(',',@placehoders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
foreach my $file (@filenames) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
# the null at the start of the insert is because my first column is an
# auto_incrememnt primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim whitespace
if ($line ne q{}) { # if not totally blank
my @row = split(/ +/,$line); # split on whitespace
for my $index (1..$colnum) {
$stmt->bind_param($index, $row[$index]);
$index++;
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
} -- i didn't know appropriate names for any of it
create table glucagon (
row_id int unsigned auto_increment primary key,
name varchar(10),
seq int,
code1 varchar(5),
code2 varchar(5),
code3 varchar(5),
code4 int,
val1 decimal(10,2),
val2 decimal(10,2),
val3 decimal(10,2),
val4 decimal(10,2),
val5 decimal(10,2),
code5 varchar(5)
)在C:\path\to\folder\with\datums\glucagon.txt中可以找到以下内容
ATOM 1058 N ARG A 141 -6.466 12.036 -10.348 7.00 19.11 N
ATOM 1059 CA ARG A 141 -7.922 12.248 -10.253 6.00 26.80 C
ATOM 1060 C ARG A 141 -8.119 13.499 -9.393 6.00 28.93 C
ATOM 1061 O ARG A 141 -7.112 13.967 -8.853 8.00 28.68 O
ATOM 1062 CB ARG A 141 -8.639 11.005 -9.687 6.00 24.11 C
ATOM 1063 CG ARG A 141 -8.153 10.551 -8.308 6.00 19.20 C
ATOM 1064 CD ARG A 141 -8.914 9.319 -7.796 6.00 21.53 C
ATOM 1065 NE ARG A 141 -8.517 9.076 -6.403 7.00 20.93 N
ATOM 1066 CZ ARG A 141 -9.142 8.234 -5.593 6.00 23.56 C
ATOM 1067 NH1 ARG A 141 -10.150 7.487 -6.019 7.00 19.04 N
ATOM 1068 NH2 ARG A 141 -8.725 8.129 -4.343 7.00 25.11 N
ATOM 1069 OXT ARG A 141 -9.233 14.024 -9.296 8.00 40.35 O
TER 1070 ARG A 141
HETATM 1071 FE HEM A 1 8.128 7.371 -15.022 24.00 16.74 FE
HETATM 1072 CHA HEM A 1 8.617 7.879 -18.361 6.00 17.74 C
HETATM 1073 CHB HEM A 1 10.356 10.005 -14.319 6.00 18.92 C
HETATM 1074 CHC HEM A 1 8.307 6.456 -11.669 6.00 11.00 C
HETATM 1075 CHD HEM A 1 6.928 4.145 -15.725 6.00 13.25 C最终结果..。
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
18 rows in set (0.00 sec)噢..。听着..。这一排使它肮脏..。TER 1070 ARG A 141。如果您使用我的路线,我可以很容易地修复这个问题,但是如果您使用了另一个答案/方法,我就不会费心去更新它了。
好吧..。为了愚蠢的争吵。我检查并统计了测试数据集中每个值的起始位置和长度。我不知道当你加载不同的文件或者什么的时候你的信息是否有变化.因此,我以一种可以为您使用的每个文件设置它的方式。
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $db = 'test';
my $host = 'localhost';
my $user = 'root';
my $pass = '';
my $dbh = DBI->connect("dbi:mysql:$db:$host",$user,$pass)
or die "Connection Error: $DBI::errstr\n";
my $localpath = 'C:\path\to\datums';
# first num is starting pos, second is length
my $fileinfo = { 'glucagon' => [[0,6], # 'name'
[7,4], # 'seq'
[12,4], # 'code1'
[17,3], # 'code2'
[21,1], # 'code3'
[23,3], # 'code4'
[27,12], # 'val1'
[39,7], # 'val2'
[47,7], # 'val3'
[55,5], # 'val4'
[61,5], # 'val5'
[69,10] # 'code5'
]
# 'second_file' => [ [0,5], # col1
# [6,5], # col2
# ]
}; # i am using this as my table name, too
foreach my $file (keys %$fileinfo) {
my $filename = "$localpath\\$file.txt";
if (open(my $fh => $filename)) {
my $colnum = scalar @{$fileinfo->{$file}};
my @placehoders;
for (1..$colnum) { push @placehoders, '?'; }
my $placeholders = join(',',@placehoders); # builds a string like: ?, ?, ?, ?, ...
# for our query that uses binding
# the null at the start of the insert is because my first column is an
# auto_incrememnt primary key that will be generated on insert by the db
my $stmt = $dbh->prepare("insert into $file values (null, $placeholders)");
while(my $line = <$fh>) {
$line =~ s/\s+$//; # trim trailing whitespace
if ($line ne q{}) { # if not totally blank
my @row;
my $index = 1;
# foreach value column position & length
foreach my $col (@{$fileinfo->{$file}}) {
my $value;
if ($col->[0] <= length($line)) {
$value = substr($line,$col->[0],$col->[1]);
$value =~ s/^\s+|\s+$//g; # trim trailing & leading whitespace
if ($value eq q{}) { undef $value; } # i like null values vs blank
}
$row[$index] = $value;
$index++;
}
for my $index (1..$colnum) {
$stmt->bind_param($index, $row[$index]);
}
$stmt->execute();
}
}
close $fh;
}
else { print "$file not opened\n"; }
} 新数据:
mysql> select * from glucagon;
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| row_id | name | seq | code1 | code2 | code3 | code4 | val1 | val2 | val3 | val4 | val5 | code5 |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+
| 1 | ATOM | 1058 | N | ARG | A | 141 | -6.47 | 12.04 | -10.35 | 7.00 | 19.11 | N |
| 2 | ATOM | 1059 | CA | ARG | A | 141 | -7.92 | 12.25 | -10.25 | 6.00 | 26.80 | C |
| 3 | ATOM | 1060 | C | ARG | A | 141 | -8.12 | 13.50 | -9.39 | 6.00 | 28.93 | C |
| 4 | ATOM | 1061 | O | ARG | A | 141 | -7.11 | 13.97 | -8.85 | 8.00 | 28.68 | O |
| 5 | ATOM | 1062 | CB | ARG | A | 141 | -8.64 | 11.01 | -9.69 | 6.00 | 24.11 | C |
| 6 | ATOM | 1063 | CG | ARG | A | 141 | -8.15 | 10.55 | -8.31 | 6.00 | 19.20 | C |
| 7 | ATOM | 1064 | CD | ARG | A | 141 | -8.91 | 9.32 | -7.80 | 6.00 | 21.53 | C |
| 8 | ATOM | 1065 | NE | ARG | A | 141 | -8.52 | 9.08 | -6.40 | 7.00 | 20.93 | N |
| 9 | ATOM | 1066 | CZ | ARG | A | 141 | -9.14 | 8.23 | -5.59 | 6.00 | 23.56 | C |
| 10 | ATOM | 1067 | NH1 | ARG | A | 141 | -10.15 | 7.49 | -6.02 | 7.00 | 19.04 | N |
| 11 | ATOM | 1068 | NH2 | ARG | A | 141 | -8.73 | 8.13 | -4.34 | 7.00 | 25.11 | N |
| 12 | ATOM | 1069 | OXT | ARG | A | 141 | -9.23 | 14.02 | -9.30 | 8.00 | 40.35 | O |
| 13 | TER | 1070 | NULL | ARG | A | 141 | NULL | NULL | NULL | NULL | NULL | NULL |
| 14 | HETATM | 1071 | FE | HEM | A | 1 | 8.13 | 7.37 | -15.02 | 24.00 | 16.74 | FE |
| 15 | HETATM | 1072 | CHA | HEM | A | 1 | 8.62 | 7.88 | -18.36 | 6.00 | 17.74 | C |
| 16 | HETATM | 1073 | CHB | HEM | A | 1 | 10.36 | 10.01 | -14.32 | 6.00 | 18.92 | C |
| 17 | HETATM | 1074 | CHC | HEM | A | 1 | 8.31 | 6.46 | -11.67 | 6.00 | 11.00 | C |
| 18 | HETATM | 1075 | CHD | HEM | A | 1 | 6.93 | 4.15 | -15.73 | 6.00 | 13.25 | C |
+--------+--------+------+-------+-------+-------+-------+--------+-------+--------+-------+-------+-------+发布于 2014-10-10 17:44:21
您正在尝试将6600万个结果存储在一个数组中,正如您已经注意到的那样,这个数组速度慢,内存密集。Perl数组对于这样的大规模计算不是很好,但是PDL是。
问题的核心是计算多个三维坐标之间的距离。让我们首先对一个简化的数据集这样做,以证明我们可以这样做:
Start End
--------- ---------
(0, 0, 0) (1, 2, 3)
(1, 1, 1) (1, 1, 1)
(4, 5, 6) (7, 8, 9)我们可以像这样在PDL中表示这个数据集:
use PDL;
# x # y # z
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];我们现在有两组3D坐标。为了计算距离,首先从结束坐标中减去起始坐标:
my $diff = $end - $start;
print $diff;这输出
[
[1 0 3]
[2 0 3]
[3 0 3]
]X坐标的差异在第一行,y坐标的差异在第二行,z坐标的差异在第三行。
接下来,我们必须消除分歧:
my $squared = $diff**2;
print $squared;这给了我们
[
[1 0 9]
[4 0 9]
[9 0 9]
]最后,我们需要对每一对点求和差异的平方,并取平方根:
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}(也许有更好的方法来做到这一点,但我并没有经常使用PDL。)
这个打印出来
3.74165738677394
0
5.19615242270663这是我们的距离。
把这一切结合在一起:
use strict;
use warnings;
use 5.010;
use PDL;
use PDL::NiceSlice;
my $start = pdl [ [0, 1, 4], [0, 1, 5], [0, 1, 6] ];
my $end = pdl [ [1, 1, 7], [2, 1, 8], [3, 1, 9] ];
my $diff = $end - $start;
my $squared = $diff**2;
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}在我的桌面上计算100万对坐标之间的距离并将结果写入文件需要35秒。当我尝试一千万对的时候,我的内存就用完了,所以你可能不得不把你的数据分成几个部分。
从文件中读取数据
下面是一个示例,它使用前面问题中包含的示例输入,从两个文件读取数据:
use strict;
use warnings;
use 5.010;
use PDL;
use PDL::IO::Misc;
use PDL::NiceSlice;
my $start_file = 'start.txt';
my $end_file = 'end.txt';
my $start = rcols $start_file, [ 5..7 ];
my $end = rcols $end_file, [ 5..7 ];
my $diff = $end - $start;
my $squared = $diff**2;
foreach my $i (0 .. $squared->dim(0) - 1) {
say sqrt sum $squared($i,:);
}start.txt
ATOM 1 N GLU 1 -19.992 -2.816 36.359 0.00 0.00 PROT
ATOM 2 HT1 GLU 1 -19.781 -1.880 35.958 0.00 0.00 PROT
ATOM 3 HT2 GLU 1 -19.713 -2.740 37.358 0.00 0.00 PROT
ATOM 4 HT3 GLU 1 -21.027 -2.910 36.393 0.00 0.00 PROT
ATOM 5 CA GLU 1 -19.344 -3.944 35.652 0.00 0.00 PROT
ATOM 6 HA GLU 1 -19.817 -4.852 35.998 0.00 0.00 PROT
ATOM 7 CB GLU 1 -19.501 -3.795 34.119 0.00 0.00 PROTend.txt
ATOM 2084 N POPC 1 -44.763 27.962 20.983 0.00 0.00 MEM1
ATOM 2085 C12 POPC 1 -46.144 27.379 20.551 0.00 0.00 MEM1
ATOM 2086 C13 POPC 1 -44.923 28.611 22.367 0.00 0.00 MEM1
ATOM 2087 C14 POPC 1 -43.713 26.889 21.099 0.00 0.00 MEM1
ATOM 2088 C15 POPC 1 -44.302 29.004 20.059 0.00 0.00 MEM1
ATOM 2089 H12A POPC 1 -46.939 28.110 20.555 0.00 0.00 MEM1
ATOM 2090 H12B POPC 1 -46.486 26.769 21.374 0.00 0.00 MEM1 输出
42.3946824613654
42.2903357636233
42.9320321205507
40.4541893133455
44.1770768272415
45.3936402704167
42.7174829080553rcols函数来自PDL::IO::Misc,可用于将文件中的特定列读入PDL对象(在本例中,列5至7,零索引)。
https://stackoverflow.com/questions/26303965
复制相似问题