thanks for your reply!
i start with 100 points each with x,y,z coordinates.
`
4.941 32.586 -1.772
15.354 22.823 10.556
-0.495 12.345 98.234
...
$block_size = 1000;
$counter = 0;
%small_hash = ();
foreach @triplet of the 100 points{
$counter++;
my @matrix = ortho(@triplet);
foreach $point in the 100 points{
my $key = binning(@matrix * $point);
push @{$small_hash{$key}},@matrix;
}
#binning causes many points to fall in the same grid,
#thus the array of values (matrices) for each key
if($counter % $block_size == 0){
write out %small_hash;
%small_hash = (); #frees the memory
}
}
# now merging small hash files
foreach $key (sort keys %master_hash){#keys for master_hash stored
loop thru all small files{
while(<FILE>){
if($_ =~ /^$key/){
write values to the master output file;
last;
}#if
}#while
}#go thru files
}
`
so that's roughly the structure (actual code too long to post here). the first half is really fast to compute, but once i start merging files it takes forever. thanks a lot! | [reply] [d/l] |

All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.
Without know what you want to eventually do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.
True laziness is hard work
| [reply] |

sorry i have not made it clearer. my actual hash looks like this:
`3_2_-1 => -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.20
+5_-0.538_-0.876_0.100 -56.368_2.583_-28.745_0.883_-0.012_0.005_-
+0.382_0.041_0.205_-0.538_-0.876_0.100 ...
`
each element in the array of values is the entries of a 3*4 matrix, and each key can point to a few hundred of such 'matrices'.
how i construct a key-value pair is as such: given a defined 3*4 matrix, i apply this transformation matrix (translation+rotation) to the coordinates of each of the 100 points, and obtain the new coor of that point, say (3.01,1.98,-0.87), which is discretized to be (3,2,-1). then (3,2,-1) is used as a key that points to such a transformation matrix.
here's an example script with simplified calculations.
`my $input = $ARGV[0];
open(INFILE,"$input") or die "cannot open file $input!\n";
my $output = $ARGV[1];
my %total_hash_keys=();
my %tri_hash = ();
#set bin width
my $grid = 2;
my $step = 0;
my $block_size = 10000;
my $block_no = 0;
my @points;
while(<INFILE>){my @array = split(/\t/,$_); push @points, [@array];}
close(INFILE);
#construct hash table
for(my $i=0;$i<@points;$i++){
for(my $j=$i+1;$j<@points;$j++){
for(my $k=$j+1;$k<@points;$k++){
$step++;
my @pt1 = (${$points[$i]}[0],${$points[$i]}[1],${$points[$i]}[2]
+);
my @pt2 = (${$points[$j]}[0],${$points[$j]}[1],${$points[$j]}[2]
+);
my @pt3 = (${$points[$k]}[0],${$points[$k]}[1],${$points[$k]}[2]
+;
#simplified calculation for the value of the hash;
my @matrix = (@pt1,@pt2,@pt3);
for(my $res=0;$res<@points;$res++){
#transform coor, and bin the new coor as a generated key
my @old_xyz = @{$points[$res]};
my @new_xyz = transform(@old_xyz,@matrix);
foreach(@new_xyz){$_ = int($_/$grid); }
my $key = $new_xyz[0]."_".$new_xyz[1]."_".$new_xyz[2];
foreach(@matrix){$_ = sprintf "%.3f",$_;}
my $value = "";
for(my $temp=0;$temp<@matrix;$temp++){$value .= $matrix[$temp]
+."_"; }
$total_hash_keys{$key}=0;
push @{$tri_hash{$key}},$value;
}
if(($step % $block_size) == 0){#write to disk file
$block_no = int($step/$block_size);
my $tmp_hash_file = "tmp_hash".$block_no;
open(OUTFILE,">$tmp_hash_file") or die "cannot write to file $
+tmp_hash_file!\n";
foreach(keys %tri_hash){
print OUTFILE "$_\t";
print OUTFILE "@{$tri_hash{$_}}\n";
}
%tri_hash = ();#free memory
}
}#for k
}#for j
}#for i
my $total_file_no = int($step/$block_size);
open(OUTFILE,">$output") or die "cannot write to file $output!\n";
while(($my_key,$my_value)=each %total_hash_keys){
print OUTFILE $my_key."=>";
for(my $i=1;$i<$total_file_no + 1;$i++){
my $hash_file = "tmp_hash".$i; open(INFILE,"$hash_file") or die;
while(<INFILE>){
my @array = split(/\t/,$_);
if($array[0] eq $my_key){
chomp ($array[1]);
print OUTFILE $array[1];
last;
}
}
close(INFILE);
}
print OUTFILE "\n";
}
sub transform{
my ($x,$y,$z,@t) = @_;
my $new_x=$x*$t[0]+$y*$t[3]+$z*$t[6];
my $new_y=$x*$t[1]+$y*$t[4]+$z*$t[7];
my $new_z=$x*$t[2]+$y*$t[5]+$z*$t[8];
return ($new_x,$new_y,$new_z);
}
`
| [reply] [d/l] [select] |

so let's see if I'm reading this right:
- you have 100 points
- for every possible triple (1,000,000 combinations) you're computing a matrix
- each matrix gets pushed 100 times (onto 100 different keys)
- you start a new file every 1000 matrices (so there'll be 1000 files when you're done, and each file will have 100,000 matrices.
but you don't say how many distinct keys there are (i.e., how finely `binning` is grouping things).
But it doesn't matter. For **every single key** you are doing a **linear search** through **every file**. So that's (number of keys) * 100,000,000 matrices to wade through; no wonder this is slow.
You need to be writing each `%small_hash` in sorted order. Sorting 1000 small_hashes will be faster than sorting the big hash
`(1000*(n/1000)*log(n/1000) = n log(n/1000) < n log(n)`
but the really important thing is this will enable the merge I was talking about before, which only reads each file **once** rather than once for every key value.
| [reply] [d/l] [select] |

Comment onRe: how to merge many files of sorted hashes?