Re: how to merge many files of sorted hashes?
Thanks for your reply!
I start with 100 points, each with x, y, z coordinates:
4.941 32.586 1.772
15.354 22.823 10.556
0.495 12.345 98.234
...
$block_size = 1000;
$counter = 0;
%small_hash = ();
foreach @triplet of the 100 points {
    $counter++;
    my @matrix = ortho(@triplet);
    foreach $point in the 100 points {
        my $key = binning(@matrix applied to $point);
        push @{$small_hash{$key}}, @matrix;
    }
    # binning causes many points to fall in the same grid cell,
    # thus the array of values (matrices) for each key
    if ($counter % $block_size == 0) {
        write out %small_hash;
        %small_hash = (); # frees the memory
    }
}
# now merge the small hash files
foreach $key (sort keys %master_hash) { # keys for master_hash stored earlier
    loop thru all small files {
        while (<FILE>) {
            if ($_ =~ /^$key/) {
                write values to the master output file;
                last;
            }
        } # while
    } # go thru files
}
So that's roughly the structure (the actual code is too long to post here). The first half is really fast to compute, but once I start merging files it takes forever. Thanks a lot!
Re^3: how to merge many files of sorted hashes? by GrandFather (Sage) on Feb 02, 2012 at 21:08 UTC 
All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.
Without knowing what you want to eventually do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.
True laziness is hard work

Sorry I have not made it clearer. My actual hash looks like this:
3_2_1 => 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205_0.538_0.876_0.100 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205_0.538_0.876_0.100 ...
Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'.
I construct a key-value pair as follows: given a defined 3*4 transformation matrix (translation + rotation), I apply it to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01, 1.98, 0.87), which are discretized to (3, 2, 1). Then (3,2,1) is used as a key that points to that transformation matrix.
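The discretization step on its own looks something like this (a sketch; the helper name is mine, and it rounds to the nearest integer to match the (3.01, 1.98, 0.87) → (3, 2, 1) example, whereas the posted script bins with int($_/$grid) instead):

```perl
use strict;
use warnings;

# Round each transformed coordinate to the nearest integer and join the
# bins into a hash key (hypothetical helper, not from the posted script).
sub coord_to_key {
    my @xyz = @_;
    return join '_', map { sprintf '%.0f', $_ } @xyz;
}

print coord_to_key(3.01, 1.98, 0.87), "\n";  # 3_2_1
```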
Here's an example script with simplified calculations:
use strict;
use warnings;

my $input = $ARGV[0];
open(INFILE, $input) or die "cannot open file $input!\n";
my $output = $ARGV[1];
my %total_hash_keys = ();
my %tri_hash = ();
#set bin width
my $grid = 2;
my $step = 0;
my $block_size = 10000;
my $block_no = 0;
my @points;
while(<INFILE>){ chomp; push @points, [split /\t/]; }
close(INFILE);
#construct hash table
for(my $i = 0; $i < @points; $i++){
    for(my $j = $i + 1; $j < @points; $j++){
        for(my $k = $j + 1; $k < @points; $k++){
            $step++;
            my @pt1 = @{$points[$i]}[0..2];
            my @pt2 = @{$points[$j]}[0..2];
            my @pt3 = @{$points[$k]}[0..2];
            #simplified calculation for the value of the hash
            my @matrix = (@pt1, @pt2, @pt3);
            for(my $res = 0; $res < @points; $res++){
                #transform coor, and bin the new coor as a generated key
                my @old_xyz = @{$points[$res]};
                my @new_xyz = transform(@old_xyz, @matrix);
                foreach(@new_xyz){ $_ = int($_ / $grid); }
                my $key = join '_', @new_xyz;
                my $value = join '_', map { sprintf "%.3f", $_ } @matrix;
                $total_hash_keys{$key} = 0;
                push @{$tri_hash{$key}}, $value;
            }
            if(($step % $block_size) == 0){ #write this block to a disk file
                $block_no = $step / $block_size;
                my $tmp_hash_file = "tmp_hash" . $block_no;
                open(OUTFILE, ">$tmp_hash_file")
                    or die "cannot write to file $tmp_hash_file!\n";
                foreach(keys %tri_hash){
                    print OUTFILE "$_\t@{$tri_hash{$_}}\n";
                }
                close(OUTFILE);
                %tri_hash = (); #free memory
            }
        }#for k
    }#for j
}#for i
if(%tri_hash){ #flush the final partial block, or those triplets are lost
    $block_no++;
    my $tmp_hash_file = "tmp_hash" . $block_no;
    open(OUTFILE, ">$tmp_hash_file")
        or die "cannot write to file $tmp_hash_file!\n";
    foreach(keys %tri_hash){
        print OUTFILE "$_\t@{$tri_hash{$_}}\n";
    }
    close(OUTFILE);
    %tri_hash = ();
}
my $total_file_no = $block_no;
open(OUTFILE, ">$output") or die "cannot write to file $output!\n";
while(my ($my_key, $my_value) = each %total_hash_keys){
    print OUTFILE $my_key . "=>";
    for(my $i = 1; $i <= $total_file_no; $i++){
        my $hash_file = "tmp_hash" . $i;
        open(INFILE, $hash_file) or die "cannot open file $hash_file!\n";
        while(<INFILE>){
            my @array = split(/\t/, $_);
            if($array[0] eq $my_key){
                chomp $array[1];
                print OUTFILE "$array[1] "; #space-separate values from different files
                last;
            }
        }
        close(INFILE);
    }
    print OUTFILE "\n";
}
close(OUTFILE);

sub transform{
    my ($x, $y, $z, @t) = @_;
    my $new_x = $x*$t[0] + $y*$t[3] + $z*$t[6];
    my $new_y = $x*$t[1] + $y*$t[4] + $z*$t[7];
    my $new_z = $x*$t[2] + $y*$t[5] + $z*$t[8];
    return ($new_x, $new_y, $new_z);
}

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=delme.sqlite', '', '');
$dbh->do('CREATE TABLE Bins (Xk INTEGER, Yk INTEGER, Zk INTEGER, Data TEXT)');
my $sql = 'INSERT INTO Bins (Xk, Yk, Zk, Data) VALUES (?, ?, ?, ?)';
my $sth = $dbh->prepare($sql);
while (defined (my $data = <DATA>)) {
    my ($xKey, $yKey, $zKey) = split ' ', $data;
    chomp $data;
    $sth->execute((map {int} $xKey, $yKey, $zKey), $data);
}
$sql = 'SELECT * FROM Bins ORDER BY Xk, Yk, Zk';
$sth = $dbh->prepare($sql);
$sth->execute();
while (my $row = $sth->fetchrow_hashref()) {
    print "$row->{Xk}, $row->{Yk}, $row->{Zk} => $row->{Data}\n";
}
__DATA__
4.941 32.586 1.772 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205
15.354 22.823 10.556 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205
0.495 12.345 98.234 0.382_0.041_0.205_28.745_0.883_0.012_0.005_0.382_0.041
Prints:
0, 12, 98 => 0.495 12.345 98.234 0.382_0.041_0.205_28.745_0.883_0.012_0.005_0.382_0.041
4, 32, 1 => 4.941 32.586 1.772 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205
15, 22, 10 => 15.354 22.823 10.556 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205
True laziness is hard work

Re^3: how to merge many files of sorted hashes? by wrog (Pilgrim) on Feb 02, 2012 at 23:25 UTC 
So let's see if I'm reading this right:
- you have 100 points
- for every possible triple (1,000,000 combinations) you're computing a matrix
- each matrix gets pushed 100 times (onto 100 different keys)
- you start a new file every 1000 matrices (so there'll be 1000 files when you're done, and each file will have 100,000 matrices)
But you don't say how many distinct keys there are (i.e., how finely binning is grouping things).
It doesn't matter, though. For every single key you are doing a linear search through every file. So that's (number of keys) * 100,000,000 matrices to wade through; no wonder this is slow.
You need to be writing each %small_hash in sorted order. Sorting 1000 small hashes will be faster than sorting the big hash
(1000 * (n/1000) * log(n/1000) = n * log(n/1000) < n * log(n)),
but the really important thing is that this will enable the merge I was talking about before, which only reads each file once rather than once for every key value.
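The single-pass merge described above can be sketched like this (the helper name is mine; it assumes each tmp file was written one key per line, tab-separated, with the keys pre-sorted in the same string order across all files, e.g. via sort keys):

```perl
use strict;
use warnings;

# Merge N files of "key\tvalues" lines, each already sorted by key,
# reading every file exactly once (a classic k-way merge).
sub merge_sorted_files {
    my ($out_path, @in_paths) = @_;
    my @fhs;
    for my $path (@in_paths) {
        open my $fh, '<', $path or die "cannot open $path: $!";
        push @fhs, $fh;
    }
    # Prime one pending line per file; undef marks an exhausted file.
    my @pending = map { scalar readline($_) } @fhs;
    open my $out, '>', $out_path or die "cannot write $out_path: $!";
    while (grep { defined } @pending) {
        # Find the smallest key among the pending lines.
        my $min;
        for my $i (0 .. $#pending) {
            next unless defined $pending[$i];
            my ($key) = split /\t/, $pending[$i];
            $min = $key if !defined $min || $key lt $min;
        }
        # Emit every file's values for that key and advance those files.
        print $out $min;
        for my $i (0 .. $#pending) {
            next unless defined $pending[$i];
            my ($key, $vals) = split /\t/, $pending[$i], 2;
            next unless $key eq $min;
            chomp $vals;
            print $out "\t$vals";
            $pending[$i] = scalar readline($fhs[$i]);
        }
        print $out "\n";
    }
    close $out;
}
```

Each step scans one pending line per file, so the cost is O(files) per distinct key; with ~1000 files that is already vastly cheaper than rescanning every file for every key, and a priority queue over the pending keys could cut it further.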

