in reply to Re^3: how to merge many files of sorted hashes? in thread how to merge many files of sorted hashes?
Sorry I did not make this clearer. My actual hash looks like this:
3_2_1 => 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205_0.538_0.876_0.100 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205_0.538_0.876_0.100 ...
Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'.
This is how I construct a key-value pair: given a defined 3*4 transformation matrix (translation+rotation), I apply it to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01,1.98,0.87), which are discretized to (3,2,1). Then (3,2,1) is used as a key that points to that transformation matrix.
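A minimal sketch of the discretization step, for illustration only. The example (3.01,1.98,0.87) -> (3,2,1) implies rounding to the nearest bin, whereas int() truncates toward zero. The helper name bin_key and the round-to-nearest rule are assumptions, not part of the actual script.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: turn a coordinate triple into a string key by
# rounding each coordinate to the nearest multiple of the bin width $grid.
sub bin_key {
    my ($grid, @xyz) = @_;
    return join '_', map { floor($_ / $grid + 0.5) } @xyz;
}

# Rounding reproduces the example from the post.
print bin_key(1, 3.01, 1.98, 0.87), "\n";                      # 3_2_1

# Truncation with int() gives a different key for the same point.
print join('_', map { int($_ / 1) } 3.01, 1.98, 0.87), "\n";   # 3_1_0
```

Note that int() truncates toward zero, so values in (-1,1) all land in bin 0, which makes that bin twice as wide as the others. floor() avoids this.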
here's an example script with simplified calculations.
my $input = $ARGV[0];
open(INFILE,"$input") or die "cannot open file $input!\n";
my $output = $ARGV[1];
my %total_hash_keys=();
my %tri_hash = ();
#set bin width
my $grid = 2;
my $step = 0;
my $block_size = 10000;
my $block_no = 0;
my @points;
while(<INFILE>){my @array = split(/\t/,$_); push @points, [@array];}
close(INFILE);
#construct hash table
for(my $i=0;$i<@points;$i++){
    for(my $j=$i+1;$j<@points;$j++){
        for(my $k=$j+1;$k<@points;$k++){
            $step++;
            my @pt1 = (${$points[$i]}[0],${$points[$i]}[1],${$points[$i]}[2]);
            my @pt2 = (${$points[$j]}[0],${$points[$j]}[1],${$points[$j]}[2]);
            my @pt3 = (${$points[$k]}[0],${$points[$k]}[1],${$points[$k]}[2]);
            #simplified calculation for the value of the hash;
            my @matrix = (@pt1,@pt2,@pt3);
            for(my $res=0;$res<@points;$res++){
                #transform coor, and bin the new coor as a generated key
                #(pass only x,y,z so extra input columns cannot shift the matrix entries)
                my @old_xyz = @{$points[$res]}[0..2];
                my @new_xyz = transform(@old_xyz,@matrix);
                foreach(@new_xyz){$_ = int($_/$grid); }
                my $key = $new_xyz[0]."_".$new_xyz[1]."_".$new_xyz[2];
                foreach(@matrix){$_ = sprintf "%.3f",$_;}
                my $value = "";
                for(my $temp=0;$temp<@matrix;$temp++){$value .= $matrix[$temp]."_"; }
                $total_hash_keys{$key}=0;
                push @{$tri_hash{$key}},$value;
            }
            if(($step % $block_size) == 0){#write to disk file
                $block_no = int($step/$block_size);
                my $tmp_hash_file = "tmp_hash".$block_no;
                open(OUTFILE,">$tmp_hash_file") or die "cannot write to file $tmp_hash_file!\n";
                foreach(keys %tri_hash){
                    print OUTFILE "$_\t";
                    print OUTFILE "@{$tri_hash{$_}}\n";
                }
                close(OUTFILE);#flush the block before the file is read back
                %tri_hash = ();#free memory
            }
        }#for k
    }#for j
}#for i
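my $total_file_no = int($step/$block_size);
#write out whatever is left over after the last full block
if(%tri_hash){
    $total_file_no++;
    my $tmp_hash_file = "tmp_hash".$total_file_no;
    open(OUTFILE,">$tmp_hash_file") or die "cannot write to file $tmp_hash_file!\n";
    foreach(keys %tri_hash){
        print OUTFILE "$_\t@{$tri_hash{$_}}\n";
    }
    close(OUTFILE);
}
open(OUTFILE,">$output") or die "cannot write to file $output!\n";
#for every key, scan each temporary file for its line
while(my ($my_key,$my_value)=each %total_hash_keys){
    print OUTFILE $my_key."=>";
    for(my $i=1;$i<$total_file_no + 1;$i++){
        my $hash_file = "tmp_hash".$i;
        open(INFILE,"$hash_file") or die "cannot open file $hash_file!\n";
        while(<INFILE>){
            my @array = split(/\t/,$_);
            if($array[0] eq $my_key){
                chomp ($array[1]);
                print OUTFILE $array[1];
                last;
            }
        }
        close(INFILE);
    }
    print OUTFILE "\n";
}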
sub transform{
    my ($x,$y,$z,@t) = @_;
    my $new_x=$x*$t[0]+$y*$t[3]+$z*$t[6];
    my $new_y=$x*$t[1]+$y*$t[4]+$z*$t[7];
    my $new_z=$x*$t[2]+$y*$t[5]+$z*$t[8];
    return ($new_x,$new_y,$new_z);
}
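The final loop of the script reopens and scans every temporary file once per key. A possible alternative, sketched here under the assumption that each tmp_hash file is written with its keys in sorted string order (for example with foreach(sort keys %tri_hash)), is a single-pass merge over all files at once. The names merge_sorted and read_rec are hypothetical, and the in-memory filehandles only stand in for the real files.

```perl
use strict;
use warnings;

# Hypothetical helper: read one "key\tvalues" line and return [key, values],
# or undef at end of file. An explicit undef keeps one slot per filehandle
# when this is called inside map.
sub read_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    return undef unless defined $line;
    chomp $line;
    return [ split /\t/, $line, 2 ];
}

# Hypothetical helper: merge filehandles whose lines are sorted by key
# (string order). Returns one "key=>values" string per distinct key.
sub merge_sorted {
    my @fhs  = @_;
    my @head = map { read_rec($_) } @fhs;    # one pending record per file
    my @out;
    while (grep { defined } @head) {
        # Smallest key among the pending records.
        my ($min) = sort { $a cmp $b } map { $_->[0] } grep { defined } @head;
        my @vals;
        for my $i (0 .. $#fhs) {
            next unless defined $head[$i] && $head[$i][0] eq $min;
            push @vals, $head[$i][1];
            $head[$i] = read_rec($fhs[$i]);  # advance only the files that matched
        }
        push @out, "$min=>" . join(' ', @vals);
    }
    return @out;
}

# Two small in-memory "files" standing in for tmp_hash1 and tmp_hash2.
my $file1 = "0_1_4\tA_\n3_2_1\tB_\n";
my $file2 = "3_2_1\tC_\n5_5_5\tD_\n";
open my $fh1, '<', \$file1 or die "cannot open in-memory file\n";
open my $fh2, '<', \$file2 or die "cannot open in-memory file\n";
print "$_\n" for merge_sorted($fh1, $fh2);
```

With the sample data this prints 0_1_4=>A_, then 3_2_1=>B_ C_, then 5_5_5=>D_. Each file is read exactly once and only one line per file is held in memory, so the cost no longer grows with the number of keys times the number of files.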
Re^5: how to merge many files of sorted hashes? by GrandFather (Sage) on Feb 03, 2012 at 00:26 UTC 
We are about 5% closer to understanding the bigger picture so at this point I'll give up trying to figure out how best to help you and simply toss a little database code in your direction instead:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=delme.sqlite', '');
$dbh->do('CREATE TABLE Bins (Xk INTEGER, Yk INTEGER, Zk INTEGER, Data TEXT)');
my $sql = 'INSERT INTO Bins (Xk, Yk, Zk, Data) VALUES (?, ?, ?, ?)';
my $sth = $dbh->prepare ($sql);
while (defined (my $data = <DATA>)) {
    my ($xKey, $yKey, $zKey) = split ' ', $data;
    chomp $data;
    $sth->execute((map {int} $xKey, $yKey, $zKey), $data);
}
$sql = 'SELECT * FROM Bins ORDER BY Xk, Yk, Zk';
$sth = $dbh->prepare($sql);
$sth->execute();
while (my $row = $sth->fetchrow_hashref()) {
    print "$row->{Xk}, $row->{Yk}, $row->{Zk} => $row->{Data}\n";
}
__DATA__
4.941 32.586 1.772 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205
15.354 22.823 10.556 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205
0.495 12.345 98.234 0.382_0.041_0.205_28.745_0.883_0.012_0.005_0.382_0.041
Prints:
0, 12, 98 => 0.495 12.345 98.234 0.382_0.041_0.205_28.745_0.883_0.012_0.005_0.382_0.041
4, 32, 1 => 4.941 32.586 1.772 44.368_23.583_218.345_0.983_0.012_0.005_0.382_0.041_0.205
15, 22, 10 => 15.354 22.823 10.556 56.368_2.583_28.745_0.883_0.012_0.005_0.382_0.041_0.205
True laziness is hard work

Thank you very much for your help, GrandFather. I apologize for missing the point of your question. Indeed, building a database seems a plausible thing to do given the large quantity of data I have.
I have about 10,000 such input files, each consisting of a point cloud. I am constructing a hash table for each input file, so in the end I have about 10,000 hashes. (Not all hash tables are huge, as most files have only about 20 points, as opposed to the 100 points that cause the problem I mentioned here.)
Eventually I would like to do pairwise comparisons of these hashes and look for common keys between each pair. That information will be used to compute a distance/dissimilarity measure between the two point clouds from which the hash tables were built. In the very end I hope to perform clustering on the 10,000 point clouds.
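For illustration, one way the common-key comparison could look. The post does not say which distance measure is intended, so the Jaccard distance on key sets used here is purely an assumption, and the name key_distance and the sample hashes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard distance between the key sets of two hashes.
# 0 means identical key sets, 1 means no common keys.
sub key_distance {
    my ($h1, $h2) = @_;
    my $common = grep { exists $h2->{$_} } keys %$h1;   # count of shared keys
    my $union  = keys(%$h1) + keys(%$h2) - $common;     # size of the union
    return $union ? 1 - $common / $union : 0;
}

# Two toy hash tables standing in for two point clouds.
my %cloud1 = ('3_2_1' => 1, '0_1_4' => 1, '5_5_5' => 1);
my %cloud2 = ('3_2_1' => 1, '5_5_5' => 1, '9_9_9' => 1);

printf "%.2f\n", key_distance(\%cloud1, \%cloud2);   # 0.50
```

Here two of the four distinct keys are shared, so the distance is 1 - 2/4 = 0.50. With 10,000 hashes there are about 50 million pairs, so each comparison needs to stay cheap. Iterating over the smaller of the two hashes would be an easy refinement.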

