PerlMonks  

Re^4: how to merge many files of sorted hashes?

by andromedia33 (Novice)
on Feb 02, 2012 at 22:54 UTC (#951549)


in reply to Re^3: how to merge many files of sorted hashes?
in thread how to merge many files of sorted hashes?

Sorry I did not make it clearer. My actual hash looks like this:

3_2_-1 => -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100 ...

Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'. This is how I construct a key-value pair: given a defined 3*4 transformation matrix (translation + rotation), I apply it to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01,1.98,-0.87), which are discretized to (3,2,-1). Then (3,2,-1) is used as a key that points to that transformation matrix.

Here's an example script with simplified calculations.
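The discretization step described above can be sketched as follows. This is only a minimal illustration: `bin_key` is a hypothetical helper name, and it assumes that "discretized" means rounding each coordinate to the nearest multiple of the bin width, which is what the (3.01,1.98,-0.87) to (3,2,-1) example suggests.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: snap each coordinate to the nearest multiple of
# $width and join the resulting bin indices with "_" to form a hash key.
sub bin_key {
    my ($width, @xyz) = @_;
    return join '_', map { floor($_ / $width + 0.5) } @xyz;
}

print bin_key(1, 3.01, 1.98, -0.87), "\n";    # prints 3_2_-1
```

Note that `int($x / $width)` truncates toward zero instead of rounding, so with a bin width of 1 it would put 1.98 into bin 1 and -0.87 into bin 0.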
use strict;
use warnings;

my $input  = $ARGV[0];
my $output = $ARGV[1];
open(my $in, '<', $input) or die "cannot open file $input: $!\n";

my %total_hash_keys = ();
my %tri_hash        = ();
my $grid       = 2;        # bin width
my $step       = 0;
my $block_size = 10000;    # number of triples between writes to disk
my $block_no   = 0;

my @points;
while (<$in>) {
    chomp;
    my @array = split(/\t/, $_);
    push @points, [@array];
}
close($in);

# construct hash table
for (my $i = 0; $i < @points; $i++) {
    for (my $j = $i + 1; $j < @points; $j++) {
        for (my $k = $j + 1; $k < @points; $k++) {
            $step++;
            my @pt1 = @{ $points[$i] }[0 .. 2];
            my @pt2 = @{ $points[$j] }[0 .. 2];
            my @pt3 = @{ $points[$k] }[0 .. 2];
            # simplified calculation for the value of the hash
            my @matrix = (@pt1, @pt2, @pt3);
            for (my $res = 0; $res < @points; $res++) {
                # transform coor, and bin the new coor as a generated key
                # (only x, y, z are passed, so extra columns cannot shift @t)
                my @old_xyz = @{ $points[$res] }[0 .. 2];
                my @new_xyz = transform(@old_xyz, @matrix);
                foreach (@new_xyz) { $_ = int($_ / $grid); }
                my $key = join('_', @new_xyz);
                foreach (@matrix) { $_ = sprintf "%.3f", $_; }
                my $value = join('', map { $_ . "_" } @matrix);
                $total_hash_keys{$key} = 0;
                push @{ $tri_hash{$key} }, $value;
            }
            if (($step % $block_size) == 0) {    # write to disk file
                $block_no = int($step / $block_size);
                write_block($block_no, \%tri_hash);
                %tri_hash = ();                  # free memory
            }
        }    # for k
    }    # for j
}    # for i

# write the final partial block, which the modulo test above never reaches
my $total_file_no = int($step / $block_size);
if (%tri_hash) {
    $total_file_no++;
    write_block($total_file_no, \%tri_hash);
    %tri_hash = ();
}

# for every key, collect its values from each temporary file
open(my $out, '>', $output) or die "cannot write to file $output: $!\n";
foreach my $my_key (keys %total_hash_keys) {
    print $out $my_key . "=>";
    for (my $i = 1; $i <= $total_file_no; $i++) {
        my $hash_file = "tmp_hash" . $i;
        open(my $fh, '<', $hash_file) or die "cannot open file $hash_file: $!\n";
        while (<$fh>) {
            chomp;
            my @array = split(/\t/, $_);
            if ($array[0] eq $my_key) {
                print $out $array[1], " ";    # space keeps values from different files apart
                last;
            }
        }
        close($fh);
    }
    print $out "\n";
}
close($out);

# write one block of the hash to tmp_hash<N>, one "key<TAB>values" line per key
sub write_block {
    my ($n, $hash_ref) = @_;
    my $tmp_hash_file = "tmp_hash" . $n;
    open(my $tmp, '>', $tmp_hash_file) or die "cannot write to file $tmp_hash_file: $!\n";
    foreach (keys %$hash_ref) {
        print $tmp "$_\t@{ $hash_ref->{$_} }\n";
    }
    close($tmp);
}

sub transform {
    my ($x, $y, $z, @t) = @_;
    my $new_x = $x * $t[0] + $y * $t[3] + $z * $t[6];
    my $new_y = $x * $t[1] + $y * $t[4] + $z * $t[7];
    my $new_z = $x * $t[2] + $y * $t[5] + $z * $t[8];
    return ($new_x, $new_y, $new_z);
}
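The final loop above rescans every temporary file once per key. One possible alternative is a k-way merge, sketched below under stated assumptions: each temporary file holds one "key<TAB>values" line per key, and its lines are already sorted by key in plain string order (for example by writing `sort keys %tri_hash`). The names `merge_sorted_files` and `read_record` are hypothetical and not part of the script above.

```perl
use strict;
use warnings;

# Merge files whose lines look like "key\tvalues" and are already sorted
# by key in string order. Values found under the same key in different
# files are joined with a space.
sub merge_sorted_files {
    my ($out_file, @in_files) = @_;

    # One open handle and one pending [key, values] record per input file.
    my (@fh, @head);
    for my $i (0 .. $#in_files) {
        open($fh[$i], '<', $in_files[$i]) or die "cannot open $in_files[$i]: $!\n";
        $head[$i] = read_record($fh[$i]);
    }

    open(my $out, '>', $out_file) or die "cannot write $out_file: $!\n";
    while (1) {
        # Smallest key among the pending records (undef once all files are done).
        my ($min) = sort map { $_->[0] } grep { defined $_ } @head;
        last unless defined $min;

        my @values;
        for my $i (0 .. $#head) {
            # Consume every record carrying that key, then advance the file.
            while (defined $head[$i] && $head[$i][0] eq $min) {
                push @values, $head[$i][1];
                $head[$i] = read_record($fh[$i]);
            }
        }
        print $out "$min=>@values\n";
    }
    close($_) for $out, @fh;
}

# Read one "key\tvalues" line; return [key, values], or undef at end of file.
sub read_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    my ($key, $values) = split /\t/, $line, 2;
    return [$key, defined $values ? $values : ''];
}
```

A call such as `merge_sorted_files($output, map { "tmp_hash$_" } 1 .. $total_file_no)` would then read each temporary file exactly once. The sketch picks the smallest key by sorting the pending keys on every step, which is fine for a handful of files; with many files a heap would be the usual choice.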


Re^5: how to merge many files of sorted hashes?
by GrandFather (Cardinal) on Feb 03, 2012 at 00:26 UTC

    We are about 5% closer to understanding the bigger picture so at this point I'll give up trying to figure out how best to help you and simply toss a little database code in your direction instead:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=delme.sqlite', '');

    $dbh->do('CREATE TABLE Bins (Xk INTEGER, Yk INTEGER, Zk INTEGER, Data TEXT)');

    my $sql = 'INSERT INTO Bins (Xk, Yk, Zk, Data) VALUES (?, ?, ?, ?)';
    my $sth = $dbh->prepare ($sql);

    while (defined (my $data = <DATA>)) {
        my ($xKey, $yKey, $zKey) = split ' ', $data;

        chomp $data;
        $sth->execute((map {int} $xKey, $yKey, $zKey), $data);
    }

    $sql = 'SELECT * FROM Bins ORDER BY Xk, Yk, Zk';
    $sth = $dbh->prepare($sql);
    $sth->execute();

    while (my $row = $sth->fetchrow_hashref()) {
        print "$row->{Xk}, $row->{Yk}, $row->{Zk} => $row->{Data}\n";
    }

    __DATA__
    4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
    15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
    -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041

    Prints:

    0, 12, 98 => -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041
    4, 32, -1 => 4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
    15, 22, 10 => 15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
    True laziness is hard work
      Thank you very much for your help, GrandFather. I apologize for missing the point of your question. Indeed, building a database seems a plausible thing to do given the large quantity of data I have.
      I have about 10,000 such input files, each consisting of a point cloud. I am constructing a hash table for each input file, so in the end I have about 10,000 hashes. (Not all hash tables are huge, as most files have only about 20 points, as opposed to the 100 points that cause the problem I mentioned here.)
      Eventually, I'd like to do pairwise comparisons of these hashes and look for common keys between each pair. That information will be used to compute a distance/dissimilarity measure between the two point clouds from which the compared hash tables come. In the very end I hope to perform clustering on the 10,000 sets of point clouds.
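The pairwise comparison described above could look roughly like the sketch below. The post does not say which measure is intended, so the Jaccard distance used here is purely an assumption, and `key_distance`, `%cloud_a` and `%cloud_b` are hypothetical names with made-up sample keys.

```perl
use strict;
use warnings;

# Hypothetical measure (an assumption, not taken from the post): Jaccard
# distance between the key sets of two hashes,
# i.e. 1 - (number of common keys) / (number of distinct keys overall).
sub key_distance {
    my ($h1, $h2) = @_;
    my $common = grep { exists $h2->{$_} } keys %$h1;    # count of shared keys
    my $union  = keys(%$h1) + keys(%$h2) - $common;
    return $union ? 1 - $common / $union : 0;
}

# Made-up sample data: two clouds sharing two of their three keys.
my %cloud_a = ('3_2_-1' => ['m1'], '0_1_4' => ['m2'], '7_7_0' => ['m3']);
my %cloud_b = ('3_2_-1' => ['m4'], '0_1_4' => ['m5'], '5_5_5' => ['m6']);

printf "%.2f\n", key_distance(\%cloud_a, \%cloud_b);    # prints 0.50
```

Only the keys are compared here; a measure that also looks at the matrices stored under the common keys would need those values loaded as well.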
