
Re^3: how to merge many files of sorted hashes?

by GrandFather (Sage)
on Feb 02, 2012 at 21:08 UTC

in reply to Re^2: how to merge many files of sorted hashes?
in thread how to merge many files of sorted hashes?

All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.

Without knowing what you eventually want to do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.

True laziness is hard work
Re^4: how to merge many files of sorted hashes?
by andromedia33 (Novice) on Feb 02, 2012 at 22:54 UTC
    Sorry I did not make it clearer. My actual hash looks like this:
    3_2_-1 => -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100 ...
    Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'. This is how I construct a key-value pair: given a defined 3*4 matrix, I apply this transformation matrix (translation + rotation) to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01,1.98,-0.87), which are discretized to (3,2,-1). Then (3,2,-1) is used as a key that points to that transformation matrix.
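A side note on the discretization step described above: mapping (3.01,1.98,-0.87) to (3,2,-1) implies rounding to the nearest bin, whereas `int()` truncates toward zero, which would put 1.98 into bin 1 and -0.87 into bin 0. The sketch below contrasts the two. The helper names `bin_key_round` and `bin_key_trunc` are hypothetical, and the round-half-up rule is an assumption about the intended behaviour, not something stated in the post.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper: build an "x_y_z" key by rounding each coordinate
# to the nearest bin of width $grid (round half up; an assumed rule).
sub bin_key_round {
    my ($grid, @xyz) = @_;
    return join '_', map { floor($_ / $grid + 0.5) } @xyz;
}

# Hypothetical helper: same key layout, but truncating toward zero
# with int(), so every coordinate moves toward the origin.
sub bin_key_trunc {
    my ($grid, @xyz) = @_;
    return join '_', map { int($_ / $grid) } @xyz;
}

print bin_key_round(1, 3.01, 1.98, -0.87), "\n";    # prints 3_2_-1
print bin_key_trunc(1, 3.01, 1.98, -0.87), "\n";    # a different key
```

Whichever rule is chosen, it needs to be the same for every point cloud, otherwise common keys between two hashes stop being comparable.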
    Here's an example script with simplified calculations.

    my $input = $ARGV[0];
    open(INFILE,"$input") or die "cannot open file $input!\n";
    my $output = $ARGV[1];
    my %total_hash_keys=();
    my %tri_hash = ();
    #set bin width
    my $grid = 2;
    my $step = 0;
    my $block_size = 10000;
    my $block_no = 0;
    my @points;
    while(<INFILE>){
        my @array = split(/\t/,$_);
        push @points, [@array];
    }
    close(INFILE);

    #construct hash table
    for(my $i=0;$i<@points;$i++){
      for(my $j=$i+1;$j<@points;$j++){
        for(my $k=$j+1;$k<@points;$k++){
          $step++;
          my @pt1 = (${$points[$i]}[0],${$points[$i]}[1],${$points[$i]}[2]);
          my @pt2 = (${$points[$j]}[0],${$points[$j]}[1],${$points[$j]}[2]);
          my @pt3 = (${$points[$k]}[0],${$points[$k]}[1],${$points[$k]}[2]);
          #simplified calculation for the value of the hash;
          my @matrix = (@pt1,@pt2,@pt3);
          for(my $res=0;$res<@points;$res++){
            #transform coor, and bin the new coor as a generated key
            my @old_xyz = @{$points[$res]};
            my @new_xyz = transform(@old_xyz,@matrix);
            foreach(@new_xyz){$_ = int($_/$grid); }
            my $key = $new_xyz[0]."_".$new_xyz[1]."_".$new_xyz[2];
            foreach(@matrix){$_ = sprintf "%.3f",$_;}
            my $value = "";
            for(my $temp=0;$temp<@matrix;$temp++){$value .= $matrix[$temp]."_"; }
            $total_hash_keys{$key}=0;
            push @{$tri_hash{$key}},$value;
          }
          if(($step % $block_size) == 0){#write to disk file
            $block_no = int($step/$block_size);
            my $tmp_hash_file = "tmp_hash".$block_no;
            open(OUTFILE,">$tmp_hash_file") or die "cannot write to file $tmp_hash_file!\n";
            foreach(keys %tri_hash){
              print OUTFILE "$_\t";
              print OUTFILE "@{$tri_hash{$_}}\n";
            }
            %tri_hash = ();#free memory
          }
        }#for k
      }#for j
    }#for i

    my $total_file_no = int($step/$block_size);
    open(OUTFILE,">$output") or die "cannot write to file $output!\n";
    while(my ($my_key,$my_value)=each %total_hash_keys){
      print OUTFILE $my_key."=>";
      for(my $i=1;$i<$total_file_no + 1;$i++){
        my $hash_file = "tmp_hash".$i;
        open(INFILE,"$hash_file") or die;
        while(<INFILE>){
          my @array = split(/\t/,$_);
          if($array[0] eq $my_key){
            chomp ($array[1]);
            print OUTFILE $array[1];
            last;
          }
        }
        close(INFILE);
      }
      print OUTFILE "\n";
    }

    sub transform{
      my ($x,$y,$z,@t) = @_;
      my $new_x=$x*$t[0]+$y*$t[3]+$z*$t[6];
      my $new_y=$x*$t[1]+$y*$t[4]+$z*$t[7];
      my $new_z=$x*$t[2]+$y*$t[5]+$z*$t[8];
      return ($new_x,$new_y,$new_z);
    }
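The final loop in the script above reopens and rescans every temporary file once per key. If each temporary file were instead written with its keys in sorted order (for example with `sort keys %tri_hash`, which the posted script does not do), the files could be merged in a single pass. The following is a minimal sketch under that assumption. `merge_sorted_files` and `next_record` are hypothetical names, each input line is assumed to be `key<TAB>values`, and each key is assumed to appear at most once per file.

```perl
use strict;
use warnings;

# Hypothetical helper: read one "key\tvalues" line and return [key, values],
# or nothing at end of file.
sub next_record {
    my ($fh) = @_;
    defined(my $line = <$fh>) or return;
    chomp $line;
    return [ split /\t/, $line, 2 ];
}

# Hypothetical helper: merge files that are each sorted by key (string order).
# Writes one "key=>values values ..." line per distinct key to $out_fh.
sub merge_sorted_files {
    my ($out_fh, @files) = @_;

    # Open every file and load its first record.
    my @readers;
    for my $file (@files) {
        open my $fh, '<', $file or die "cannot open $file: $!";
        my $rec = next_record($fh);
        push @readers, { fh => $fh, rec => $rec } if $rec;
    }

    while (@readers) {
        # Smallest key among the current head records.
        my ($min) = sort { $a cmp $b } map { $_->{rec}[0] } @readers;

        # Collect the values from every file whose head record has that key,
        # then advance those files by one record.
        my @values;
        for my $r (@readers) {
            next unless $r->{rec}[0] eq $min;
            push @values, $r->{rec}[1];
            $r->{rec} = next_record($r->{fh});
        }
        print {$out_fh} "$min=>@values\n";

        # Drop files that have been read to the end.
        @readers = grep { $_->{rec} } @readers;
    }
    return;
}
```

Called as `merge_sorted_files($out_fh, map { "tmp_hash$_" } 1 .. $total_file_no)`, this reads each file once and holds only one record per file in memory. Picking the smallest key with `sort` is linear in the number of open files for every output key, so a heap may be worth considering when there are very many files.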

      We are about 5% closer to understanding the bigger picture so at this point I'll give up trying to figure out how best to help you and simply toss a little database code in your direction instead:

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:SQLite:dbname=delme.sqlite', '');

      $dbh->do('CREATE TABLE Bins (Xk INTEGER, Yk INTEGER, Zk INTEGER, Data TEXT)');

      my $sql = 'INSERT INTO Bins (Xk, Yk, Zk, Data) VALUES (?, ?, ?, ?)';
      my $sth = $dbh->prepare ($sql);

      while (defined (my $data = <DATA>)) {
          my ($xKey, $yKey, $zKey) = split ' ', $data;

          chomp $data;
          $sth->execute((map {int} $xKey, $yKey, $zKey), $data);
      }

      $sql = 'SELECT * FROM Bins ORDER BY Xk, Yk, Zk';
      $sth = $dbh->prepare($sql);
      $sth->execute();

      while (my $row = $sth->fetchrow_hashref()) {
          print "$row->{Xk}, $row->{Yk}, $row->{Zk} => $row->{Data}\n";
      }

      __DATA__
      4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
      15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
      -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041


      Prints:

      0, 12, 98 => -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041
      4, 32, -1 => 4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
      15, 22, 10 => 15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
      True laziness is hard work
        Thank you very much for your help, GrandFather. I apologize for missing the point of your question. Indeed, building a database seems a plausible thing to do given the large quantity of data I have.
        I have about 10,000 such input files, each consisting of a point cloud. I am constructing a hash table for each input file, so in the end I have about 10,000 hashes. (Not all hash tables are huge, as most files only have about 20 points as opposed to the 100 points that cause the problem I mentioned here.)
        Eventually I'd like to do a pairwise comparison of these hashes and look for common keys between each pair. That information will be used to compute a distance/dissimilarity measure between the two point clouds from which the compared hash tables come. In the very end I hope to perform clustering on the 10,000 sets of point clouds.
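The pairwise comparison described above can be sketched in plain Perl once two hashes fit in memory. The post does not say which dissimilarity measure is intended, so the Jaccard distance used here (1 minus shared keys over total distinct keys) is only an assumed stand-in, and `common_keys` and `jaccard_distance` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: keys present in both hashes.
# Iterates over the smaller hash and probes the larger one.
sub common_keys {
    my ($h1, $h2) = @_;
    ($h1, $h2) = ($h2, $h1) if scalar(keys %$h1) > scalar(keys %$h2);
    return grep { exists $h2->{$_} } keys %$h1;
}

# Hypothetical helper: Jaccard distance between the two key sets
# (an assumed measure; 0 means identical key sets, 1 means disjoint).
sub jaccard_distance {
    my ($h1, $h2) = @_;
    my $common = () = common_keys($h1, $h2);
    my $union  = scalar(keys %$h1) + scalar(keys %$h2) - $common;
    return $union ? 1 - $common / $union : 0;
}

# Two tiny made-up key sets; the values are irrelevant for the comparison.
my %cloud1 = map { $_ => 1 } qw(3_2_-1 0_12_98 4_32_-1);
my %cloud2 = map { $_ => 1 } qw(3_2_-1 4_32_-1 15_22_10);

printf "%.2f\n", jaccard_distance(\%cloud1, \%cloud2);    # prints 0.50
```

With about 10,000 hashes there are roughly 50 million pairs, so loading every pair from disk this way may be too slow. If the keys live in a database table, the same count of shared keys can be obtained with a join instead.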
