http://www.perlmonks.org?node_id=951528


in reply to Re: how to merge many files of sorted hashes?
in thread how to merge many files of sorted hashes?

Thanks for your reply! I start with 100 points, each with x, y, z coordinates:
4.941   32.586   -1.772
15.354  22.823   10.556
-0.495  12.345   98.234
...

$block_size = 1000;
$counter    = 0;
%small_hash = ();
foreach @triplet of the 100 points {
    $counter++;
    my @matrix = ortho(@triplet);
    foreach $point in the 100 points {
        my $key = binning(@matrix * $point);
        push @{$small_hash{$key}}, @matrix;
    }
    # binning causes many points to fall in the same grid,
    # thus the array of values (matrices) for each key
    if ($counter % $block_size == 0) {
        write out %small_hash;
        %small_hash = ();   # frees the memory
    }
}

# now merging small hash files
foreach $key (sort keys %master_hash) {   # keys for master_hash stored
    loop thru all small files {
        while (<FILE>) {
            if ($_ =~ /^$key/) {
                write values to the master output file;
                last;
            } # if
        } # while
    } # go thru files
}
So that's roughly the structure (the actual code is too long to post here). The first half is really fast to compute, but once I start merging files it takes forever. Thanks a lot!

Re^3: how to merge many files of sorted hashes?
by GrandFather (Saint) on Feb 02, 2012 at 21:08 UTC

    All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.

    Without knowing what you want to eventually do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.

    True laziness is hard work
      Sorry I have not made it clearer. My actual hash looks like this:
      3_2_-1 => -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100
                -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100
                ...
      Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'. I construct a key-value pair as follows: given a defined 3*4 transformation matrix (translation + rotation), I apply it to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01, 1.98, -0.87), which is discretized to (3, 2, -1). Then (3, 2, -1) is used as a key that points to that transformation matrix.
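      For concreteness, a minimal sketch of just that binning step (a hypothetical fragment, not code from my actual script). Note that the example (3.01, 1.98, -0.87) -> (3, 2, -1) implies rounding to the nearest integer, whereas int() in the script below truncates toward zero, so the two can disagree on which bin a point lands in:

      # discretize a transformed point and build the hash key
      my @new_xyz = (3.01, 1.98, -0.87);   # transformed coordinates
      my $key = join '_', map { sprintf '%.0f', $_ } @new_xyz;
      # $key is now "3_2_-1"; the flattened matrix string is then
      # stored under it: push @{$tri_hash{$key}}, $value;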
      Here's an example script with simplified calculations:
      my $input = $ARGV[0];
      open(INFILE, "$input") or die "cannot open file $input!\n";
      my $output = $ARGV[1];
      my %total_hash_keys = ();
      my %tri_hash = ();

      # set bin width
      my $grid = 2;
      my $step = 0;
      my $block_size = 10000;
      my $block_no = 0;

      my @points;
      while (<INFILE>) {
          chomp;
          my @array = split(/\t/, $_);
          push @points, [@array];
      }
      close(INFILE);

      # construct hash table
      for (my $i = 0; $i < @points; $i++) {
          for (my $j = $i + 1; $j < @points; $j++) {
              for (my $k = $j + 1; $k < @points; $k++) {
                  $step++;
                  my @pt1 = (${$points[$i]}[0], ${$points[$i]}[1], ${$points[$i]}[2]);
                  my @pt2 = (${$points[$j]}[0], ${$points[$j]}[1], ${$points[$j]}[2]);
                  my @pt3 = (${$points[$k]}[0], ${$points[$k]}[1], ${$points[$k]}[2]);

                  # simplified calculation for the value of the hash
                  my @matrix = (@pt1, @pt2, @pt3);
                  for (my $res = 0; $res < @points; $res++) {
                      # transform coor, and bin the new coor as a generated key
                      my @old_xyz = @{$points[$res]};
                      my @new_xyz = transform(@old_xyz, @matrix);
                      foreach (@new_xyz) { $_ = int($_ / $grid); }
                      my $key = $new_xyz[0] . "_" . $new_xyz[1] . "_" . $new_xyz[2];
                      foreach (@matrix) { $_ = sprintf "%.3f", $_; }
                      my $value = "";
                      for (my $temp = 0; $temp < @matrix; $temp++) {
                          $value .= $matrix[$temp] . "_";
                      }
                      $total_hash_keys{$key} = 0;
                      push @{$tri_hash{$key}}, $value;
                  }
                  if (($step % $block_size) == 0) {   # write to disk file
                      $block_no = int($step / $block_size);
                      my $tmp_hash_file = "tmp_hash" . $block_no;
                      open(OUTFILE, ">$tmp_hash_file")
                          or die "cannot write to file $tmp_hash_file!\n";
                      foreach (keys %tri_hash) {
                          print OUTFILE "$_\t";
                          print OUTFILE "@{$tri_hash{$_}}\n";
                      }
                      %tri_hash = ();   # free memory
                  }
              } # for k
          } # for j
      } # for i

      # now merge: for each known key, scan every tmp_hash file for it
      my $total_file_no = int($step / $block_size);
      open(OUTFILE, ">$output") or die "cannot write to file $output!\n";
      while (my ($my_key, $my_value) = each %total_hash_keys) {
          print OUTFILE $my_key . "=>";
          for (my $i = 1; $i < $total_file_no + 1; $i++) {
              my $hash_file = "tmp_hash" . $i;
              open(INFILE, "$hash_file") or die;
              while (<INFILE>) {
                  my @array = split(/\t/, $_);
                  if ($array[0] eq $my_key) {
                      chomp($array[1]);
                      print OUTFILE $array[1];
                      last;
                  }
              }
              close(INFILE);
          }
          print OUTFILE "\n";
      }

      sub transform {
          my ($x, $y, $z, @t) = @_;
          my $new_x = $x * $t[0] + $y * $t[3] + $z * $t[6];
          my $new_y = $x * $t[1] + $y * $t[4] + $z * $t[7];
          my $new_z = $x * $t[2] + $y * $t[5] + $z * $t[8];
          return ($new_x, $new_y, $new_z);
      }

        We are about 5% closer to understanding the bigger picture, so at this point I'll give up trying to figure out how best to help you and simply toss a little database code in your direction instead:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=delme.sqlite', '');
        $dbh->do('CREATE TABLE Bins (Xk INTEGER, Yk INTEGER, Zk INTEGER, Data TEXT)');

        my $sql = 'INSERT INTO Bins (Xk, Yk, Zk, Data) VALUES (?, ?, ?, ?)';
        my $sth = $dbh->prepare($sql);

        while (defined(my $data = <DATA>)) {
            my ($xKey, $yKey, $zKey) = split ' ', $data;
            chomp $data;
            $sth->execute((map {int} $xKey, $yKey, $zKey), $data);
        }

        $sql = 'SELECT * FROM Bins ORDER BY Xk, Yk, Zk';
        $sth = $dbh->prepare($sql);
        $sth->execute();

        while (my $row = $sth->fetchrow_hashref()) {
            print "$row->{Xk}, $row->{Yk}, $row->{Zk} => $row->{Data}\n";
        }

        __DATA__
        4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
        15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
        -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041

        Prints:

        0, 12, 98 => -0.495 12.345 98.234 -0.382_0.041_0.205_-28.745_0.883_-0.012_0.005_-0.382_0.041
        4, 32, -1 => 4.941 32.586 -1.772 -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205
        15, 22, 10 => 15.354 22.823 10.556 -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205
        True laziness is hard work
Re^3: how to merge many files of sorted hashes?
by wrog (Friar) on Feb 02, 2012 at 23:25 UTC
    so let's see if I'm reading this right:
    • you have 100 points
    • for every possible triple (about 1,000,000 combinations) you're computing a matrix
    • each matrix gets pushed 100 times (onto 100 different keys)
    • you start a new file every 1000 matrices (so there'll be 1000 files when you're done, and each file will have 100,000 matrices)
    but you don't say how many distinct keys there are (i.e., how finely binning is grouping things).

    But it doesn't matter. For every single key you are doing a linear search through every file. So that's (number of keys) * 100,000,000 matrices to wade through; no wonder this is slow.

    You need to be writing each %small_hash in sorted order. Sorting 1000 small_hashes will be faster than sorting the big hash (1000*(n/1000)*log(n/1000) = n*log(n/1000) < n*log(n)), but the really important thing is that this enables the merge I was talking about before, which only reads each file once rather than once for every key value.
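    For illustration, here is a minimal sketch of that merge in Perl. It assumes each tmp_hash file was written with its "key\tvalues" lines sorted by the same string order (e.g. via "for my $key (sort keys %small_hash)"); the file names, count, and output name are placeholders:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # hypothetical file list; substitute however many tmp_hash files exist
    my @files = map { "tmp_hash$_" } 1 .. 10;

    my (@fhs, @heads);
    for my $file (@files) {
        open my $fh, '<', $file or die "cannot open $file: $!";
        push @fhs, $fh;
        push @heads, scalar readline $fh;   # prime with each file's first line
    }

    open my $out, '>', 'merged' or die "cannot write merged: $!";
    while (grep { defined } @heads) {
        # find the smallest key currently at the head of any file
        my $min;
        for my $head (@heads) {
            next unless defined $head;
            my ($key) = split /\t/, $head;
            $min = $key if !defined $min or $key lt $min;
        }
        # emit every head line carrying that key and advance only those
        # files, so each file is read exactly once, front to back
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            my ($key) = split /\t/, $heads[$i];
            next unless $key eq $min;
            print {$out} $heads[$i];
            $heads[$i] = readline $fhs[$i];   # undef at end of file
        }
    }
    close $_ for @fhs, $out;

    Because each head line is consumed at most once, the total I/O is proportional to the combined size of the files rather than (number of keys) * (total size).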