http://www.perlmonks.org?node_id=951519


in reply to how to merge many files of sorted hashes?

It may help to modify your many-files plan by using a database instead.

If you can show us a demonstration version of the code and a small sample of the input data we may be able to show you how to handle the database side of a solution. An even better trick may be to make a sample data generator so we can scale the test data input according to our degree of patience.
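
Something like this would do as a generator (just a sketch: the script name, point count, coordinate range, and tab-separated x y z format are all guesses at your setup):

use strict;
use warnings;

# gen_points.pl (invented name): emit N random 3-D points,
# tab-separated, one point per line.
my $n     = shift || 100;    # number of points to generate
my $range = 100;             # coordinates drawn from [-$range, +$range]

for (1 .. $n) {
    printf "%.3f\t%.3f\t%.3f\n",
           map { ($_ * 2 - 1) * $range } rand(), rand(), rand();
}

Run it as perl gen_points.pl 1000 > points.txt and scale the argument to match your patience.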

True laziness is hard work

Re^2: how to merge many files of sorted hashes?
by andromedia33 (Novice) on Feb 02, 2012 at 20:46 UTC
    Thanks for your reply! I start with 100 points, each with x, y, z coordinates.
    4.941 32.586 -1.772
    15.354 22.823 10.556
    -0.495 12.345 98.234
    ...

    $block_size = 1000;
    $counter    = 0;
    %small_hash = ();
    foreach @triplet of the 100 points {
        $counter++;
        my @matrix = ortho(@triplet);
        foreach $point in the 100 points {
            my $key = binning(@matrix * $point);
            push @{$small_hash{$key}}, @matrix;
        }
        # binning causes many points to fall in the same grid,
        # thus the array of values (matrices) for each key
        if ($counter % $block_size == 0) {
            write out %small_hash;
            %small_hash = ();    # frees the memory
        }
    }

    # now merging small hash files
    foreach $key (sort keys %master_hash) {    # keys for master_hash stored
        loop thru all small files {
            while (<FILE>) {
                if ($_ =~ /^$key/) {
                    write values to the master output file;
                    last;
                } # if
            } # while
        } # go thru files
    }
    So that's roughly the structure (the actual code is too long to post here). The first half is really fast to compute, but once I start merging files it takes forever. Thanks a lot!

      All I learned from that about the big picture is that you are dealing with coordinates and you have some (undisclosed) way of making a key from the coordinates for each "something" you want to store. I still have no idea about what you actually want to store in a hashy thing, nor what you intend to do with it once stored.

      Without knowing what you want to eventually do with this stuff, I still think using a database (DBI and DBD::SQLite) is likely to be a good solution to the problem. But, again, if you can mock up something close to your real problem in 30 or so lines (fudge the calculations etc.) then we can probably give you a good starting point for the data management code.
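
      To make that concrete, here is a rough sketch of the sort of thing I mean with DBD::SQLite; the database file, table and column names are made up, and the key/value layout is only a guess from your description:

      use strict;
      use warnings;
      use DBI;

      # One row per (bin key, matrix) pair; "bins" and its columns are invented names.
      my $dbh = DBI->connect("dbi:SQLite:dbname=matrices.db", "", "",
                             { RaiseError => 1, AutoCommit => 0 });

      $dbh->do("CREATE TABLE IF NOT EXISTS bins (bin TEXT NOT NULL, matrix TEXT NOT NULL)");
      $dbh->do("CREATE INDEX IF NOT EXISTS bins_bin ON bins (bin)");

      my $ins  = $dbh->prepare("INSERT INTO bins (bin, matrix) VALUES (?, ?)");
      my $rows = 0;

      # wherever you now do:  push @{$small_hash{$key}}, $value;
      # you would instead do:
      #     $ins->execute($key, $value);
      #     $dbh->commit unless ++$rows % 10_000;   # batched commits keep SQLite fast
      $dbh->commit;

      # and the whole merge step becomes one ordered query:
      my $sth = $dbh->prepare("SELECT bin, matrix FROM bins ORDER BY bin");
      $sth->execute;
      while (my ($bin, $matrix) = $sth->fetchrow_array) {
          print "$bin\t$matrix\n";
      }
      $dbh->disconnect;

      The win is that SQLite keeps the index on disk for you, so the final "merge" collapses into a single ORDER BY query instead of re-scanning every temporary file.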

      True laziness is hard work
        Sorry I have not made it clearer. My actual hash looks like this:
        3_2_-1 => -44.368_23.583_-218.345_0.983_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100
                  -56.368_2.583_-28.745_0.883_-0.012_0.005_-0.382_0.041_0.205_-0.538_-0.876_0.100
                  ...
        Each element in the array of values holds the entries of a 3*4 matrix, and each key can point to a few hundred such 'matrices'. I construct a key-value pair like this: given a 3*4 transformation matrix (translation + rotation), I apply it to the coordinates of each of the 100 points and obtain the new coordinates of that point, say (3.01, 1.98, -0.87), which is discretized to (3, 2, -1). Then (3, 2, -1) is used as a key that points to that transformation matrix.
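        For what it's worth, the discretization step on its own might look like this sketch; it assumes rounding to the nearest integer, which is what turns (3.01, 1.98, -0.87) into (3, 2, -1), while the full script below divides by a grid width and truncates with int().

        use strict;
        use warnings;

        # Build a hash key from one transformed point by rounding each
        # coordinate to the nearest integer and joining with underscores.
        sub bin_key {
            my @xyz = @_;
            return join '_', map { sprintf '%.0f', $_ } @xyz;
        }

        print bin_key(3.01, 1.98, -0.87), "\n";    # prints "3_2_-1"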
        Here's an example script with simplified calculations:
        use strict;
        use warnings;

        my $input = $ARGV[0];
        open(INFILE, "$input") or die "cannot open file $input!\n";
        my $output = $ARGV[1];
        my %total_hash_keys = ();
        my %tri_hash = ();
        # set bin width
        my $grid = 2;
        my $step = 0;
        my $block_size = 10000;
        my $block_no = 0;
        my @points;
        while (<INFILE>) {
            my @array = split(/\t/, $_);
            push @points, [@array];
        }
        close(INFILE);

        # construct hash table
        for (my $i = 0; $i < @points; $i++) {
            for (my $j = $i + 1; $j < @points; $j++) {
                for (my $k = $j + 1; $k < @points; $k++) {
                    $step++;
                    my @pt1 = (${$points[$i]}[0], ${$points[$i]}[1], ${$points[$i]}[2]);
                    my @pt2 = (${$points[$j]}[0], ${$points[$j]}[1], ${$points[$j]}[2]);
                    my @pt3 = (${$points[$k]}[0], ${$points[$k]}[1], ${$points[$k]}[2]);
                    # simplified calculation for the value of the hash
                    my @matrix = (@pt1, @pt2, @pt3);
                    for (my $res = 0; $res < @points; $res++) {
                        # transform coordinates, and bin the new coordinates as a generated key
                        my @old_xyz = @{$points[$res]};
                        my @new_xyz = transform(@old_xyz, @matrix);
                        foreach (@new_xyz) { $_ = int($_ / $grid); }
                        my $key = $new_xyz[0] . "_" . $new_xyz[1] . "_" . $new_xyz[2];
                        foreach (@matrix) { $_ = sprintf "%.3f", $_; }
                        my $value = "";
                        for (my $temp = 0; $temp < @matrix; $temp++) {
                            $value .= $matrix[$temp] . "_";
                        }
                        $total_hash_keys{$key} = 0;
                        push @{$tri_hash{$key}}, $value;
                    }
                    if (($step % $block_size) == 0) {    # write a block to a disk file
                        $block_no = int($step / $block_size);
                        my $tmp_hash_file = "tmp_hash" . $block_no;
                        open(OUTFILE, ">$tmp_hash_file")
                            or die "cannot write to file $tmp_hash_file!\n";
                        foreach (keys %tri_hash) {
                            print OUTFILE "$_\t";
                            print OUTFILE "@{$tri_hash{$_}}\n";
                        }
                        close(OUTFILE);
                        %tri_hash = ();    # free memory
                    }
                } # for k
            } # for j
        } # for i

        # merge: for each key, scan every temporary file for its values
        my $total_file_no = int($step / $block_size);
        open(OUTFILE, ">$output") or die "cannot write to file $output!\n";
        while (my ($my_key, $my_value) = each %total_hash_keys) {
            print OUTFILE $my_key . "=>";
            for (my $i = 1; $i < $total_file_no + 1; $i++) {
                my $hash_file = "tmp_hash" . $i;
                open(INFILE, "$hash_file") or die "cannot open file $hash_file!\n";
                while (<INFILE>) {
                    my @array = split(/\t/, $_);
                    if ($array[0] eq $my_key) {
                        chomp($array[1]);
                        print OUTFILE $array[1];
                        last;
                    }
                }
                close(INFILE);
            }
            print OUTFILE "\n";
        }
        close(OUTFILE);

        sub transform {
            my ($x, $y, $z, @t) = @_;
            my $new_x = $x * $t[0] + $y * $t[3] + $z * $t[6];
            my $new_y = $x * $t[1] + $y * $t[4] + $z * $t[7];
            my $new_z = $x * $t[2] + $y * $t[5] + $z * $t[8];
            return ($new_x, $new_y, $new_z);
        }
      So let's see if I'm reading this right:
      • you have 100 points
      • for every possible triple (1,000,000 combinations) you're computing a matrix
      • each matrix gets pushed 100 times (onto 100 different keys)
      • you start a new file every 1000 matrices (so there'll be 1000 files when you're done, and each file will have 100,000 matrices)
      but you don't say how many distinct keys there are (i.e., how finely binning is grouping things).

      But it doesn't matter. For every single key you are doing a linear search through every file. So that's (number of keys) * 100,000,000 matrices to wade through; no wonder this is slow.

      You need to be writing each %small_hash in sorted order. Sorting 1000 small hashes will be faster than sorting the big hash (1000 * (n/1000) * log(n/1000) = n log(n/1000) < n log(n)), but the really important thing is that this enables the merge I was talking about before, which reads each file only once rather than once for every key value.
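
      A minimal sketch of that merge, assuming each tmp_hash file has already been written with its keys in plain string-sorted order, one line per key containing the key, a tab, then all of its values:

      use strict;
      use warnings;

      # N-way merge of pre-sorted "key\tvalues" files: each file is read
      # exactly once and all values for a given key come out together.
      my @files = @ARGV;                        # e.g. tmp_hash1 .. tmp_hash1000
      my (@fh, @head);

      for my $i (0 .. $#files) {
          open $fh[$i], '<', $files[$i] or die "cannot open $files[$i]: $!";
          $head[$i] = readline $fh[$i];         # prime with the first line
      }

      while (1) {
          # find the smallest key among the files' current front lines
          my $min;
          for my $i (0 .. $#files) {
              next unless defined $head[$i];
              my ($key) = split /\t/, $head[$i];
              $min = $key if !defined $min || $key lt $min;
          }
          last unless defined $min;             # all files exhausted

          # emit, then advance, every file whose front line has that key
          print "$min =>";
          for my $i (0 .. $#files) {
              next unless defined $head[$i];
              my ($key, $values) = split /\t/, $head[$i], 2;
              next unless $key eq $min;
              chomp $values;
              print " $values";
              $head[$i] = readline $fh[$i];     # undef at EOF
          }
          print "\n";
      }

      Because every file pointer only ever moves forward, the whole pass is linear in the total amount of data instead of (number of keys) * (number of files) scans.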