http://www.perlmonks.org?node_id=1209427


in reply to creating and managing many hashes

It would be completely crazy to do this without using a database. See for example DBD::SQLite (a Perl package that bundles both the SQLite database engine and the Perl driver, so there is no separate server to install).


The way forward always starts with a minimal test.

Re^2: creating and managing many hashes
by Gtforce (Sexton) on Feb 18, 2018 at 12:48 UTC

    My current approach is going off flat files, and I think I need just two asynchronous processes resulting in two flat files. The first process looks at the products and works out the distinct pairs (i.e., going from 2,000 distinct products to 2 million distinct pairs).

    pair 1: apple orange
    pair 2: apple banana
    pair 3: apple grape
    pair 4: orange banana
    pair 5: orange grape
    pair 6: banana grape
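For illustration, the pairing step described above could be sketched roughly as follows. The sub name `distinct_pairs` and the four-item product list are hypothetical; a real run would read the product names from the input file instead.

```perl
use strict;
use warnings;

# Hypothetical product list; the real one would come from the input file.
my @products = qw(apple orange banana grape);

# Return every unordered pair exactly once as [first, second] array refs.
# For n items this yields n*(n-1)/2 pairs.
sub distinct_pairs {
    my @items = @_;
    my @pairs;
    for my $i (0 .. $#items - 1) {
        for my $j ($i + 1 .. $#items) {
            push @pairs, [ $items[$i], $items[$j] ];
        }
    }
    return @pairs;
}

my @pairs = distinct_pairs(@products);
my $n = 0;
printf "pair %d: %s %s\n", ++$n, @$_ for @pairs;
```

Because each pair is emitted only once, 2,000 products give 2000*1999/2 = 1,999,000 pairs, which is where the "2 million" figure comes from.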

    The above flat file serves as the input to the second process which looks at the price and inventory for each product and pair to work out the stats. The end result of this second process is written into the final flat file.

    My experience with Perl, and with programming in general, is about 3 months. The approach I'm taking looks neat and simple to me from a solution perspective (coding it "feels" a little different, but I'm keen to put in the effort to keep the code simple too). My reluctance to use an RDBMS is that I'd probably have to teach myself the how-to, especially for when things break and fall apart.

      To give us some more detail, run this code against your data and post the output summary (not the report.txt file).

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $t0 = time();
      my $infile = 'products.txt';
      my %data = ();
      my %total = ();
      my $records = 0;

      # read records: date, product, price, quantity (whitespace separated)
      open IN,'<',$infile or die "Could not open $infile $!";
      while (<IN>){
        my ($date, $product, $price, $qu) = split /\s+/,$_;
        $data{$date}{$product}{'price'} = $price;
        $data{$date}{$product}{'qu'} = $qu;
        $total{$product}{'count'} += 1;
        $total{$product}{'price'}{'sum'} += $price;
        $total{$product}{'qu'}{'sum'} += $qu;
        ++$records;
      }
      close IN;

      # calculate stats
      my $outfile = 'report.txt';
      open OUT,'>',$outfile or die "Could not open $outfile $!";
      for my $prod (keys %total){

        my $count = $total{$prod}{'count'};

        # mean
        $total{$prod}{'price'}{'mean'} = $total{$prod}{'price'}{'sum'}/$count;
        $total{$prod}{'qu'}{'mean'} = $total{$prod}{'qu'}{'sum'}/$count;

        # std dev squared
        my ($sum_x2,$sum_y2) = (0,0);
        for my $date (keys %data){
          # skip dates with no record for this product
          next unless exists $data{$date}{$prod};
          my $x = $data{$date}{$prod}{'price'} - $total{$prod}{'price'}{'mean'};
          $sum_x2 += ($x*$x);
          my $y = $data{$date}{$prod}{'qu'} - $total{$prod}{'qu'}{'mean'};
          $sum_y2 += ($y*$y);
        }
        $total{$prod}{'price'}{'stddev'} = sprintf "%.4f",sqrt($sum_x2/$count);
        $total{$prod}{'qu'}{'stddev'} = sprintf "%.4f",sqrt($sum_y2/$count);

        my $line = join "\t",$prod,
          $total{$prod}{'price'}{'mean'},
          $total{$prod}{'price'}{'stddev'},
          $total{$prod}{'qu'}{'mean'},
          $total{$prod}{'qu'}{'stddev'};

        print OUT $line."\n";
      }
      close OUT;

      # summary
      my $dur = time - $t0;
      printf "Products : %d\nDates    : %d\nRecords  : %d\nRun Time : %d s\n",
        0+keys %total, 0+keys %data, $records, $dur;

      Update - code to create a 75MB test file

      open OUT,'>','products.txt' or die "$!";
      # days per month, index 1..12; February is adjusted per year below
      my @d = (0,31,28,31,30,31,30,31,31,30,31,30,31);
      for my $p ('0001'..'2000'){
        my $product = "product_$p";
        for my $y (2015..2017){
          $d[2] = ($y % 4) ? 28 : 29;
          for my $m (1..12){
            for my $d (1..$d[$m]){
              my $date = sprintf "%04d-%02d-%02d",$y,$m,$d;
              my $price = int rand(500);
              my $qu = int rand(90_000);
              print OUT "$date\t$product\t$price\t$qu\n";
            }
          }
        }
      }
      close OUT;

      On my i5 desktop it takes about 5 seconds to correlate the price of 1 product against the other 1999. I guess 2 million pairs (roughly 1,000 such runs, so about 5,000 seconds) would take less than 2 hours.

      poj
        Products : 2008
        Dates : 530
        Records : 867434
        Run Time : 57 s

        Thanks, poj. I'm new to hashes. The snippet of code you've provided calculates the mean and std dev, but not the pair correlation. Am I correct in assuming that this bit will need to be built in, and that the run times would then look very different from what they are currently? Also, my data set unfortunately does not have prices and inventories for all products on all days. I'd appreciate any advice you can provide. Thank you once again.
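As a rough illustration of the missing piece, a pair correlation on top of the same date/product hash-of-hashes layout might look like the sketch below. It assumes a plain Pearson correlation on price, computed only over dates where both products have a record; the thread does not say which correlation measure is wanted, and the sub name `pair_correlation` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Pearson correlation of two products' prices, using only the dates on
# which BOTH products have a record. $data is a hash ref shaped like
# $data->{$date}{$product}{price}.
# Returns nothing (undef) if there are fewer than 2 shared dates or if
# either price series is constant.
sub pair_correlation {
    my ($data, $p1, $p2) = @_;
    my ($n, $sx, $sy, $sxx, $syy, $sxy) = (0) x 6;
    for my $date (keys %$data) {
        my $day = $data->{$date};
        # skip dates where either product is missing
        next unless exists $day->{$p1} && exists $day->{$p2};
        my ($x, $y) = ($day->{$p1}{price}, $day->{$p2}{price});
        $n++;
        $sx  += $x;      $sy  += $y;
        $sxx += $x * $x; $syy += $y * $y;
        $sxy += $x * $y;
    }
    return if $n < 2;
    my $cov = $sxy - $sx * $sy / $n;
    my $vx  = $sxx - $sx * $sx / $n;
    my $vy  = $syy - $sy * $sy / $n;
    return if $vx <= 0 || $vy <= 0;
    return $cov / sqrt($vx * $vy);
}

# Hypothetical sample: banana has no record on 2018-01-03.
my %sample = (
    '2018-01-01' => { apple => { price => 1 }, banana => { price => 2 } },
    '2018-01-02' => { apple => { price => 2 }, banana => { price => 4 } },
    '2018-01-03' => { apple => { price => 9 } },
    '2018-01-04' => { apple => { price => 3 }, banana => { price => 6 } },
);
my $r = pair_correlation(\%sample, 'apple', 'banana');
printf "apple/banana: %s\n", defined $r ? sprintf('%.4f', $r) : 'n/a';
```

Skipping the dates where one product is missing means each pair is measured over its own set of shared dates, so it may be worth recording the number of shared dates alongside each result when comparing pairs.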

      Hi Gtforce,

      it will be very difficult to help you if you don't show your code and the data structures that you're using.

      My guess is that there is something inefficient in the way you're doing it. Probably a hash, or rather a hash of hashes (or possibly a HoHoH), would be more efficient, but there is really no way to tell without seeing your code.