Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^2: creating and managing many hashes

by Gtforce (Sexton)
on Feb 18, 2018 at 12:48 UTC ( #1209429=note: print w/replies, xml ) Need Help??


in reply to Re: creating and managing many hashes
in thread creating and managing many hashes

My current approach is going off flat files, and I think I need just two asynchronous processes resulting in two flat files. The first process looks at the products and works out the distinct pairs (i.e., going from 2,000 distinct products to 2 million distinct pairs).

pair 1: apple orange pair 2: apple banana pair 3: apple grape pair 4: orange banana pair 5: orange grape pair 6: banana grape

The above flat file serves as the input to the second process which looks at the price and inventory for each product and pair to work out the stats. The end result of this second process is written into the final flat file.

My tryst with perl and programming in general is about 3 months. The approach I'm taking looks neat and simple from a solution perspective to me (coding it however "feels" a little different, but am keen to put in the effort to keep the coding also simple). My reluctance to use an rdbms is that I'd probably have to teach myself "the how-to" especially when things break and fall apart.

Replies are listed 'Best First'.
Re^3: creating and managing many hashes
by poj (Abbot) on Feb 18, 2018 at 14:35 UTC

    To give us some more detail run this code against your data and post the output summary (not the report.txt file)

    #!/usr/bin/perl use strict; use warnings; my $t0 = time(); my $infile = 'products.txt'; my %data = (); my %total = (); my $records = 0; open IN,'<',$infile or die "Could not open $infile $!"; while (<IN>){ my ($date, $product, $price, $qu) = split /\s+/,$_; $data{$date}{$product}{'price'} = $price; $data{$date}{$product}{'qu'} = $qu; $total{$product}{'count'} += 1; $total{$product}{'price'}{'sum'} += $price; $total{$product}{'qu'}{'sum'} += $qu; ++$records; } close IN; # calculate stats my $outfile = 'report.txt'; open OUT,'>',$outfile or die "Could not open $outfile"; for my $prod (keys %total){ my $count = $total{$prod}{'count'}; # mean $total{$prod}{'price'}{'mean'} = $total{$prod}{'price'}{'sum'}/$coun +t; $total{$prod}{'qu'}{'mean'} = $total{$prod}{'qu'}{'sum'}/$count; # std dev squared my ($sum_x2,$sum_y2); for my $date (keys %data){ my $x = $data{$date}{$prod}{'price'} - $total{$prod}{'price'}{'mea +n'}; $sum_x2 += ($x*$x); my $y = $data{$date}{$prod}{'qu'} - $total{$prod}{'qu'}{'mean'}; $sum_y2 += ($y*$y); } $total{$prod}{'price'}{'stddev'} = sprintf "%.4f",sqrt($sum_x2/$coun +t); $total{$prod}{'qu'}{'stddev'} = sprintf "%.4f",sqrt($sum_y2/$coun +t); my $line = join "\t",$prod, $total{$prod}{'price'}{'mean'}, $total{$prod}{'price'}{'stddev'}, $total{$prod}{'qu'}{'mean'}, $total{$prod}{'qu'}{'stddev'}; print OUT $line."\n"; } close OUT; # summary my $dur = time - $t0; printf " Products : %d Dates : %d Records : %d Run Time : %d s",0+keys %total, 0+keys %data, $records, $dur;

    Update - code to create a 75MB test file

    open OUT,'>','products.txt' or die "$!"; my @d = (0,31,28,31,30,31,30,31,31,30,31,30,31); for my $p ('0001'..'2000'){ my $product = "product_$p"; for my $y (2015..2017){ $d[2] = ($y % 4) ? 28 : 29; for my $m (1..12){ for my $d (1..$d[$m]){ my $date = sprintf "%04d-%02d-%02d",$y,$m,$d; my $price = int rand(500); my $qu = int rand(90_000); print OUT "$date\t$product\t$price\t$qu\n"; } } } } close OUT;

    On my i5 desktop it takes about 5 seconds to correlate the price of 1 product against the other 1999. I guess 2 million pairs would be less than 2 hours

    poj
      Products : 2008 Dates : 530 Records : 867434 Run Time : 57 s

      Thanks, poj. I'm new to hashes. The snippet of code you've provided, calculates the mean and std dev, but not the pair correlation. Am I correct in assuming that this bit will need to be built in and consequently the run times would look very different to what it does currently? Also, my data set unfortunately does not prices and inventories for all products on all days. I'd appreciate any advice you can provide. Thank you once again.

        You would need to detail the correlation algorithm you want to use. The simple one I tried was this where $v would be either 'price' or 'qu'. This assumes for any date you have the full data, so you need to decide how to handle the dates where you don't have 2 prices or qu to correlate.

        sub correlate { my ($p1,$p2,$v) = @_; return if $p1 eq $p2; my ($xy,$x2,$y2); for my $date (sort keys %data){ my $x = $data{$date}{$p1}{$v} - $total{$p1}{$v}{'mean'} ; my $y = $data{$date}{$p2}{$v} - $total{$p2}{$v}{'mean'} ; $xy += $x * $y; $x2 += $x * $x; $y2 += $y * $y; } my $cor = $xy / sqrt($x2 * $y2); print "$p1 $p2 $v $cor\n"; }

        Update - after seeing Re^2: creating and managing many hashes

        poj
Re^3: creating and managing many hashes
by Laurent_R (Canon) on Feb 18, 2018 at 13:09 UTC
    Hi Gtforce,

    it will be very difficult to help you if you don't show your code and the data structures that you're using.

    My guess is that there is something inefficient in the way you're doing it. Probably a hash, or rather a hash of hashes (or possibly a HoHoH), would be more efficient, but there is really no way to tell without seeing your code.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1209429]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (4)
As of 2021-06-12 23:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)












    Results (53 votes). Check out past polls.

    Notices?