Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re^3: creating and managing many hashes

by poj (Abbot)
on Feb 18, 2018 at 14:35 UTC ( #1209441=note: print w/replies, xml ) Need Help??


in reply to Re^2: creating and managing many hashes
in thread creating and managing many hashes

To give us some more detail run this code against your data and post the output summary (not the report.txt file)

#!/usr/bin/perl use strict; use warnings; my $t0 = time(); my $infile = 'products.txt'; my %data = (); my %total = (); my $records = 0; open IN,'<',$infile or die "Could not open $infile $!"; while (<IN>){ my ($date, $product, $price, $qu) = split /\s+/,$_; $data{$date}{$product}{'price'} = $price; $data{$date}{$product}{'qu'} = $qu; $total{$product}{'count'} += 1; $total{$product}{'price'}{'sum'} += $price; $total{$product}{'qu'}{'sum'} += $qu; ++$records; } close IN; # calculate stats my $outfile = 'report.txt'; open OUT,'>',$outfile or die "Could not open $outfile"; for my $prod (keys %total){ my $count = $total{$prod}{'count'}; # mean $total{$prod}{'price'}{'mean'} = $total{$prod}{'price'}{'sum'}/$coun +t; $total{$prod}{'qu'}{'mean'} = $total{$prod}{'qu'}{'sum'}/$count; # std dev squared my ($sum_x2,$sum_y2); for my $date (keys %data){ my $x = $data{$date}{$prod}{'price'} - $total{$prod}{'price'}{'mea +n'}; $sum_x2 += ($x*$x); my $y = $data{$date}{$prod}{'qu'} - $total{$prod}{'qu'}{'mean'}; $sum_y2 += ($y*$y); } $total{$prod}{'price'}{'stddev'} = sprintf "%.4f",sqrt($sum_x2/$coun +t); $total{$prod}{'qu'}{'stddev'} = sprintf "%.4f",sqrt($sum_y2/$coun +t); my $line = join "\t",$prod, $total{$prod}{'price'}{'mean'}, $total{$prod}{'price'}{'stddev'}, $total{$prod}{'qu'}{'mean'}, $total{$prod}{'qu'}{'stddev'}; print OUT $line."\n"; } close OUT; # summary my $dur = time - $t0; printf " Products : %d Dates : %d Records : %d Run Time : %d s",0+keys %total, 0+keys %data, $records, $dur;

Update - code to create a 75MB test file

open OUT,'>','products.txt' or die "$!"; my @d = (0,31,28,31,30,31,30,31,31,30,31,30,31); for my $p ('0001'..'2000'){ my $product = "product_$p"; for my $y (2015..2017){ $d[2] = ($y % 4) ? 28 : 29; for my $m (1..12){ for my $d (1..$d[$m]){ my $date = sprintf "%04d-%02d-%02d",$y,$m,$d; my $price = int rand(500); my $qu = int rand(90_000); print OUT "$date\t$product\t$price\t$qu\n"; } } } } close OUT;

On my i5 desktop it takes about 5 seconds to correlate the price of 1 product against the other 1999. I guess 2 million pairs would be less than 2 hours

poj

Replies are listed 'Best First'.
Re^4: creating and managing many hashes
by Gtforce (Sexton) on Feb 24, 2018 at 13:17 UTC
    Products : 2008 Dates : 530 Records : 867434 Run Time : 57 s

    Thanks, poj. I'm new to hashes. The snippet of code you've provided, calculates the mean and std dev, but not the pair correlation. Am I correct in assuming that this bit will need to be built in and consequently the run times would look very different to what it does currently? Also, my data set unfortunately does not prices and inventories for all products on all days. I'd appreciate any advice you can provide. Thank you once again.

      You would need to detail the correlation algorithm you want to use. The simple one I tried was this where $v would be either 'price' or 'qu'. This assumes for any date you have the full data, so you need to decide how to handle the dates where you don't have 2 prices or qu to correlate.

      sub correlate { my ($p1,$p2,$v) = @_; return if $p1 eq $p2; my ($xy,$x2,$y2); for my $date (sort keys %data){ my $x = $data{$date}{$p1}{$v} - $total{$p1}{$v}{'mean'} ; my $y = $data{$date}{$p2}{$v} - $total{$p2}{$v}{'mean'} ; $xy += $x * $y; $x2 += $x * $x; $y2 += $y * $y; } my $cor = $xy / sqrt($x2 * $y2); print "$p1 $p2 $v $cor\n"; }

      Update - after seeing Re^2: creating and managing many hashes

      poj

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1209441]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (2)
As of 2021-06-14 05:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)












    Results (60 votes). Check out past polls.

    Notices?