Gtforce has asked for the wisdom of the Perl Monks concerning the following question:

My data is as follows:

2018-01-01 apple 200 50000
2018-01-02 apple 201 60000
2018-01-03 apple 202 70000
2018-01-04 apple 198 80000
2018-01-01 orange 400 30000
2018-01-02 orange 401 35000
2018-01-03 orange 402 36000
2018-01-04 orange 405 28000

where apple and orange are my products (I have over 2,000 distinct products). The values 200, 201, etc. are the prices of those products on the dates in the first column, and the quantities (aka inventories) are in the last column. My data series covers the past 3 years for each of these products.

I need to pass the data series (i.e., price and inventory) for one product at a time to a subroutine that calculates the mean, etc.

I also need to pass the data series for a combination of two products at a time (a pair) to a subroutine that calculates the correlation between the two data series.

I have currently done this using arrays over the 2 million pairs that arise from 2,000 products, and it ran for 4 days before my patience ran out and I terminated the process. I've only just started reading up on hashes, and I think I can speed things up if I get the data series for one product into one hash, and the data series for another product into another hash (and likewise for all 2,000 products/hashes).

The reading I've done so far warns me against using variables for hash names. Any advice you can offer would be greatly appreciated, thanks.
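For illustration, the variable hash names warned against can be avoided entirely with a single hash of hashes keyed by product name (a sketch; the `%series` name and price/qty field names are illustrative, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One hash of hashes, keyed by product name, replaces 2,000
# individually named hashes: $series{$product}{$date} = {...}
my %series;
for my $line (
    "2018-01-01 apple 200 50000",
    "2018-01-02 apple 201 60000",
    "2018-01-03 apple 202 70000",
    "2018-01-04 apple 198 80000",
) {
    my ($date, $product, $price, $qty) = split ' ', $line;
    $series{$product}{$date} = { price => $price, qty => $qty };
}

# One product's whole series passes as a single hash reference.
sub mean_price {
    my ($prod) = @_;
    my $sum = 0;
    $sum += $_->{price} for values %$prod;
    return $sum / keys %$prod;
}

my $avg = mean_price($series{apple});
print "$avg\n";    # 200.25
```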

Update: using the advice on this thread, I was able to get to processing those stats for the 2 million odd pairs in 4hrs and 10 mins, and incidentally, learnt how to use hashes. Thank you

Replies are listed 'Best First'.
Re: creating and managing many hashes
by Discipulus (Abbot) on Feb 18, 2018 at 09:08 UTC
    Hello Gtforce,

    Without more details (is the data already in memory?) it's a matter of guessing, but..

    Pass by reference? Are you filling new data structures to pass your data around? If the original data is already read in, you can take a reference to some items and pass just the reference to your subroutines. See also the same matter at effective perl programmer and Is it possible to do pass by reference in Perl?
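    A minimal sketch of pass by reference (the `mean` sub and data are illustrative): the subroutine receives a reference, not a copy, so nothing is duplicated per call.

```perl
use strict;
use warnings;

# Passing \@prices hands the sub an address, not 2 million copies.
sub mean {
    my ($aref) = @_;            # array reference, cheap to pass
    my $sum = 0;
    $sum += $_ for @$aref;
    return $sum / @$aref;
}

my @prices = (200, 201, 202, 198);
my $m = mean(\@prices);         # pass \@prices, not @prices
print "$m\n";                   # 200.25
```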

    This large amount of data and the relations between its parts make me think of databases: you can have a first step where you put all your data into a (temporary?) database (sqlite for example) and a second step where you just SELECT the appropriate items and update your statistics.

    Be sure to free variables that are no longer used. Four days for just 2 million records seems a bit slow to me.

    If my suggestions do not fit your task please provide a bit more details.



    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Discipulus, good advice regarding freeing up variables that are no longer in use (it was not only slowing processing, but also giving me inaccurate results). undef()'ing them at the right places helped. Memory and CPU are not constraints (not even close to maxing out on either with my current approach; tiny footprint on both using simple arrays and loops). Any advice on speeding up the first process of going from 2,000 products to 2 million pairs would be greatly appreciated, thanks.
        If you are brave enough, you can try parallel processing: try MCE by our brother marioroy

        Search here at perlmonks: he produced many working examples


        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: creating and managing many hashes
by LanX (Sage) on Feb 18, 2018 at 13:20 UTC
    > I have currently done this using arrays over the 2 million pairs that arise from 2,000 products and it ran for 4 days before

    I think your speed problems result from linear searches.

    You may want to organise your data in a nested structure.

    See perldsc And perlref

    Like (draft)

    $h_record = {
        date     => '2018-01-01',
        product  => 'apple',
        price    => 200,
        quantity => 50000,
    };
    $product{apple}{'2018-01-01'} = $h_record;
    $date{'2018-01-01'}{apple}    = $h_record;
    • This way you can easily look up products and dates, and pass records to subroutines.
    • Another hint is to cache and not repeat already done calculations.
    • Last but not least keep an eye on memory consumption, as soon as the system needs to swap heavily, you'll be lost in a black hole. (Only processing chunks may help here)
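    The caching hint could be sketched like this: per-product stats are computed once and reused for every pair instead of 2,000 times each (the `stats_for` name and hash layout are illustrative, not from the thread):

```perl
use strict;
use warnings;

# Cache of per-product stats; computed lazily on first request.
my %stat_cache;
sub stats_for {
    my ($product, $series) = @_;   # $series: hashref of date => price
    return $stat_cache{$product} //= do {
        my $n    = keys %$series;
        my $mean = 0;
        $mean += $_ / $n for values %$series;
        my $var = 0;
        $var += ($_ - $mean)**2 / $n for values %$series;
        { mean => $mean, stddev => sqrt($var) };   # cached hashref
    };
}

my %apple = ('2018-01-01' => 200, '2018-01-02' => 202);
my $s = stats_for(apple => \%apple);
print "$s->{mean} $s->{stddev}\n";    # 201 1
```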

    Please note (like others mentioned) that this approach can't compete with an SQL DB in terms of efficiency.

    Those hashes need lots of memory, whereas SQL manages an index on columns of the same table. (Think of all the hashes you'd need to easily look up year, month and day separately.)

    Also, reading files to rebuild the data structure each run will cost you overhead, whereas an SQL DB is persistent.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: creating and managing many hashes
by Marshall (Canon) on Feb 18, 2018 at 19:57 UTC
    Hi Gtforce,

    I suppose that this flat data file is coming from some fancy corporate sales/inventory DB. This may sound flippant, but buying some dinner and drinks for the person who generated the file for you might yield the most efficient/effective solution! But I guess you have already considered that...

    It sounds conceivable that your data processing could all be done in an Excel Spreadsheet with no Perl programming at all. I haven't done any serious spreadsheet work in years, but spreadsheets can be huge now, 2 million rows is possible.

    Let's talk about Perl:
    You are inexperienced at Perl, and it sounds like you have no SQL experience. However, I believe that a solution that involves learning the "least amount of new stuff" will involve learning a targeted subset of both Perl and SQL. Using an SQLite DB will simplify the data structures that the Perl code has to work with (less fancy Perl to learn). I believe that learning basic DBI will simplify your Perl code.

    SQLite is the most used DB in the world because it is on every smartphone. SQLite doesn't require any fancy server setup and admin - it uses a simple file for its work, so huge admin hassles just disappear. You will need to learn how to create tables, insert new records, and select (i.e., get) records from the DB. Only a very, very small subset of SQL needs to be learned. For the Perl interface with SQLite, you will need to learn a subset of Perl data structures. I recommend only one: how to handle an AoA, a 2D array structure, or a reference to such a thing. Don't start by learning everything; just learn this fetchall_arrayref() function well.

    From what I see so far, a basic idea could be:

    • Create SQlite DB table(s) for Input Data and Stat Table
    • Import Data - This runs at maybe 50,000 records per second. Should be run as a single transaction.
    • get list of unique product names - one SQL command.
    • foreach product, calculate stats and put in Stat Table - slower than import, but fast
    • generate the 2 million combinations - fast
    • Run each combo, results appear at 10,000 per second or faster
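    The import and select steps above could be sketched with DBI like this (an illustrative sketch assuming DBD::SQLite is installed; the `prices` table and its column names are made up, and an in-memory DB stands in for the file):

```perl
use strict;
use warnings;
use DBI;

# Connect to a throwaway in-memory SQLite DB (use dbname=products.db
# for a persistent file instead).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE prices
          (date TEXT, product TEXT, price REAL, qty INTEGER)');

# Import as a single transaction - this is what makes bulk insert fast.
$dbh->begin_work;
my $ins = $dbh->prepare('INSERT INTO prices VALUES (?,?,?,?)');
$ins->execute(split ' ') for
    "2018-01-01 apple 200 50000",
    "2018-01-01 orange 400 30000";
$dbh->commit;

# One SQL command for the distinct product list.
my $products = $dbh->selectcol_arrayref(
    'SELECT DISTINCT product FROM prices ORDER BY product');
print "@$products\n";    # apple orange
```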
    The idea of this running for 4 hours is insane. Something is seriously wrong if this doesn't run in <4 minutes.
Re: creating and managing many hashes
by 1nickt (Abbot) on Feb 18, 2018 at 12:03 UTC

    It would be completely crazy to do this without using a database. See for example DBD::SQLite (a Perl package containing both the SQLite engine and the Perl driver).

    The way forward always starts with a minimal test.

      My current approach works off flat files, and I think I need just two asynchronous processes resulting in two flat files. The first process looks at the products and works out the distinct pairs (i.e., going from 2,000 distinct products to 2 million distinct pairs).

      pair 1: apple orange
      pair 2: apple banana
      pair 3: apple grape
      pair 4: orange banana
      pair 5: orange grape
      pair 6: banana grape

      The above flat file serves as the input to the second process which looks at the price and inventory for each product and pair to work out the stats. The end result of this second process is written into the final flat file.

      My tryst with Perl, and programming in general, is about 3 months old. The approach I'm taking looks neat and simple from a solution perspective to me (coding it, however, "feels" a little different, but I am keen to put in the effort to keep the coding simple too). My reluctance to use an RDBMS is that I'd probably have to teach myself the how-to, especially for when things break and fall apart.

        To give us some more detail run this code against your data and post the output summary (not the report.txt file)

        #!/usr/bin/perl
        use strict;
        use warnings;
        my $t0 = time();
        my $infile  = 'products.txt';
        my %data    = ();
        my %total   = ();
        my $records = 0;
        open IN, '<', $infile or die "Could not open $infile $!";
        while (<IN>){
            my ($date, $product, $price, $qu) = split /\s+/, $_;
            $data{$date}{$product}{'price'} = $price;
            $data{$date}{$product}{'qu'}    = $qu;
            $total{$product}{'count'}        += 1;
            $total{$product}{'price'}{'sum'} += $price;
            $total{$product}{'qu'}{'sum'}    += $qu;
            ++$records;
        }
        close IN;

        # calculate stats
        my $outfile = 'report.txt';
        open OUT, '>', $outfile or die "Could not open $outfile";
        for my $prod (keys %total){
            my $count = $total{$prod}{'count'};
            # mean
            $total{$prod}{'price'}{'mean'} = $total{$prod}{'price'}{'sum'}/$count;
            $total{$prod}{'qu'}{'mean'}    = $total{$prod}{'qu'}{'sum'}/$count;
            # std dev squared
            my ($sum_x2, $sum_y2);
            for my $date (keys %data){
                my $x = $data{$date}{$prod}{'price'} - $total{$prod}{'price'}{'mean'};
                $sum_x2 += ($x*$x);
                my $y = $data{$date}{$prod}{'qu'} - $total{$prod}{'qu'}{'mean'};
                $sum_y2 += ($y*$y);
            }
            $total{$prod}{'price'}{'stddev'} = sprintf "%.4f", sqrt($sum_x2/$count);
            $total{$prod}{'qu'}{'stddev'}    = sprintf "%.4f", sqrt($sum_y2/$count);
            my $line = join "\t", $prod,
                $total{$prod}{'price'}{'mean'},
                $total{$prod}{'price'}{'stddev'},
                $total{$prod}{'qu'}{'mean'},
                $total{$prod}{'qu'}{'stddev'};
            print OUT $line."\n";
        }
        close OUT;

        # summary
        my $dur = time - $t0;
        printf "\nProducts : %d\nDates    : %d\nRecords  : %d\nRun Time : %d s",
            0+keys %total, 0+keys %data, $records, $dur;

        Update - code to create a 75MB test file

        open OUT, '>', 'products.txt' or die "$!";
        my @d = (0,31,28,31,30,31,30,31,31,30,31,30,31);
        for my $p ('0001'..'2000'){
            my $product = "product_$p";
            for my $y (2015..2017){
                $d[2] = ($y % 4) ? 28 : 29;
                for my $m (1..12){
                    for my $d (1..$d[$m]){
                        my $date  = sprintf "%04d-%02d-%02d", $y, $m, $d;
                        my $price = int rand(500);
                        my $qu    = int rand(90_000);
                        print OUT "$date\t$product\t$price\t$qu\n";
                    }
                }
            }
        }
        close OUT;

        On my i5 desktop it takes about 5 seconds to correlate the price of 1 product against the other 1,999. I guess 2 million pairs would take less than 2 hours.

        Hi Gtforce,

        it will be very difficult to help you if you don't show your code and the data structures that you're using.

        My guess is that there is something inefficient in the way you're doing it. Probably a hash, or rather a hash of hashes (or possibly a HoHoH), would be more efficient, but there is really no way to tell without seeing your code.

Re: creating and managing many hashes
by Anonymous Monk on Feb 18, 2018 at 14:34 UTC
    My immediate thought is that you should load these data into an SQLite database (file), then use a real statistics package such as R or even a spreadsheet to do the rest of the work. The statistical processes that you describe are familiar ones and the number-of-observations is actually considered very small. Custom-programming is a very inefficient and largely unnecessary way to approach such a task: you're building a "one-off solution" to a universal task.
      Fully agree with that. If you want to stay in Perl-land and also get stats and performance without doing much work, take a serious look at PDL::Stats.
Re: creating and managing many hashes
by Cristoforo (Curate) on Feb 20, 2018 at 03:10 UTC
    What formula do you use to calculate the correlation between the pairs? Does it involve the stddev, mean?

      Definitely involves mean and stddev. Hopefully the snippet shows how I'm trying to do this, but do let me know if there is another way, thanks.

      foreach my $subrow (keys %pair1data) {
          $zee1 = ( ( $pair1data{$subrow} - $submean1 ) / $substddev1 );
          $zee2 = ( ( $pair2data{$subrow} - $submean2 ) / $substddev2 );
          $sigma += ( $zee1 * $zee2 );
          $datasetcounter ++;
      }
      $r = ( $sigma / ( $datasetcounter - 1 ) );
      return $r;
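      For reference, a self-contained version of the same calculation, folding the mean and stddev steps into one Pearson correlation sub (the names and sample data are illustrative; this uses the population normalization rather than the snippet's n-1, so it differs slightly on small samples):

```perl
use strict;
use warnings;

# Pearson correlation of two series stored as date-keyed hashes.
sub correlation {
    my ($s1, $s2) = @_;
    my @dates = grep { exists $s2->{$_} } keys %$s1;   # shared dates only
    my $n = @dates;
    my ($m1, $m2) = (0, 0);
    $m1 += $s1->{$_} / $n for @dates;
    $m2 += $s2->{$_} / $n for @dates;
    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $d (@dates) {
        my $x = $s1->{$d} - $m1;
        my $y = $s2->{$d} - $m2;
        $sxy += $x * $y;
        $sxx += $x * $x;
        $syy += $y * $y;
    }
    return $sxy / sqrt($sxx * $syy);
}

my %apple  = ('2018-01-01' => 200, '2018-01-02' => 201, '2018-01-03' => 202);
my %orange = ('2018-01-01' => 400, '2018-01-02' => 401, '2018-01-03' => 402);
my $r = correlation(\%apple, \%orange);
print "$r\n";    # 1 (perfectly correlated)
```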
Re: creating and managing many hashes
by Gtforce (Sexton) on Feb 19, 2018 at 09:39 UTC

    An update: I spent the day reading about and playing with hashes, and decided to apply them to the problem I had.

    Satisfied with the results: I now have a constant 20% single-core utilization (approx.) and a peak 1.5GB memory footprint.

    $>perl
    Distinct products: 2083
    Number of pairs (incl. duplicates): 4,336,806
    Number of pairs (excl. duplicates): 2,168,403
    Started processing at 14:04:49
    Finished processing at 14:07:24
    Finished writing to disk at 14:07:43
    $>

    PS: This is to determine the pairs only (haven't run the stats using the pairs yet which is the next step).

      This is to determine the pairs only

      That should take seconds, not minutes.

      #!/usr/bin/perl
      use strict;
      my $t0 = time;
      my @f = map { sprintf 'product_%04d', $_ } 1..2083;
      my $count = 0;
      open OUT, '>', 'pairs.txt' or die "$!";
      while (my $x = shift @f){
          for my $y (@f){
              ++$count;
              print OUT "$count\t$x\t$y\n";
          }
      }
      close OUT;
      my $dur = time - $t0;
      print "$count in $dur s\n"; # 19s on raspberryPI

        poj - that is amazing (took 3 seconds)!!

        Will do a post-mortem now, thanks a ton.

      I'm afraid that only telling us about "results" without showing code will frustrate many and won't help you improve your skills.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Wikisyntax for the Monastery