http://www.perlmonks.org?node_id=434973

rupesh has asked for the wisdom of the Perl Monks concerning the following question:


Greetings! I seek the help of fellow monks once again. This is with regard to a friend of mine, who gave me a csv containing a database dump.
Below is a sample of the dump.
Date Product Quantity Imported_From 1/2/04 Shirts 32 Australia 1/9/04 Shoes 234 Asia 2/12/04 Caps 109 UK 4/4/04 Shoes 6 Asia 4/4/04 Shirts 12 Australia 5/6/04 Shirts 398 Australia ..... ...... There are only 3 regions. One or more products can be imported from one or many regions on a sin +gle date. No single column is unique. There are many products, at the last count, it was 34.
What is needed is a summary of the report on a monthly basis. This would help in analyzing import-sales ratio for every month. (There is a similar dump for sales. That is a simpler one which does not include the 'region' field.) Output monthly report should look like:
Jan Product Asia Australia UK Total_Imported Shirts 178 12 744 934 Shoes ... ... (products whose total is zero can be omitted) Feb Product Asia Australia UK Total_Imported ....
I've worked on this for more than a couple of hours now using hashes and makeing '$date_$product' as key and quantity as its value. Using more hashes to seperate the regions. I've also tried to write to temp files and re-creating them for each month.

Is there a simple way to get the whole thing done in a single loop?
I'd appreciate any comments/thoughts from you folks.

Many thanks!
Rupesh.

Replies are listed 'Best First'.
Re: (Complex)Data Manipulation
by NetWallah (Canon) on Feb 28, 2005 at 06:27 UTC
    I think what you are looking for is a AOHOH, but my preference would be to let the database (SQL) do the aggregation and grouping.

    The AOHOH would be:

    my @db; $db[$monthNumber]->{ProductName}{RegionName} = $Quantity; # Loop through Month.. for (0..11){ #Cycle through Months, assuming zero-based next unless my %prod = %{$db[$_]}; # Any data there ? print "Some MONTH $_ header here\n"; foreach (sort keys %prod){ my %region = %{$prod{$_}}; # $region{RegionName} now has the qty to be printed # Hopefully, you get the idea now... } }
    Needless to say, the code above is untested..

        ..."I don't know what the facts are but somebody's certainly going to sit down with him and find out what he knows that they may not know, and make sure he knows what they know that he may not know, and that's a good thing. I think it's a very constructive exchange," --Donald Rumsfeld

      Oracle SQL-wise its pretty easy:

      select product, sum(decode(region, 'asia', quantity,0)) as "Asia", sum( decode(region, 'australia',quantity,0)) as "Australia", sum( decode(region, 'UK',quantity,0)) as "UK", sum(quantity) as "Total Imported"
      from whateverthetablenameis
      group by product

      AKA: Pivot report.

      AOHOH was what I was looking for.
      Even though I'm done with the job, I'm benchmarking all the other algorithms that fellow monks have provided.
      Many Thanks (Again)!
      Cheers,
      Rupesh.
Re: (Complex)Data Manipulation
by saintmike (Vicar) on Feb 28, 2005 at 06:56 UTC
    I've worked on this for more than a couple of hours now using hashes and makeing
    Any reason you didn't post the code you created so we could point out which parts need improvement?

    Anyway, you're gathering data in three dimensions: By month, within the month by product, within the product by country. This cries for a multi-dimensional hash.

    Here's an example grabbing the input with a regular expression, setting up a hash and printing out an incomplete result, but you should be able to fill in the gaps:

    my $data = join '', <DATA>; my $by_month = {}; while($data =~ /^(\d+).*?(\w+) +(\d+) +(\w+)/mg) { print "Capturing $1 $2 $3 $4\n"; $by_month->{$1}->{$2}->{$4} += $3; } for my $month (1..12) { print "Month: $month\n"; for my $product (qw(Shirts Shoes Caps)) { printf "%7s ", $product; for my $country (qw(Asia Australia UK)) { printf "%6d ", $by_month->{$month}->{$product}->{$country} || 0; } print "\n"; } } __DATA__ 1/2/04 Shirts 32 Australia 1/9/04 Shoes 234 Asia 2/12/04 Caps 109 UK 4/4/04 Shoes 6 Asia 4/4/04 Shirts 12 Australia 5/6/04 Shirts 398 Australia
    which prints out the following:
    Month: 1 Shirts 0 32 0 Shoes 234 0 0 Caps 0 0 0 Month: 2 Shirts 0 0 0 Shoes 0 0 0 Caps 0 0 109 Month: 3 Shirts 0 0 0 Shoes 0 0 0 Caps 0 0 0 Month: 4 Shirts 0 12 0 Shoes 6 0 0 Caps 0 0 0
Re: (Complex)Data Manipulation
by TedPride (Priest) on Feb 28, 2005 at 07:52 UTC
    Very neat, though it assumes a rather precise format. My solution is more adaptive, if a good deal more messy:
    @mon = ('','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct' +,'Nov','Dec'); $_ = <DATA>; chomp; $n{$_} = $c++ for split /\t/; for (<DATA>) { chomp; split /\t/; @date = split /\//, $_[$n{'Date'}]; $product = $_[$n{'Product'}]; $country = $_[$n{'Imported_From'}]; $quantity = $_[$n{'Quantity'}]; $countries{$country} = () if !exists $countries{$country}; $totals{$date[2]}{$date[0]}{$product}{$country} += $quantity; } @countries = sort keys %countries; for (sort {$a <=> $b} keys %totals) { $year = $totals{$_}; for (sort {$a <=> $b} keys %$year) { $month = $year->{$_}; print "$mon[$_]\n"; print join("\t", 'Product', @countries, 'Total'), "\n"; for (sort keys %$month) { print; $product = $month->{$_}; $total = 0; for (@countries) { print "\t".$product->{$_}; $total += $product->{$_}; } print "\t$total\n"; } print "\n"; } } __DATA__ Date Product Quantity Imported_From 1/2/04 Shirts 32 Australia 1/9/04 Shoes 234 Asia 2/12/04 Caps 109 UK 4/4/04 Shoes 6 Asia 4/4/04 Shirts 12 Australia 5/6/04 Shirts 398 Australia
    You can have any number of fields in any order, and the records can also be out of order. You're not limited to three countries either. Results are given in tab delimited format, perhaps not the most pretty thing to use, but I didn't feel like messing with prints after spending so much time writing the algorithm.
Re: (Complex)Data Manipulation
by holli (Abbot) on Feb 28, 2005 at 10:26 UTC
    ...containing a database dump.
    Why, if I may ask, donīt you use the means of the database to do grouping and reporting? That would be much more natural to me.


    holli, /regexed monk/