
Efficient use of memory

by K_M_McMahon (Hermit)
on Jun 04, 2005 at 06:27 UTC

K_M_McMahon has asked for the wisdom of the Perl Monks concerning the following question:

Hey Monks,
I created a script to read generic spacecraft data files in CSV format. I wanted to validate some of our other software's calculations of Min/Mean/Max/StdDev/NumOfPoints, and I wanted the script to work generically on any of our data files without knowing ahead of time how many columns there are going to be.

The script below does exactly what I want, and all of the data validates correctly. On most of my computers (Linux and WinXP) the script processes a 30 MB version of the sample data (included below) in under a minute. But I have one machine with only 128 MB of RAM, and on it the script still hasn't completed after 30 minutes.

Any suggestions on how to make it more efficient in its use of memory? I know my method of populating my hashes of arrays seems long-winded, but I couldn't think of another way to do it...

Script
#!/usr/bin/perl -w
use strict;
use Statistics::Basic::Mean;
use Statistics::Basic::StdDev;

my $filename = '224_APID003_report.csv';
my (%HoA, %hash_keys);

open(READ_IN, "<$filename") or die "I can't open $filename to read.\n";
while (<READ_IN>) {
    chomp;
    create_hash($_) if /^Year/;
    pop_hash($_)    if /^\d{4},/;
}
close(READ_IN);

foreach (sort { $a <=> $b } keys(%hash_keys)) {
    next if $hash_keys{$_} =~ m/(?:TIME|YEAR)/i;
    my $count   = @{ $HoA{$hash_keys{$_}} };
    my $pointer = \@{ $HoA{$hash_keys{$_}} };
    my @hi_low  = sort { $a <=> $b } @{ $HoA{$hash_keys{$_}} };
    my $low     = shift(@hi_low);
    my $hi      = pop(@hi_low);
    my $mean    = Statistics::Basic::Mean->new($pointer)->query;
    my $stddev  = Statistics::Basic::StdDev->new($pointer)->query;
    print "$hash_keys{$_}: MIN:($low) MEAN:($mean) MAX:($hi) STDEV:($stddev) POINTS:($count)\n";
}

################################
# Subroutines
################################
sub create_hash {
    my @columns = split(/,/, shift);
    my $i = 0;
    foreach (@columns) {
        $i++;
        $hash_keys{$i} = $_;
        $HoA{$_} = ();
    }
}

sub pop_hash {
    my @values = split(/,/, shift);
    foreach (sort { $a <=> $b } keys(%hash_keys)) {
        push(@{ $HoA{$hash_keys{$_}} }, shift(@values));
    }
}
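
As a rough illustration of where the memory goes (an aside, assuming the Devel::Size CPAN module is available; it is not part of the script above), each value pushed onto an array costs far more than the 8 bytes of the underlying double:

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(total_size);   # assumed installed; not in the original script

# One column's worth of numeric scalars: the per-element overhead dwarfs
# the 8-byte double itself, and the script above keeps ~40 such columns.
my @column = map { $_ * 0.001 } 1 .. 100_000;
printf "100,000 values: %d bytes total (~%d bytes each)\n",
    total_size(\@column), total_size(\@column) / @column;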



Sample Data
Year,S/C Time,224_P003STIME,224_P003PVNO,224_P003PCKT,224_P003SHDF,224_P003ID,224_P003SEGF,224_P003SCNT,224_P003PLEN,224_P003STIME,224_P003MTIME,224_MCDHANSGND,224_MCDH5VSVOLT,224_MCDH5VSCUR,224_MCDH33VSVOLT,224_MCDH33VSCUR,224_MCDH25VSVOLT,224_MCDH25VSCUR,224_PBUSURBVOLT,224_PEPURLCUR,224_PBATV1,224_PBATFIVOLT,224_PBATCUR,224_PBATCPOL,224_PBATCPOLV,224_PEPSAVOLT,224_PEPSACUR,224_PBATHV1,224_PBATHV2,224_PBATHV3,224_PBATHV4,224_PBATHV5,224_PBATHV6,224_PEP5VBM,224_PBATV2,224_AMAGCUR,224_MCDH21VRVOLT,224_XRCVAGCGS,224_XRCVCLSTR,224_XRCVRFPS
2005,115-00:00:00.095,05-115-00:00:00.095,0,0,1,3,3,11466,35,05-115-00:00:00.095,04-115-23:59:50.095,0.000000,4.961763,0.496248,3.320780,0.094080,2.519886,0.037647,6.983401,1.290000,0.247917,5.830000,0.045000,49,0.735000,7.443989,1.575000,0.125373,0.125373,0.125373,0.125373,0.125373,0.125373,5.273542,0.000000,0.138108,2.090138,0.019593,0.019593,20.753000
2005,115-00:00:01.028,05-115-00:00:01.028,0,0,1,3,3,11467,35,05-115-00:00:01.028,04-115-23:59:51.028,0.000000,4.961763,0.496248,3.320780,0.094080,2.519886,0.037647,7.018670,1.320000,0.247917,5.830000,0.045000,49,0.735000,7.443989,1.575000,0.125373,0.125373,0.125373,0.125373,0.125373,0.125373,5.273542,0.000000,0.137260,2.090138,0.019593,0.019593,20.753000
2005,115-00:00:02.028,05-115-00:00:02.028,0,0,1,3,3,11468,35,05-115-00:00:02.028,04-115-23:59:52.028,0.000000,4.961763,0.496248,3.320780,0.094080,2.519886,0.037647,7.018670,1.290000,0.247917,5.830000,0.045000,49,0.735000,7.443989,1.575000,0.125373,0.125373,0.125373,0.125373,0.125373,0.125373,5.273542,0.000000,0.137260,2.090138,0.019593,0.019593,20.753000
2005,115-00:00:03.036,05-115-00:00:03.036,0,0,1,3,3,11469,35,05-115-00:00:03.036,04-115-23:59:53.036,0.000000,4.961763,0.496248,3.320780,0.094080,2.519886,0.037647,7.018670,1.275000,0.247917,5.830000,0.045000,49,0.735000,7.443989,1.575000,0.125373,0.125373,0.125373,0.125373,0.125373,0.125373,5.273542,0.000000,0.138108,2.090138,0.019593,0.019593,20.753000
2005,115-00:00:04.094,05-115-00:00:04.094,0,0,1,3,3,11470,35,05-115-00:00:04.094,04-115-23:59:54.094,0.000000,4.961763,0.496248,3.320780,0.094080,2.519886,0.037647,7.018670,1.290000,0.247917,5.830000,0.045000,49,0.735000,7.443989,1.575000,0.125373,0.125373,0.125373,0.125373,0.125373,0.125373,5.273542,0.000000,0.137260,2.090138,0.019593,0.019593,20.753000
Thanx!


-Kevin
my $a='62696c6c77667269656e6440676d61696c2e636f6d'; while ($a=~m/(^.{2})/s) {print unpack('A',pack('H*',"$1"));$a=~s/^.{2}//s;}

Replies are listed 'Best First'.
Re: Efficient use of memory (1/7th the memory requirement)
by BrowserUk (Patriarch) on Jun 04, 2005 at 11:09 UTC

    By accumulating your float values as strings of packed doubles rather than arrays, 100,000 doubles require only 800 KB instead of ~2.5 MB. Multiply that by the 36 vectors in your sample dataset and you reduce the memory requirement for processing 100,000 lines (35 MB) from 350 MB to 50 MB, with no loss of performance. It should now run on your 128 MB machine easily, without swapping.
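
    A minimal sketch of the packed-string idea (the actual tweaked and modified scripts are in the collapsed sections below; this sketch is an illustration, not BrowserUk's code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Store each column as one string of packed doubles: 8 bytes per value,
    # appended in place, instead of a full Perl scalar per value.
    my %packed;    # column name => string of packed doubles

    sub add_value {
        my ($col, $value) = @_;
        $packed{$col} .= pack 'd', $value;    # append one 8-byte double
    }

    sub column_values {
        my ($col) = @_;
        return unpack 'd*', $packed{$col};    # expand only when stats are needed
    }

    add_value('224_PBATV1', $_) for 1.29, 1.32, 1.29, 1.275, 1.29;
    my @vals = sort { $a <=> $b } column_values('224_PBATV1');
    printf "POINTS:(%d) MIN:(%g) MAX:(%g)\n", scalar @vals, $vals[0], $vals[-1];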

    Original (tweaked) + results

    Modified + results


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Efficient use of memory
by ivancho (Hermit) on Jun 04, 2005 at 07:07 UTC
    your data structures confuse me.. sorry, it's a late evening..
    Storing everything in memory seems bad if you only need the mean, variance, and such. Why don't you check out Statistics::Descriptive? It allows you to store things sparsely (i.e., only their main statistical properties, rather than all the datapoints).
    Update: Of course, it does that by doing arithmetic on each add, but that should be negligible compared to the memory savings. It might slow you down if you had 100 million rows, but think of what those would do to your memory...

    Thus, when you parse your "Year" line, you know which variables you want the stats for: create one Statistics::Descriptive object for each, and from there on just add datapoints from each split. Also, I'd rather use an array; something like:

    #!/usr/bin/perl -lw
    use strict;
    use Statistics::Descriptive;

    my $filename = '224_APID003_report.csv';
    open(READ_IN, "<$filename") or die "I can't open $filename to read.\n";

    my @idxs;
    my %vars;
    my @names;
    while (<READ_IN>) {
        chomp;
        /^Year/ && do {
            @names = split /,/;
            @idxs  = grep { $names[$_] !~ /TIME|YEAR/i } (0 .. @names - 1);
            $vars{$_} = Statistics::Descriptive::Sparse->new() for @names[@idxs];
        };
        /^\d{4}/ && do {
            my @values = split /,/;
            $vars{$names[$_]}->add_data($values[$_]) for @idxs;
        };
    }
    close READ_IN or die $!;

    foreach (keys %vars) {
        printf "%20s: mean = %10.4f, var = %10.4f\n",
            $_, $vars{$_}->mean(), $vars{$_}->variance();
    }

    btw, this is not tested, I might be writing rubbish...

    Update: Tested, corrected, prettified, added all the details. I hope it works for you. The other Sparse methods of Statistics::Descriptive seem to cover everything you need.
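
    For instance (a usage sketch; the accessor names are from Statistics::Descriptive's documented Sparse interface), the final report loop above could be extended to:

    # After the read loop, each column's full summary is available
    # even though no individual datapoints were stored:
    for my $name (@names[@idxs]) {
        my $s = $vars{$name};
        printf "%20s: n=%d min=%g max=%g mean=%g sd=%g\n",
            $name, $s->count, $s->min, $s->max, $s->mean,
            $s->standard_deviation;
    }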

    Update 2: I am aware that the excessive use of $_ throughout this piece of code makes it harder to read than expanding all the loops would. On the other hand, I think I'm way too attached to grep, map, and postfix for, with their terseness... I might eventually write a meditation about Perl and the Bulgarian language.

Re: Efficient use of memory
by salva (Canon) on Jun 04, 2005 at 11:17 UTC
    You are overcomplicating things a lot. Some advice:

    Use meaningful names for your vars and subroutines. %HoA, %hash_keys, $pointer, pop_hash, and create_hash are very bad names because they say nothing about the data they hold or what they do.

    Don't use global variables (%hash_keys, %HoA) to pass data to or get data from subroutines.

    Try to simplify your structures; deeply nested ones lead to code that is difficult to understand. Also think about the proper type to use in every case. For instance, it makes no sense to use a hash to store an ordered list of values.

    And finally, modules are good only when they simplify your problem; to calculate the mean and deviation you don't really need a module!

    use strict;
    use warnings;
    no warnings 'uninitialized';

    my (%count, %sum, %sum2, %min, %max, @key);
    while (<>) {
        chomp;
        my @val = split /,/;
        if ($val[0] =~ /^Year/) {
            @key = @val;
        }
        else {
            @key == @val or die "number of values and keys don't match";
            for my $i (0 .. $#key) {
                my $key = $key[$i];
                next if $key =~ /Year|Time/i;
                my $val = $val[$i];
                $count{$key}++;
                $sum{$key}  += $val;
                $sum2{$key} += $val * $val;
                $min{$key} = $val
                    if (not defined $min{$key} or $val < $min{$key});
                $max{$key} = $val
                    if (not defined $max{$key} or $val > $max{$key});
            }
        }
    }

    for my $key (sort keys %count) {
        my $count = $count{$key};
        my $sum   = $sum{$key};
        my $mean  = $sum / $count;
        my $deviation = sqrt($sum2{$key} / $count - $mean * $mean);
        printf("key: %s, mean: %f, deviation: %f, min: %f, max: %f\n",
               $key, $mean, $deviation, $min{$key}, $max{$key});
    }

    oh, and consider using Text::xSV or Text::CSV_XS for parsing CSV files.

      Ummm... so he shouldn't use modules to calculate various statistics on data vectors, because that's simple, but he should use a module to split on a comma for him? That doesn't make sense to me.

      Sorry, I don't mean to bicker. In my opinion, modules are useful whenever they make even a simple but repetitive task nice and short, all the more so because they reduce the chance of an error... and occasionally I do get tired of writing the same code to get mean, variance, min, etc.

        but he should use a module to split on a comma for him..

        Parsing CSV files is not as simple as splitting on a comma: fields can contain quoted data with embedded commas, and records can span multiple lines.
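
        A minimal demonstration, assuming Text::CSV_XS is installed (the sample record here is made up to show a quoted, comma-containing field):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv  = Text::CSV_XS->new({ binary => 1 });
        my $line = '2005,"S/C Time, corrected",4.961763';

        # split /,/ would yield 4 broken fields; a real parser yields 3.
        $csv->parse($line) or die $csv->error_diag;
        my @fields = $csv->fields;
        printf "%d fields: %s\n", scalar @fields, join(' | ', @fields);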

Re: Efficient use of memory
by tlm (Prior) on Jun 05, 2005 at 14:40 UTC

    The code below doesn't add much to ivancho's very nice implementation++, except that I roll my own class to take care of the as-you-go computation of the desired statistics. (Good thing you didn't want the median!)

    I thought it was a rare example of an OOP application that is both simple enough to be used as, say, a classroom illustration or a tutorial, and entirely useful "as is". Plus, it illustrates techniques that are being discussed in another thread.
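
    tlm's class itself is in a collapsed section and is not reproduced here; a minimal sketch of such an as-you-go statistics class (names and details assumed, not tlm's actual code) might look like:

    package RunningStats;
    use strict;
    use warnings;

    # Accumulate count, sum, and sum of squares; track running min/max.
    # Nothing is stored per datapoint, so memory use stays constant.
    sub new {
        my $class = shift;
        return bless { n => 0, sum => 0, sum2 => 0,
                       min => undef, max => undef }, $class;
    }

    sub add {
        my ($self, $x) = @_;
        $self->{n}++;
        $self->{sum}  += $x;
        $self->{sum2} += $x * $x;
        $self->{min} = $x if !defined $self->{min} or $x < $self->{min};
        $self->{max} = $x if !defined $self->{max} or $x > $self->{max};
    }

    sub count { $_[0]{n} }
    sub min   { $_[0]{min} }
    sub max   { $_[0]{max} }
    sub mean  { $_[0]{n} ? $_[0]{sum} / $_[0]{n} : undef }

    sub stddev {
        my $self = shift;
        return undef unless $self->{n};
        my $m = $self->mean;
        return sqrt($self->{sum2} / $self->{n} - $m * $m);
    }

    package main;
    my $stats = RunningStats->new;
    $stats->add($_) for 1.29, 1.32, 1.29, 1.275, 1.29;
    printf "MIN:(%g) MEAN:(%g) MAX:(%g) STDEV:(%g) POINTS:(%d)\n",
        $stats->min, $stats->mean, $stats->max, $stats->stddev, $stats->count;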

    the lowliest monk
