Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re^3: Handling HUGE amounts of data

by Dandello (Scribe)
on Jan 30, 2011 at 17:32 UTC ( #885131=note: print w/ replies, xml ) Need Help??


in reply to Re^2: Handling HUGE amounts of data
in thread Handling HUGE amounts of data

Here's the processing file a it stands today:

#!/usr/bin/perl # $Id: dataout $ # $Date: 1.27.11 $ # $HeadURL: adamant.net $ # $Revision: 2011 $ # $Source: /dataout.pl $ ###################################################################### +############ use strict; use warnings; use CGI::Carp; use lib 'C://testing/lib'; use List::Util qw(shuffle sum); use Math::Random::MT qw(srand rand); #use Math::Random::MT::Auto qw(rand); use Readonly; use Time::Local; use Tk; our $VERSION = 2.40; my ( $chntot, $countgen, $errchk, $ercount, $incrs, $incrsdel, $tota +l2, $tot2error, $x, $xb, $y, $yb, @aoa, @aob, @aod, @aox, $for1, $for2, $mw, ); Readonly my $PCNT => 100; Readonly my $C3 => 3; Readonly my $C4 => 4; Readonly my $C5 => 5; Readonly my $C60 => 60; Readonly my $C10 => 10; Readonly my $NEG => -1; Readonly my $YEAR => 1900; my $filename = 'tmp/datatrans.txt'; open my $DAT, '<', $filename or croak 'cannot open file'; my @dataa = <$DAT>; close $DAT or croak 'cannot close SFILE'; my @mpy = (); # year my @mpe = (); # population estimate my ( $model, $initial, $copyerr, $LST, $file, $format, $timein, $popyr, $popest, $nul ); #1|8|5|8|newt|2|1296234929|0,10,15,20,25,30,35,40,45,|5,10,15,10,15,25 +,20,30,20,| ( $model, $initial, $copyerr, $LST, $file, $format, $timein, $popyr, $popest, $nul ) = split /[|]/xsm, $dataa[0]; my @mpex = split /\,/xsm, $popest; my @mpyx = split /\,/xsm, $popyr; $#mpyx = $LST; $#mpex = $LST; foreach my $tst ( 0 .. $LST ) { $mpy[$tst] = $mpyx[$tst] || 1; $mpe[$tst] = $mpex[$tst] || 0; } $mpy[0] = 0; #set delimiters; if ( $format == 1 ) { $for1 = q{,}; $for2 = q{csv}; } elsif ( $format == 2 ) { $for1 = qq{\t}; $for2 = q{tab}; } else { $for1 = q{|}; $for2 = q{txt}; } #set file names my $datafileout = q{data/} . $file . q{output} . $for2 . q{.} . $for2; my $datafilein = 'data/' . $file . 'input' . $for2 . '.txt'; #date array my @months = qw( January February March April May June July August September October November December ); # CALCS my @grand = (0); my @popbyyer = ( $mpe[0] ); my @test = (0); for my $y ( 1 .. $LST ) { my $y_diff = ( $mpy[$y] - $mpy[ $y - 1 ] ) || 1; my $e_diff = ( $mpe[$y] - $mpe[ $y - 1 ] ); $popbyyer[$y] = $e_diff / $y_diff; $grand[$y] = $grand[ $y - 1 ] + ( ( $e_diff < 0 ) ? abs($e_diff) : + 0 ); if ($e_diff > 0){$test[$y] = $e_diff;} else {$test[$y] = 0;} } my $gener = ( sort { $a <=> $b } @mpy )[$NEG]; my $total = $mpe[0] + sum(@test); #build data table; my @dr0 = (); my @dr1 = (); my @dr2 = (); my @dr3 = (); my @dr4 = (); my @dr5 = (); $dr0[0] = 0; $dr3[0] = 0; $dr5[0] = 0; foreach my $tablem ( 1 .. $LST - 1 ) { $dr0[$tablem] = $mpy[$tablem] + 1; $dr3[$tablem] = $mpy[$tablem]; $dr5[$tablem] = $grand[$tablem]; } foreach my $tablen ( 0 .. $LST - 1 ) { $dr1[$tablen] = $mpy[ $tablen + 1 ]; $dr2[$tablen] = $mpe[$tablen]; $dr4[$tablen] = $popbyyer[ $tablen + 1 ]; } popfileb(); write_to_output(); table1(); #increase - decrease array; sub popfileb { my $countgen0 = 0; foreach my $dlp ( 0 .. $LST - 1 ) { foreach my $yb ( $dr0[$dlp] .. $dr1[$dlp] ) { $countgen = $countgen0++ - $dr3[$dlp]; my $change = $countgen * $dr4[$dlp]; $incrs = int( $dr2[$dlp] + $change ); $incrsdel = int( $dr5[$dlp] + abs $change ); $chntot = $incrsdel + $incrs; foreach my $xb ( 0 .. $total ) { #set first row; if ( $yb == 0 ) { if ( $xb < $dr2[$dlp] ) { $aod[$xb][0] = 'a'; } else { $aod[$xb][0] = 'n'; } } #set increasing rows; elsif ( $dr4[$dlp] >= 0 ) { if ( $xb < $dr5[$dlp] + $incrs ) { if ( $aod[$xb][ $yb - 1 ] eq 'd' ) { $aod[$xb][$yb] = 'd'; } else { $aod[$xb][$yb] = 'a'; } } else { $aod[$xb][$yb] = 'n'; } } #set decreasing rows; else { if ( $xb <= $chntot ) { if ( $aod[$xb][ $yb - 1 ] eq 'd' ) { $aod[$xb][$yb] = 'd'; } elsif ( $aod[$xb][ $yb - 1 ] eq 'a' ) { $aod[$xb][$yb] = 'a'; } else { $aod[$xb][$yb] = 'n'; } } else { $aod[$xb][$yb] = 'n'; } } } #set random decreased cell; if ( $dr4[$dlp] < 0 ) { #pre-populate deletion array; my @delarray = ( 0 .. $chntot ); my @dlarray = shuffle @delarray; my $cndiea = 0; foreach my $del ( 0 .. $chntot ) { if ( defined $aod[$del][$yb] && $aod[$del][$yb] eq + 'd' ) { $cndiea++; } } my $cnr = $cndiea; for my $xd ( 0 .. $chntot ) { my $xda = $dlarray[$xd]; if ( defined $aod[$xda][$yb] && $aod[$xda][$yb] eq 'a' && $cnr < $incrsdel ) { $aod[$xda][$yb] = 'd'; $cnr++; } } } } } my $error = 0; foreach my $mycheck ( 0 .. $total ) { if ( $aod[$mycheck][ $mpy[$NEG] ] eq 'd' ) { $error++; } $ercount = $error; } my $toterror = 0; foreach my $mycheck2 ( 0 .. $total ) { if ( $aod[$mycheck2][ $mpy[$NEG] ] eq 'n' ) { $toterror++; } $tot2error = $toterror; } return; } #model 1; sub popnum1 { ( $x, $y ) = @_; if ( $y == 0 ) { $aob[$x][0] = $initial + $aoa[$x]; } else { if ( $aod[$x][ $y - 1 ] ne 'a' ) { $aob[$x][$y] = $initial + $aoa[$x]; } else { $aob[$x][$y] = $aoa[$x] + $aob[$x][ $y - 1 ]; } } return $aob[$x][$y]; } #model 2; sub popnum2 { ( $x, $y ) = @_; my @delarray = ( 0 .. $total ); my @dlarray = shuffle @delarray; if ( $y == 0 ) { $aob[$x][0] = $initial + $aoa[$x][0]; } elsif ( $aod[$x][ $y - 1 ] eq 'n' ) { my $cnr = 0; for my $xd ( 0 .. $total ) { my $xda = $dlarray[$xd]; if ( $cnr < 2 && defined $aob[$xda][ $y - 1 ] ) { $aob[$x][$y] = $aoa[$x][$y] + $aob[$xda][ $y - 1 ]; $cnr++; } } } else { $aob[$x][$y] = $aoa[$x][$y] + $aob[$x][ $y - 1 ]; } return $aob[$x][$y]; } sub write_to_output { $total2 = $total - $tot2error; my $cell = 0; open my $DATABASE, '>', $datafileout or croak 'dataout not made.'; #flock $DATABASE, 2; foreach my $drp ( 0 .. $LST - 1 ) { foreach my $y ( $dr0[$drp] .. $dr1[$drp] ) { foreach my $xa ( 0 .. $total ) { my $copycop = ( $copyerr - int rand( 1 + 2 * $copyerr +) ) / $PCNT; $aoa[$xa] = $copycop; } for my $x ( 0 .. $total2 ) { if ( $aod[$x][$y] eq 'd' ) { $aox[$x][$y] = qq{d$for1}; } elsif ( $aod[$x][$y] eq 'a' ) { if ( $model == 1 ) { $cell = sprintf '%.2f', popnum1( $x, $y ); } else { $cell = sprintf '%.2f', popnum2( $x, $y ); } $aox[$x][$y] = qq{$cell$for1}; } else { $aox[$x][$y] = qq{x$for1}; } print {$DATABASE} $aox[$x][$y] or croak 'unable to pri +nt'; } print {$DATABASE} qq{\n} or croak 'unable to print'; print qq{Printing line $y of $mpy[$NEG]\n} or croak 'unable to print to screen'; } } close $DATABASE or croak 'data1 not closed.'; return; } sub table1 { #check that 'd' cell ammount matches; my $errorcheck = $ercount - $grand[$NEG]; if ( $errorcheck != 0 ) { $errchk = qq{Error check: Oops, you propably want to reload this page - off by $ +errorcheck.}; } else { $errchk = q{Okay}; } my $now = time; my ( $secs, $mina, $hrs, $dys, $mnths, $yrs, $wdays, $ydays, $isdsts, $mins, $secsb, $minab, $hrsb, $dysb, $mnthsb, $yrsb, $wdaysb, $ydaysb, $isdstsb, ); ( $secsb, $minab, $hrsb, $dysb, $mnthsb, $yrsb, $wdaysb, $ydaysb, +$isdstsb ) = localtime $timein; ( $secs, $mina, $hrs, $dys, $mnths, $yrs, $wdays, $ydays, $isdsts +) = localtime $now; my $yras = $yrsb + $YEAR; my $elap = sprintf q{%.2f}, ( $now - $timein ) / $C60; if ( $minab < $C10 ) { $mins = qq{0$minab}; } else { $mins = $minab; } my $tpop = $total2 + 1; open my $DATABASEIN, '>', $datafilein or croak 'datain not made.'; print {$DATABASEIN} qq{Model: $model\nInitial# $initial\nCopy Error +- $copyerr\n} or croak 'cannot print end head'; foreach my $chk ( 0 .. $LST ) { print {$DATABASEIN} qq {Year $mpy[$chk] - $mpe[$chk]\n} or croak 'cannot print check'; } print {$DATABASEIN} qq{Started at: $months[$mnthsb] $dysb, $yras at $hrsb:$mins\n} or croak 'cannot print time1'; print {$DATABASEIN} qq{Elapsed time: $elap minutes\n} or croak 'cannot print time2'; print {$DATABASEIN} qq{Width: $tpop cells\n$errchk} or croak 'cannot print time2'; close $DATABASEIN or croak 'data1 not closed.'; $mw = MainWindow->new; my $font = $mw->fontCreate( -family => '{MS Sans Serif}', -size => 10, -weight => 'bold' ); my $frame = $mw->Frame( -borderwidth => 2, -relief => 'groove', -background => 'white', ); my $lab0 = $frame->Label( -text => q{Population Model Form}, -font => [ '{MS Sans Serif}', '14', 'bold', ], -background => 'white', -foreground => 'red', -padx => 20, )->pack(); my $lab1 = $frame->Label( -text => qq{Completed in $elap minutes}, -font => $font, -background => 'white' )->pack(); my $button = $frame->Button( -text => q{Done}, -command => \&somesub )->pack(); my $labnul = $frame->Label( -text => q{ }, -font => $font, -background => 'w +hite' ) ->pack(); $frame->pack; MainLoop; return; } sub somesub { local $, = qq{\n}; $mw->destroy; return; } exit;

What it's doing is giving me an 'out of memory' when processing though data that should generate a 2 dimensional 'array' of 17120 elements wide and 8400 lines long.

If I cut the number of lines down to 1200, it gets to 'write_to_output', begins to print to the file then gives me an 'out of memory' at about line 750. It also may or may not go back to the C prompt.

If I cut the lines down to 800, it processes everything and brings up 'table1' as it should.

However, even when it's finished writing to '$datafileout', there seems to be a several second delay after closing the 'table1' notice and the C prompt comes back.

I'm assuming that means that some process hasn't been closed out properly, but for the life of me, I can't see what it is. All the file handles are closed and it doesn't throw any warnings.

This is a Lenovo desktop with XP Pro and 4 Gigs of RAM.


Comment on Re^3: Handling HUGE amounts of data
Download Code
Re^4: Handling HUGE amounts of data
by BrowserUk (Pope) on Jan 30, 2011 at 18:37 UTC
    'out of memory' when processing though data that should generate a 2 dimensional 'array' of 17120 elements wide and 8400 lines long.... XP Pro and 4 Gigs of RAM

    A 17120x8400 array of integers requires 4.5GB of ram.

    If your XP is running 32-bit, the most ram available to perl is 2GB and you will run out of memory.

    If you are running 64-bit windows and a 64-bit perl, then you will move into swapping, and processing will get horribly slow.

    Is it necessary for your algorithm that you build the entire data structure in memory?

    Or could you produce your data one row at a time and then write it to disk before starting on the next? The example I posted above does this and only requires 9MB of ram.

    If it is absolutely necessary to build the whole thing in memory before outputting it, then I would pack the integers in memory. An array of 8400 strings length 64k (17120*4-bytes) comes to only 1/2GB. Whilst unpacking/repacking the values to perform the required math would slow things down a little, it will still be much faster than alternatives.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If it is absolutely necessary to build the whole thing in memory before outputting it, then I would pack the integers in memory.
      vec is a usable alternative to pack/unpack, in terms of functionality, though it might be a little more userfriendly in terms of API. Or not, depending on your personal preference.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://885131]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (8)
As of 2014-09-21 15:35 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (172 votes), past polls