http://www.perlmonks.org?node_id=885161


in reply to Handling HUGE amounts of data

Someone has suggested packing the data - which would probably be a good suggestion if I could figure out how.

Perhaps if I explain the flow.

'popfileb' creates a 2d array with just 'a', 'x', and 'd' as the values. The values are assigned to each element row by row, based on both the original input data and the data already written to the previous line. So, if there is a 'd' at $aod[4][5], then $aod[4][6] should also be a 'd', but if $aod[4][5] is an 'a', then $aod[4][6] has a (more or less) random chance of being assigned a 'd'.

'model1' and 'model2' (only one is ever called per run) create a 2d array (@aob). The data for each element in @aob depends on the values in the line above as well as on the corresponding element in @aod, AND then has a random number added to it.

The values from @aod and @aob are then combined in 'write_to_output', so that during the print-to-file phase all 'a's in @aod are replaced with the corresponding values in @aob.

So, how do I pack and unpack @aod one line (or one element) at a time? Again, I'm sure there's a simple way I'm not seeing, but I've never used pack/unpack before. I've never needed to.
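Since each element of @aod is a single character, "packing" a row can be as simple as storing the whole row as one string; a minimal sketch of the idea (the row contents here are made up for illustration):

```perl
use strict;
use warnings;

# One row of @aod as individual one-char elements:
my @row = qw( a d d x a );

# Pack the row into a single 5-byte string ('A*' = ASCII characters):
my $packed = pack 'A*', join '', @row;    # "addxa"

# Read one element without unpacking the whole row:
my $elem = substr $packed, 2, 1;          # 'd'

# Unpack the whole row back into individual chars:
my @unpacked = split //, unpack 'A*', $packed;
```

One string per row instead of 17120 scalars is where the memory saving comes from; substr gives random access into the row either way.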

Updated: As an experiment, I tried to generate and print out the full 17000 x 8400 @aod - ran out of memory at line 2216 (74,016 kb).

At least I now know where the bottleneck is.

Re^2: Handling HUGE amounts of data
by BrowserUk (Patriarch) on Jan 31, 2011 at 01:20 UTC
    Someone has suggested packing the data

    You forgot who?

    This doesn't attempt to perform your required processing, but just demonstrates that it is possible to have two 8400x17120 element datasets in memory concurrently, provided you use the right formats for storing them.

    From what you said, @aod only ever holds a single char per element, so instead of using a whole 64-byte scalar for each element, use strings of chars for the second level of @aod and use substr to access the individual elements.

    For @aob, you need only integers, so use Tie::Array::Packed for that. It uses just 4 bytes per element instead of 24, but as it is tied, you use it just as you would a normal array.

    Putting those two together, you can have both your arrays fully populated in memory and it uses around 1.2GB instead of 9GB as would be required with standard arrays:

    #! perl -slw
    use strict;
    use Tie::Array::Packed;
    #use Math::Random::MT qw[ rand ];

    $|++;

    my @aod = map { 'd' x 17120 } 1 .. 8400;

    ## To access individual elements of @aod,
    ## instead of $aod[ $i ][ $j ] use:
    ## substr( $aod[ $i ], $j, 1 );

    my @aob;
    for ( 1 .. 8400 ) {
        printf "\r$_";
        tie my @row, 'Tie::Array::Packed::Integer';
        @row = map { 1e5 + int( rand 9e5 ) } 1 .. 17120;
        push @aob, \@row;
    }

    ## For @aob use the normal syntax $aob[ $i ][ $j ],
    ## but remember that you can only store integers.

    <>;
    print $aob[ $_ ][ 10000 ] for 1 .. 8400;
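The rough arithmetic behind those figures (an editorial back-of-the-envelope, not from the original post; exact per-scalar overhead varies by perl build):

```perl
use strict;
use warnings;

my ( $rows, $cols ) = ( 8_400, 17_120 );
my $elems = $rows * $cols;          # ~144 million elements per array

my $aod_bytes = $elems * 1;         # one char per element in the strings
my $aob_bytes = $elems * 4;         # 4-byte packed integers

printf "elements per array:  %d\n", $elems;
printf "packed, both arrays: %.2f GB\n", ( $aod_bytes + $aob_bytes ) / 2**30;
printf "plain AoAs, approx:  %.1f GB at ~32 bytes/scalar\n",
    2 * $elems * 32 / 2**30;
```

That comes to roughly 0.7GB of raw data before per-row and tie overhead, against roughly 9GB for two plain arrays-of-arrays, which matches the figures quoted above.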

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      99.99% there - it ran out of memory when I hit the close button on the little Perl/Tk popup that comes up at the end to announce the data run was done.

      Converting @aod into a string was a big improvement, but so was finding an array that was hiding in a subroutine. Sometimes you're just too close to see things.

      Since I know the final user (my boy child) will want even more data, there's still a little more work to do.

      #model 1;
      sub popnum1 {
          ( $x, $y, $z ) = @_;
          if ( $y == 0 ) {
              $aob[$x][0] = $initial + $z;
          }
          else {
              if ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                  $aob[$x][$y] = $initial + $z;
              }
              else {
                  $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
              }
          }
          return $aob[$x][$y];
      }

      This is one version of the @aob generator. It's called only when the corresponding element in @aod is an 'a' (so it varies from one row to the next). $z is a freshly generated random number (a floating-point decimal, plus or minus) - got rid of another memory-eating array in favor of a single variable.

      So @aob is the last big array to be tamed. But I'm gaining on it.;)

        So @aob is the last big array to be tamed.

        Did you try Tie::Array::Packed?



        Well, it still throws an 'out of memory' when I close the little Perl/Tk popup that announces the script has finished running.

        I assume I've done this right as BrowserUk suggested using Tie::Array::Packed to save on RAM:

        tie @aob, 'Tie::Array::Packed::DoubleNative';

        #model 1;
        sub popnum1 {
            ( $x, $y, $z ) = @_;
            if ( $y == 0 ) {
                $aob[$x][0] = $initial + $z;
                $zaza = $aob[$x][0];
            }
            else {
                if ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                    $aob[$x][$y] = $initial + $z;
                    $zaza = $aob[$x][$y];
                }
                else {
                    $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
                    $zaza = $aob[$x][$y];
                }
            }
            return $zaza;
        }

        I figure that returning a single variable ($zaza) is more efficient than returning $aob[$x][$y] - it's hard to tell.

Re^2: Handling HUGE amounts of data
by ELISHEVA (Prior) on Jan 30, 2011 at 21:52 UTC

    If the values in line N of @aod are dependent on the values in line N-1 plus a random factor plus the corresponding line in @aob, then all you ever need to do is keep three lines in memory: (1) the previously constructed line of @aod (2) the line currently being constructed (3) the corresponding line of @aob. Your algorithm would look something like this:

    • construct line N from N-1
    • combine line N of @aod with line N of @aob
    • pack line N of @aob and write out to file
    • pack line N-1 of @aod and write out to file
    • make line N of @aod the new "N-1", repeat until you are done.
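    The write-side loop above might look something like this (a sketch with a toy fill rule, a made-up filename, and tiny dimensions - the real rule and the 17120 x 8400 sizes come from the thread):

```perl
use strict;
use warnings;

my ( $cols, $rows ) = ( 20, 5 );    # tiny demo; the real run is 17_120 x 8_400

open my $out, '>', 'aod.packed' or die "open: $!";

my $prev = 'a' x $cols;             # seed row (assumption: start all 'a')
for my $n ( 1 .. $rows ) {
    my $cur = '';
    for my $j ( 0 .. $cols - 1 ) {
        # toy stand-in for the real rule: a 'd' above forces a 'd',
        # otherwise a small random chance of turning into one
        $cur .= substr( $prev, $j, 1 ) eq 'd' ? 'd'
              : rand() < 0.05                 ? 'd'
              :                                 'a';
    }
    print {$out} $cur, "\n";        # write row N out, then forget it
    $prev = $cur;                   # row N becomes the new "N-1"
}
close $out or die "close: $!";
```

    Only two rows are ever alive at once, so peak memory stays flat no matter how many rows you generate.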

    As for unpacking, that depends on what you are doing with the data. However, if you can come up with a way to use the data only one or two lines at a time, then even during the unpack phase you won't need much memory. Just read in the number of bytes per line, unpack, process, and discard.
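    Reading back one line at a time works the same way in reverse; a sketch assuming fixed-length records with no separator (the in-memory handle here just stands in for the real data file):

```perl
use strict;
use warnings;

my $cols = 8;
my $data = 'aaddxaad' . 'ddddaaxa';             # two packed 8-byte rows

open my $fh, '<', \$data or die "open: $!";     # in-memory demo handle

my @seen;
while ( read( $fh, my $rec, $cols ) == $cols ) {
    my @elems = split //, $rec;                 # "unpack" one row into chars
    push @seen, $elems[0];                      # ...process, then let it go
}
close $fh;
```

    Each pass through the loop holds exactly one row; read() returns fewer than $cols bytes at end-of-file, which ends the loop.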