Handling HUGE amounts of data

by Dandello (Monk)
on Jan 30, 2011 at 07:30 UTC

Dandello has asked for the wisdom of the Perl Monks concerning the following question:

Yeah, I'm back again. I finally got the parameters of the data that needs to be generated for this project: a comma-delimited file 100,000+ characters wide and 8,400 lines long.

Needless to say, there are resource issues here. Just the first pass - pre-generating the necessary random numbers - created a 1+ Gig text file.

Has anyone here handled data this big? Any suggestions are welcome - are there some DB modules that would help?

The original help request is at: This runs WAY too slow

Yeah, I probably shouldn't post things when my eyes are crossed.

Re: Handling HUGE amounts of data
by GrandFather (Saint) on Jan 30, 2011 at 09:01 UTC

    Perl has pretty good support for databases of many sorts. A good starting point is DBD::SQLite, which provides a complete stand-alone database. Use DBI to access it, or look to the many DB abstraction layers available through CPAN.
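    As a rough illustration only (the file name, table and columns here are made up, not anything from your project), the DBI + DBD::SQLite pattern looks like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder database file and schema, purely for illustration.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=population.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS cells (year INTEGER, col INTEGER, value INTEGER)');

    # Batch the inserts inside one transaction - SQLite is far faster that way.
    my $sth = $dbh->prepare('INSERT INTO cells (year, col, value) VALUES (?, ?, ?)');
    for my $year ( 0 .. 9 ) {                      # tiny ranges for the example
        $sth->execute( $year, $_, int rand 1e6 ) for 0 .. 9;
    }
    $dbh->commit;
    $dbh->disconnect;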

    True laziness is hard work
Re: Handling HUGE amounts of data
by tilly (Archbishop) on Jan 30, 2011 at 08:41 UTC
    I have handled more data than this. But without having a concrete idea of what you are trying to do with the data, it is hard to know what to say. I just have no useful context.

    Are you trying to serve a web page from a CGI program? That's a lost cause with this data volume. (Unless you're providing a zipped file. Maybe.) Are you planning to process the data in some useful way? Well then what my advice would be depends on what you are trying to do with it.

    But at a minimum, lines that are 100K wide seem like a red flag. How many fields do you have? Why do you have so many? Is it possible to organize it in a saner fashion?

    If I knew that, then I could make suggestions about, for instance, how to use a relational database to better organize your data. Or perhaps DBM::Deep is the right module for you. Or perhaps you should take an entirely different approach. If I had some context I could make better guesses.

Re: Handling HUGE amounts of data
by BrowserUk (Patriarch) on Jan 30, 2011 at 10:11 UTC

    I can't help but wonder why you are presenting such huge amounts of data via cgi? Do you really expect anyone to actually read this volume of data?

    I also wonder how you think using a DB module will help you?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Handling HUGE amounts of data
by Dandello (Monk) on Jan 30, 2011 at 10:53 UTC

    Thank you all for at least looking

    This is a population-estimate table with a maximum population (the x axis) of 17,000, where each individual has a six-digit random number (among other things) assigned to it over 8,400 years.

    I've already done some rearranging of subs (like generating the random numbers one row at a time).

    I was thinking some DB management might be helpful simply because I know those can get huge.

    What I will probably do is break the output into interlocking chunks so that each chunk comes in at 20-40 MB instead of one output file at 400 MB.
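    Something along these lines, roughly (chunk size, file names and the line generator below are placeholders, not the real script):

    use strict;
    use warnings;

    # Hypothetical sketch: start a new output file every $lines_per_chunk
    # lines so no single chunk grows past a few tens of MB.
    my $lines_per_chunk = 1000;                    # placeholder
    my ( $chunk, $out );
    for my $n ( 0 .. 8399 ) {
        if ( $n % $lines_per_chunk == 0 ) {
            close $out if $out;
            open $out, '>', sprintf( 'chunk_%03d.csv', ++$chunk )
                or die "Cannot open chunk file: $!";
        }
        print {$out} make_line($n), "\n";
    }
    close $out if $out;

    # Stand-in for the real row generator.
    sub make_line { join ',', map { int rand 1e6 } 1 .. 17_000 }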

    When this project started, I was told it would be 2000 wide and 2000 tall - no problem. Then today I got the actual data - 17000 wide and 8400 tall.

    Luckily this is NOT a web app - I spent a week learning Perl/Tk so it could run from a C prompt.

      This produces a file of 8,400 lines of 17,000 random numbers each (~1 GB) in a little under 7 minutes.

      #! perl -slw
      use strict;

      use Math::Random::MT qw[ rand ];

      for ( 1 .. 8400 ) {
          print join ',', map 1e5 + int( rand 9e5 ), 1 .. 17000;
      }

      __END__
      [11:04:48.11] C:\test>885103 > junk.dat

      [11:12:10.44] C:\test>dir junk.dat
      30/01/2011  11:12       999,608,400 junk.dat

      I appreciate that your application is doing something more complicated in terms of the numbers produced, but my point is that creating a file this size isn't hard.

      So the question becomes, what problems are you having? What is it that your current code isn't doing? Basically, what is it that you are asking for help with? Because so far, that is completely unstated.



        Here's the processing file as it stands today:

        What it's doing is giving me an 'out of memory' when processing through data that should generate a two-dimensional 'array' 17,120 elements wide and 8,400 lines long.

        If I cut the number of lines down to 1200, it gets to 'write_to_output', begins to print to the file then gives me an 'out of memory' at about line 750. It also may or may not go back to the C prompt.

        If I cut the lines down to 800, it processes everything and brings up 'table1' as it should.

        However, even when it's finished writing to '$datafileout', there is a several-second delay after closing the 'table1' notice before the C prompt comes back.

        I'm assuming that means that some process hasn't been closed out properly, but for the life of me, I can't see what it is. All the file handles are closed and it doesn't throw any warnings.

        This is a Lenovo desktop with XP Pro and 4 Gigs of RAM.

Re: Handling HUGE amounts of data
by biohisham (Priest) on Jan 30, 2011 at 17:23 UTC
    Kudos for learning Tk-related stuff in such a short time...

    I suggest reading David Cross's Data Munging with Perl. While this may not answer your question directly, in the long run it will prove its worth, especially when it comes to dealing with different types of data...

    Best of luck..


    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: Handling HUGE amounts of data
by Dandello (Monk) on Jan 30, 2011 at 20:53 UTC

    Someone has suggested packing the data - which is probably a good suggestion if I could figure out how.

    Perhaps if I explain the flow.

    'popfileb' creates a 2d array with just 'a','x', and 'd' as the values. The values are assigned to each element row by row based on both the original inputted data and the data already written to the previous line. So, if there is a 'd' at $aod[4][5], then $aod[4][6] should also be a 'd', but if $aod[4][5] is an 'a', then $aod[4][6] has a (more or less) random chance of being assigned a 'd'.
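    Roughly, the rule for a single element looks like this (the probability here is just a placeholder, and I'm ignoring the 'x' case for the illustration):

    use strict;
    use warnings;

    # Illustration only: a 'd' persists into the next line, while an 'a'
    # has some random chance of becoming a 'd'. Sizes are tiny placeholders.
    my $chance_of_d = 0.01;                        # placeholder probability
    my @aod;
    $aod[$_][0] = 'a' for 0 .. 9;                  # line 0: all 'a'
    for my $y ( 1 .. 5 ) {
        for my $x ( 0 .. 9 ) {
            if ( $aod[$x][ $y - 1 ] eq 'd' ) {
                $aod[$x][$y] = 'd';
            }
            else {
                $aod[$x][$y] = rand() < $chance_of_d ? 'd' : 'a';
            }
        }
    }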

    'model1' and 'model2' (only one is ever called per run) create a 2d array (@aob). The data for each element in @aob is dependent on the values in the line above as well as the value of the corresponding element in @aod, and then has a random number added to it.

    The values from @aod and @aob are then combined in 'write_to_output' so during the printing to file phase all 'a's in @aod are replaced with the corresponding values in @aob.

    So, how do I pack and unpack @aod one line (or one element) at a time? Again, I'm sure there's a simple way I'm not seeing, but I've never used pack/unpack before. I've never needed to.
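    Is the basic pattern something like this (made-up values, one line at a time)?

    use strict;
    use warnings;

    # One line of single-character flags stored as a plain string:
    my @flags = qw( a a d x a d );                 # made-up values
    my $line  = join '', @flags;                   # one byte per flag
    my $flag3 = substr $line, 3, 1;                # element 3 ('x')

    # One line of integers packed into 4 bytes each:
    my @nums   = ( 112, 448, 671, 1119 );          # made-up values
    my $packed = pack 'l*', @nums;                 # 4 bytes per integer
    my @back   = unpack 'l*', $packed;             # back to an ordinary list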

    Updated: As an experiment, I tried to generate and print out the full 17000 x 8400 @aod - ran out of memory at line 2216 (74,016 kb).

    At least I now know where the bottle-neck is.

      Someone has suggested packing the data

      You forgot who?

      This doesn't attempt to perform your required processing, but just demonstrates that it is possible to have two 8400x17120 element datasets in memory concurrently, provided you use the right formats for storing them.

      From what you said, @aod only ever holds a single char per element, so instead of using a whole 64-byte scalar for each element, use strings of chars for the second level of @aod and use substr to access the individual elements.

      For @aob, you need only integers, so use Tie::Array::Packed for that. It uses just 4 bytes per element instead of 24, but as it is tied, you use it just as you would a normal array.

      Putting those two together, you can have both your arrays fully populated in memory and it uses around 1.2GB instead of 9GB as would be required with standard arrays:

      #! perl -slw
      use strict;

      use Tie::Array::Packed;
      #use Math::Random::MT qw[ rand ];

      $|++;

      my @aod = map {
          'd' x 17120;
      } 1 .. 8400;

      ## To access individual elements of @aod
      ## instead of $aod[ $i ][ $j ] use:
      ## substr( $aod[ $i ], $j, 1 );

      my @aob;
      for ( 1 .. 8400 ) {
          printf "\r$_";
          tie my @row, 'Tie::Array::Packed::Integer';
          @row = map { 1e5 + int( rand 9e5 ) } 1 .. 17120;
          push @aob, \@row;
      }

      ## For @aob use the normal syntax $aob[ $i ][ $j ]
      ## but remember that you can only store integers

      <>;

      print $aob[ $_ ][ 10000 ] for 0 .. 8399;


        99.99% there - it ran out of memory when I hit the close button on the little Perl/Tk popup that comes up at the end to announce the data run was done.

        Converting @aod into a string was a big improvement, but so was finding an array that was hiding in a subroutine. Sometimes you're just too close to see things.

        Since I know the final user (my boy child) will want even more data, there's still a little more work to do.

        #model 1;
        sub popnum1 {
            ( $x, $y, $z ) = @_;
            if ( $y == 0 ) {
                $aob[$x][0] = $initial + $z;
            }
            else {
                if ( substr( $aod[ $y - 1 ], $x, 1 ) ne 'a' ) {
                    $aob[$x][$y] = $initial + $z;
                }
                else {
                    $aob[$x][$y] = $z + $aob[$x][ $y - 1 ];
                }
            }
            return $aob[$x][$y];
        }

        This is one version of the @aob generator. It's called only when the corresponding element in @aod is an 'a' (so it varies from one row to the next). $z is a freshly generated random number (a floating-point decimal, plus or minus); I got rid of another memory-eating array in favor of a single variable.

        So @aob is the last big array to be tamed. But I'm gaining on it. ;)

      If the values in line N of @aod are dependent on the values in line N-1 plus a random factor plus the corresponding line in @aob, then all you ever need to do is keep three lines in memory: (1) the previously constructed line of @aod (2) the line currently being constructed (3) the corresponding line of @aob. Your algorithm would look something like this:

      • construct line N from N-1
      • combine line N of @aod with line N of @aob
      • pack line N of @aob and write out to file
      • pack line N-1 of @aod and write out to file
      • make line N of @aod the new "N-1", repeat until you are done.

      As for unpacking, that depends on what you are doing with the data. However, if you can come up with a way to use the data only one or two lines at a time, then even during the unpack phase you won't need much memory. Just read in the number of bytes per line, unpack, process, and discard.
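      A bare-bones sketch of that loop (the combine step and the row-building helper below are stand-ins, not your actual rules):

      use strict;
      use warnings;

      my $width = 17_120;
      open my $out, '>', 'combined.csv' or die "Cannot open output: $!";

      my $prev = 'a' x $width;                     # line 0 of @aod (placeholder)
      for my $n ( 1 .. 8400 ) {
          # (1) build the current @aod line from the previous one
          my $curr = next_aod_line($prev);
          # (2) build the matching @aob line and combine the two
          my @aob_line = map { 1e5 + int rand 9e5 } 1 .. $width;
          my @combined = map {
              my $flag = substr $curr, $_, 1;
              $flag eq 'a' ? $aob_line[$_] : $flag;
          } 0 .. $width - 1;
          # (3) write the combined line out; keep only the new @aod line
          print {$out} join( ',', @combined ), "\n";
          $prev = $curr;
      }
      close $out;

      # Stand-in for the real rule: 'd' stays 'd', 'a' sometimes becomes 'd'.
      sub next_aod_line {
          my ($last) = @_;
          return join '', map {
              my $c = substr $last, $_, 1;
              $c eq 'd' ? 'd' : ( rand() < 0.01 ? 'd' : $c );
          } 0 .. length($last) - 1;
      }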

Re: Handling HUGE amounts of data
by Dandello (Monk) on Feb 01, 2011 at 00:29 UTC

    Still having 'out of memory' issues even though it finishes its run on my machine.

    But it's not finishing on the machine it needs to run on. And since that one's six hours away, there's not much I can do about it except get the code working even better.

    Here's the initial data from tmp/datatrans.txt: 1|112|3|21|vertical|1|1296362070|0,400,800,1200,1600,2000,2400,2800,3200,3600,4000,4400,4800,5200,5600,6000,6400,6800,7200,7600,8000,8400,|112,448,448,671,112,336,112,560,112,1119,448,448,783,783,1902,3133,4812,4588,8729,9400,14995,5371,|

    UPDATE

    I've split the script into two and am using Tie::File on the second part to access the huge chunk of data one line at a time. So far so good. No more 'out of memory' with the data above.

    So here's the code as I have it now:

    Now, right now it still puts a huge amount of data into memory but I'm experimenting with writing the data into tmp/test.txt one line at a time using Tie::File. So any suggestions will help.
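    The Tie::File pattern I'm experimenting with is roughly this (paths and contents are placeholders):

    use strict;
    use warnings;
    use Tie::File;

    # Tie the output file to an array; assigning to $lines[$n] writes line $n
    # to disk instead of keeping the whole file in memory.
    tie my @lines, 'Tie::File', 'tmp/test.txt'
        or die "Cannot tie tmp/test.txt: $!";

    for my $n ( 0 .. 9 ) {                         # tiny range for the example
        $lines[$n] = join ',', map { int rand 1e6 } 1 .. 10;
    }

    untie @lines;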

    And of course, there's still the issue of why my son's almost new Win 7 gaming machine isn't performing as well on this as my Win32 XP Pro machine.
