http://www.perlmonks.org?node_id=156941

mrbbking has asked for the wisdom of the Perl Monks concerning the following question:

I've been looking around the Monastery this morning, but have not found what I need. The nodes on memory usage woes and on data structures and out of memory errors come close.

I wrote a data conversion program that reads input from several files, munges it around, and spits it out in a single standard file format. (Standard for my company's internal systems, that is...)

In the munging process, I create a wide and shallow hash of hashes. I fill it something like this:

$user_books{$user_id}{$ISBN} = $title;
I will have many ${user_id}s, each of which has an average of ten ${ISBN}s, each of which has exactly one ${title}.
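
For illustration only, here is a minimal sketch of that fill loop; the tab-separated record layout and the filename are assumptions, not the real conversion code:

use strict;
use warnings;

my %user_books;

open my $in, '<', 'exported_books.txt' or die "Can't open exported_books.txt: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($user_id, $isbn, $title) = split /\t/, $line;   # assumed record layout
    $user_books{$user_id}{$isbn} = $title;
}
close $in;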

I've been using it for several weeks now, on relatively small input files - 'small' meaning 1,000 or fewer ${user_id}s. It's been very helpful where I work, and I'm getting questions along the lines of, "How many users can you convert, assuming the median number of books per user is ten?"

So, I need to know how much data I can handle. The only limitation I see is the amount of memory (real and virtual) available on the machine running the conversion, and this hash of hashes is the only long-lived thing of substantial size involved. Everything else is small, and scoped very narrowly.

I think I need to know how much memory each level of the hash requires, and then I can just do the math - how many records can I have before I run out of memory.
Am I asking the right question? If so, can anyone help me learn how to determine the memory requirements?

Re: How to Calculate Memory Needs?
by Fletch (Bishop) on Apr 05, 2002 at 15:08 UTC

    Grab the GTop module from CPAN. Use it to get the memory size of your process after you start up and have loaded whatever modules. Create one of your data structures and look at memory usage. Subtract. :)

    The mod_perl guide has some good info on this, and I think there were also some recent articles on Apache Week by Stas (author of the mod_perl guide). Granted, that's skewed towards a mod_perl perspective, but it's a starting place.
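
    A minimal sketch of that measure-and-subtract approach with GTop (this assumes libgtop is installed; the dummy data is purely there to have something to measure):

    use GTop ();

    my $gtop   = GTop->new;
    my $before = $gtop->proc_mem($$)->size;    # process size in bytes, before

    my %user_books;
    for my $user (1 .. 1_000) {                # dummy data, purely for measurement
        $user_books{"user$user"}{"0596000278"} = "Programming Perl";
    }

    my $after = $gtop->proc_mem($$)->size;     # process size in bytes, after
    printf "Roughly %d bytes for the structure\n", $after - $before;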

Re: How to Calculate Memory Needs?
by joealba (Hermit) on Apr 05, 2002 at 16:10 UTC
    Would it be possible for you to store the data in a database table or some type of disk storage, rather than storing all of it in memory? That would make your program much more scalable...

    Otherwise, I'm surprised that I can't find any docs ANYWHERE that show how to calculate the memory usage of data structures.



    Update: Check out Camel 3, p. 504. Maybe PERL_DEBUG_MSTATS will help?

    Update2: Hmm.. Who wants to play with some inline C and figure out a way to do a sizeof() to pull the memory usage of a hash? Is that possible?

    Update3: Here's da goodz: http://take23.org/docs/guide/performance.xml/4. It has GTop examples. Fletch was all over that from the start. :)

      Would it be possible for you to store the data in a database table or some type of disk storage, rather than storing all of it in memory? That would make your program much more scalable...

      That is possible in most cases, but not always. I once had to code something that had to parse over 200 MB of data with absolutely no disk I/O - no ramdrives either. The box had 1 GB of RAM, so that was not a problem.

      Unfortunately, I had to reinvent a wheel. There already was a parser for that particular data format, but its author had only thought about scalability on machines with hard disks.

      When creating something that handles a lot of data, consider:

      • RAM is faster than disk
      • Disk usually has more free space
      • Letting the OS swap to disk can be faster than handling I/O yourself (not always true)
      • In-RAM filesystems can come in handy
      • Writing large chunks is faster (generate > /tmpfs/foo; mv /tmpfs/foo foo can be faster than generate > foo) - see the sketch below
      • Flush data BEFORE you run out of RAM.
      My workstation has 1 GB of RAM, and another gigabyte of swap space. I tend to slurp files, do things in memory, and spit out entire files at once. This drastically decreases time spent on coding, and increases speed (most of the time). :)
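
      As a rough illustration of the write-to-RAM-then-move idea in Perl (the /tmpfs and /data paths are assumptions; adjust them to whatever in-RAM filesystem you actually have):

      use File::Copy qw(move);

      my @generated_lines = map { "record $_" } 1 .. 100_000;   # stand-in data

      # Build the file on an in-RAM filesystem first...
      open my $out, '>', '/tmpfs/report.txt' or die "open: $!";
      print {$out} "$_\n" for @generated_lines;
      close $out or die "close: $!";

      # ...then move it to its final home on disk in one big sequential write.
      move('/tmpfs/report.txt', '/data/report.txt') or die "move: $!";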

      U28geW91IGNhbiBhbGwgcm90MTMgY
      W5kIHBhY2soKS4gQnV0IGRvIHlvdS
      ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
      geW91IHNlZSBpdD8gIC0tIEp1ZXJk
      

      I figured I'd go with storing it on disk only if I can't do it in memory. I'm not sure right now exactly how I could use it from disk, but it works now if the whole thing is in memory. Here I must cite the virtue of laziness. (Though in this context, I'm not certain that it's still a virtue...)

      Found this in the Perl CD Bookshelf:

      PERL_DEBUG_MSTATS
      Relevant only if Perl is compiled with the malloc function included with the Perl distribution (that is, if perl -V:d_mymalloc yields "define"). If set, this causes memory statistics to be displayed after execution. If set to an integer greater than one, also causes memory statistics to be displayed after compilation.
      ...but perl -V:d_mymalloc gives me 'undef', so that's out for me.

      If your Inline C suggestion is possible, it's beyond my current capabilities... :-(
      So, I'm off to read all about GTop, then!

Re: How to Calculate Memory Needs?
by Zaxo (Archbishop) on Apr 05, 2002 at 16:32 UTC

    Assuming you have a database keyed by ISBN, you can reduce the size by just saying

    push @{$user_books{$user_id}}, $ISBN;
    This gives you a HoA with just the book keys. Without the database, you can still save space and normalize your data by keeping a separate hash of unique ISBNs.
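
    A rough sketch of that normalized layout (get_next_record() is just a hypothetical stand-in for however the input is actually read):

    my (%user_books, %title_for);

    while (my ($user_id, $isbn, $title) = get_next_record()) {
        push @{ $user_books{$user_id} }, $isbn;   # HoA: user -> list of ISBNs
        $title_for{$isbn} = $title;               # each title stored only once
    }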

    After Compline,
    Zaxo

Re: How to Calculate Memory Needs?
by perrin (Chancellor) on Apr 05, 2002 at 16:28 UTC
    You really can't do any better than some back-of-an-envelope guesswork for this sort of thing. How much memory will it take to process 10000 users? Measure it for 1000 and multiply by 10.

    Of course the best thing to do is just test it. Generate test input for 10000 users and see if it breaks. Keep adding more until it does. Then change your algorithm to use less memory if you need it to go further. Remember that you can almost always trade speed for memory.
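
    For example, here is a quick way to fake the load without real input files (the user ID and ISBN formats are made up):

    my $n_users = shift || 10_000;    # bump this until something breaks

    my %user_books;
    for my $u (1 .. $n_users) {
        for my $b (1 .. 10) {
            my $isbn = sprintf "%010d", $u * 10 + $b;    # fake ISBN, format made up
            $user_books{"user$u"}{$isbn} = "Title $u-$b";
        }
    }
    print "Built ", scalar keys %user_books, " users without running out of memory\n";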

Re: How to Calculate Memory Needs?
by mrbbking (Hermit) on Apr 05, 2002 at 20:53 UTC
    Thanks for all of your suggestions and pointers. In the interest of wrapping things up, and sharing what I learned in the process, here's what I did:
    1. Read up on GTop.
    The README on CPAN says,
    this package is a Perl interface to libgtop: http://home-of-linux.org/gnome/libgtop/
    (The link doesn't point anywhere, BTW.) But my platform choices are Windows 2000 or Mac OS X, neither of which has gtop or libgtop on it. So, to use this module, I'd have to find a gtop port or find a Linux box. Either of those is certainly possible, but far more effort than I'd hoped to expend. The link joealba provided has great usage samples for GTop, for anyone interested. I was all ready to install it! :-(
    Next option...
    2. PERL_DEBUG_MSTATS
    My Perl was not compiled with support for this.
    Next option...
    3. Store the hash in a file
    This would also require re-designing the script. Until I have some idea that the limitations of doing it all in RAM are too restrictive, I'd rather not go through that work. Definitely my next choice if this doesn't scale well.
    Next option...
    4. If I can't use GTop, what could I use instead...
    Add sleep 10; at the beginning of the script, and again at the end. That gives me time to check the memory used by perl.exe in the Windows 2000 Task Manager twice: before the script builds any data structures, and after it has built them all, while they're still in scope. Then I can take Fletch's advice and 'subtract.' (A sketch follows this list.)
    This did the trick.
    The difference in memory taken up by perl.exe at the beginning of the script and after building the hash for 160 users and 1,600 books is 228 KB. Seems pretty teeny. At the moment, I have 61 MB of free physical RAM. If this scales linearly (and I believe it does), that allows me roughly 42,500 users before even going to swap (theoretically). So this will scale just fine for the volumes I'm likely to encounter.
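
    In sketch form, the instrumented run looks roughly like this (the fill loop is a dummy stand-in for the real conversion code):

    sleep 10;    # note the size of perl.exe in Task Manager now

    my %user_books;
    for my $u (1 .. 160) {                       # same scale as the 160-user test
        $user_books{"user$u"}{"isbn-$u-$_"} = "title $_" for 1 .. 10;
    }

    sleep 10;    # ...and again here, while the hash is still in scope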

    So I ended up going with a combination of Fletch's and joealba's "look at how much memory you're already using and do the arithmetic" approach and perrin's "don't worry about precision, just play with it and be happy with a rough idea" approach.

    Many thanks for all of your suggestions. This was very helpful.
    -- mrbbking

      228,000 bytes / 1600 books = 142.5 bytes used per book on average for storage in the hash.

      Subtract from 142.5 the average number of characters in the user ID AND the number of characters in an ISBN number. Then, you've got a reasonable estimate of the number of bytes of overhead that are used for each element in your hash.

      Then, you can make some good guesses about the upper memory bound on your program!
      I'm a bit late in this thread, but I have a similar problem: evaluating the size of a big hash that I have in memory.
      I searched PerlMonks for some clue on this issue and found this thread. I've read your ideas, and I found them very interesting. Anyway, I came up with a different approach, and I'd like to share my idea with you so that you can give me feedback.
      The idea is quite simple: I build the hash, use Storable::freeze (see the Storable module) to freeze it in memory, and then take the length of the frozen string. I got this idea reading the Storable man page; here is an example from it:
      use Storable qw(store retrieve freeze thaw dclone);
      %color = ('Blue' => 0.1, 'Red' => 0.8, 'Black' => 0, 'White' => 1);
      $str = freeze(\%color);
      printf "Serialization of %%color is %d bytes long.\n", length($str);

      What do you think of this approach?

      marcos
        I don't know either way, but I would think there are differences between the size of the in-memory data structure and the stored version. Even so, it would probably give you a good ballpark figure, though you would use more memory, since you'd have both the structure and the frozen copy in memory at once.

        -Lee

        "To be civilized is to deny one's nature."
      If you are accessing the data serially (and even if you're not), a tied hash with MLDBM would be a good choice. The modifications would probably be simple. The biggest issue is that you can't directly assign to sub-elements of a complex data structure.
      $MLDBM_Hash{$user}{$subkey} = $value;   # Won't work.

      my $hash = $MLDBM_Hash{$user};          # Will work.
      $hash->{$subkey} = $value;              # Modify
      $MLDBM_Hash{$user} = $hash;             # Reassign
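
      For completeness, a rough sketch of the tie setup this assumes (the back-end modules and the filename are illustrative choices, not requirements):

      use MLDBM qw(DB_File Storable);   # back end picked for illustration; SDBM is the default
      use Fcntl;

      tie my %MLDBM_Hash, 'MLDBM', 'user_books.db', O_CREAT|O_RDWR, 0640
          or die "Can't tie user_books.db: $!";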


      -Lee

      "To be civilized is to deny one's nature."
Re: How to Calculate Memory Needs?
by petdance (Parson) on Apr 06, 2002 at 04:06 UTC
    Unrelated to your question, but: you're using the Business::ISBN module for validating your ISBNs, I hope. (I've done a bunch of work on it, and I use it all the time in my day job at a book distributor.)
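
    In case it helps, the basic validation check looks something like this (the sample ISBN is just an example value):

    use Business::ISBN;

    my $isbn = Business::ISBN->new('0596000278');    # sample ISBN (Programming Perl, 3rd ed.)
    if ($isbn and $isbn->is_valid) {
        print "Valid ISBN: ", $isbn->as_string, "\n";
    }
    else {
        warn "Not a valid ISBN\n";
    }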

    Any improvements and/or suggestions are always welcome.

    xoxo,
    Andy
    --
    <megaphone> Throw down the gun and tiara and come out of the float! </megaphone>