Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

A memory efficient hash, trading off speed - does it already exist?

by JPaul (Hermit)
on Feb 03, 2003 at 19:54 UTC ( #232380=perlquestion: print w/ replies, xml ) Need Help??
JPaul has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all;

As of late I've been fiddling with a rather simple AI chatterbot. Nothing remarkably clever, learns from input, generates sentences from keywords, so on, so forth. This particular bot has been around for a couple of months now and has managed to learn a great many things.
Recently I queried why it was taking up so much memory. Now I have a better understanding, and as its grown, I realise that I most certainly do need to attempt to use something different for storing the data in memory.

The data consists of two large hash of hashes, linking a keyword to potential following symbols and a weighting count. On disk, uncompressed, this takes up some 6.5M for some 120,000 possible word combinations. In memory this takes up 70M. Thats a big difference.
My understand is that perl, in its infinite and speedy wisdom, preallocates memory when it generates the hash, so when you add elements, you aren't constantly reallocating memory. As the word base has grown, it's become apparent that learning slows as he sees fewer and fewer things that he doesn't already know. Logical. This means he infrequently is adding new things to his data set. More importantly, of the 120,000 hashes, some 90% of them have less than 4 entries... Thats a lot of wasted preallocated memory.
So what I'm wondering is, is there an existing hash structure (That I can use, ala tie()) that doesn't preallocate? This way I'm only allocating memory that I actually need? I can stand the tradeoff in speed, right about now.

If this doesn't already exist, what suggestions do you have for building my own? (I'd figure I'd rip-off the existing built-in perl code for doing hashes and just have it not preallocate)...

My thanks
JP Hindin
-- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

Comment on A memory efficient hash, trading off speed - does it already exist?
Re: A memory efficient hash, trading off speed - does it already exist?
by perrin (Chancellor) on Feb 03, 2003 at 20:14 UTC
    It does exist. It's called a dbm file. Check out SDBM_File (fast, but only supports short hash values) or DB_File (slower, but supports unlimited hash values).
      A superlative idea... However, I didn't make my intentions apparent enough - my fault.
      At the moment the system uses such a large amount of memory that portions of the data set are paged off to disk. This slowdown (While, admittedly, being somewhat slower than simply pulling everything from disk, there is a comparison to be drawn with disk based data access) makes the project unuseable due to having to pull data off disk.

      I was willing to trade off speed in adding data to the hash, but not in doing an actual hash lookup.
      What I'm asking is quite likely unreasonable, but then again, thats why I put it to all you fine and smart folks - to show me where the problems in my plans lay.

      JP,
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        You could try DB_File with it's in-memory option. It will store the keys as C strings, so they may be more compact.
        Have you actually tried tieing your hashes to a DBM file? You may find the speed hit is not anywhere near as bad as you think.

        I have DBM files with many millions of keys, and fetches are still lightning fast.

        Considering it's probably about 4 lines worth of code to try, give it a shot :-)

        I don't think it's even worth trying to draw conclusions about how fast a DBM file access will be by looking at the slowdown of a swapping process. Apples and Oranges.

Re: A memory efficient hash, trading off speed - does it already exist?
by Elian (Parson) on Feb 03, 2003 at 20:23 UTC
    You may find that a not-insignificant amount of the memory's being taken up by internal structure rather than extra preallocated space. SVs aren't small, and you've got a lot of pointers and pointers to pointers there. Not to mention hash key structs which are 8 bytes plus the key string, per key...

    There's not much you're going to do to reduce in-memory costs, I think, without a lot of extra speed hit. Use a DB, as hasn been suggested--Berkeley DB or even a plain NDBM/GDBM file will probably serve you well here.

      To reply to myself here... There's now Packed::Array for packed arrays. Not a hash, mind, but it might be useful to you in some circumstances. (Doing a packed hash would be rather more difficult, and haved much less of a memory win than packed arrays do)
Re: A memory efficient hash, trading off speed - does it already exist?
by steves (Curate) on Feb 03, 2003 at 22:33 UTC

    Getting back to your basic question: Can you avoid hash preallocation? Well, not really ... the hash has to be allocated to something. But you can allocate your own buckets by assigning to the keys function like this:

    keys(%hash) = 512;
    There are caveats -- Perl will round up to a power of 2 if needed.

    As far as I know, Perl maintains the initial hash allocation and does not reallocate along the way, but I may be wrong about that.

    There is no way to really use "just what you need" in a hash as you suggest ... The design of a hash guarantees that there will almost always be some unused slots unless you have a perfect hashing algorithm. That would be called an array. 8-)

Re: A memory efficient hash, trading off speed - does it already exist?
by bart (Canon) on Feb 03, 2003 at 22:59 UTC
    It sounds like you might want to use a tied hash, storing most of the data only on disk, most of the time. Another reply mentioned DB_File/DBM_File, and that is basically it. One problem is that tied hashes need to be, basically, flat datastructures. In English, this means that "you can't just throw a HoH at it, and expect it to just work" (sic). You need to turn the subhashes into scalars, and vice versa. Thus, look at MLDBM/Tie::MLDBM, which will take care of this nitty gritty for you.

    In case you're really after a way to keep the data in memory (what do you mean, so you don't need persistent data? ;-), the same concept might save you some memory: keep a plain flat hash as your main data storage, of which the values are frozen records, which you must first thaw (to use the Storable terminology) into a plain hash in order to access the deeper underlying data. That's right, you can use Storable for that.

    Somehow, I've got the gut feeling that this would imply reinventing MLDBM.

Re: A memory efficient hash, trading off speed - does it already exist?
by BrowserUk (Pope) on Feb 03, 2003 at 23:13 UTC

    Take a look at Tie::SubstrHash. I think that it's a part of the standard distribution. It's somewhat slower and a bit clumbsy to use, but you do get to control the memory it uses. How much it will save you will depend on your application.

    One tip. You have to specify size (in bytes) for the keys, the values and the total table size. This often isn't a problem for the keys but can make it awkward and less memory efficient than it might be if most of your values are short but you need to pre-allocate extra length to accomodate the occasional long value. You can work around this limitation by storing references to scalars instead of the scalar itself. This means you only need to allocate 4-byte per value, but that the real value can be any length. Of course it adds a dereference to the equation. The other benefit of this is that by using references, you don't have to pad the value before assigning.

    You will have to pad the keys to the pre-specified length though. Quite why this isn't done for you is not clear to me. However, it was a simple change to make to my copy of the module. I also made it so that over-long keys are automatically truncated. Slightly dangerous, but provided you ensure that the pre-allocated size is sufficient for your keys not to clash, it greatly simplifies using the module.

    Caveat implementor should you decide to make this change yourself.

    If the module works for you, it should be an easy adaption to your present code.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      BrowserUk suggests that Tie::SubstrHash be used as a potential solution...

      I recommend that this solution not be considered. tie()'d objects have overhead. In the cheapest case, we would be replacing a small basic hash with less than 4 entries with a tied hash, and an attached blessed string reference. No gain would be realized, and performance would suffer.

        tie()'d objects have overhead

        True, but the OP said that performance was not a concern.

        ... would be replacing a small basic hash with less than 4 entries with a tied hash, and an attached blessed string reference. No gain would be realized ...

        My thought was to replace the 120_000 element hash with a Tie::SubstrHash which does make a considerable saving in space. However, as you can't use a Tie::SubstrHash to store references, that throws the idea in the bin.

        I have an incomplete and broken version of the standard module that attempts to work around this particular restriction, but in the absence of the abilty to build XS/Inline-C, I am unable to complete that. There's no guarentee I could bring the idea to fruition anyway.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: A memory efficient hash, trading off speed - does it already exist?
by Abigail-II (Bishop) on Feb 03, 2003 at 23:24 UTC
    Perl preallocates hashes with 8 buckets. That takes some space, but that's just 32 bytes. So, if you have 4 elements in the hash, you would be wasting 16 bytes on most hashes, 20 bytes on a few, and maybe 24 bytes on a handful of hashes. This may look a lot, but compared to the minimum amount of memory a hash takes, this isn't that much.

    perl -MDevel::Size -wle 'print Devel::Size::size ({})' 100

    That's 100 bytes for a reference to an empty hash. And the reference isn't taking the majority of the memory!

    If you want to save on your memory usage, you are way better off cutting down on your 120k hashes.

    Abigail

Re: A memory efficient hash, trading off speed - does it already exist?
by dragonchild (Archbishop) on Feb 03, 2003 at 23:34 UTC
    Abigail-II mentioned that you should reduce the number of hashes you are using. Another option is to use arrays instead of hashes. You can do some funkiness under the hood to try and maximize each array (as an array pre-allocates to like 50 elements at a time).

    One option is to create an object that uses an array as its implementation. I have a version and I'm sure that CPAN does, as well. You can go ahead and make it one big array for all the objects, having the class keep track of where each object starts in the super-array. There's a bit of funkiness to it, but it's definitely do-able. (/msg me if you want more info)

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      I tried this earlier on when it became apparent the project would be sucking up great quantities of memory - with very little success. In fact, I think it changed the memory usage by less than a few hundred kilobytes...
      This was some time ago, the data set has changed greatly. Perhaps I should try this again and would benefit more from it with a more diverse data set.

      I don't even begin to understand that second paragraph. It amazes me how much I can do in perl, and how remarkably little I truly understand about the workings of it all. How embaressing.

      My thanks,
      JP
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        Let's say you know you will be making a hash of 120,000 hashes, each with (up to) 2 values. Instead of making 120,000 tiny hashes, with the overhead of each hash, you can go ahead and make 120,000 blessed scalar-refs and one huge array of 360,000 scalars. Each object will be a blessed scalar. Something like (untested!):
        package CoolObject; my @objects; my $next_index = 0; sub new { my $class = shift; my $x = $next_index; my $self = bless \$x, $class; $next_index += 3; return $self; } sub name { my $self = shift; $object[$$self] = shift if @_; return $object[$$self]; } sub attr1 { my $self = shift; $object[$$self + 1] = shift if @_; return $object[$$self + 1]; } sub attr2 { my $self = shift; $object[$$self + 2] = shift if @_; return $object[$$self + 2]; } 1;
        Those having read the Panther book will recognize this. Some obvious improvements would include:
        1. Adding the kind of garbage collection he had to re-use entries. (This would be in DESTROY.)
        2. Initialization of attributes. (This would be in new)
        3. Making it into more of a base class that will take attributes. (This would involve adding some sort of attribute registration.)
        4. Adding attribute inheritance. (This would involve attribute registration that walks the ISA-tree.)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: A memory efficient hash, trading off speed - does it already exist?
by MarkM (Curate) on Feb 04, 2003 at 00:04 UTC

    Pre-allocation is not your problem. Consider that in a hash with 4 entries, the entire wastage is probably no more than 16 bytes per hash. 16 bytes X 120,000 words = 187.5 Kbytes. 187.5 Kbytes is nowhere near being a significant percentage of 70 Mbytes. What you are observing is the overhead required for each hash, each string, and even, each integer.

    Definately, as other people have suggested, moving the data to disk will deal with your memory problem. However, it does not deal with the size of the data itself. Even the most space-efficient *DBM_File modules are not able to compact the data on disk more than a few small percentages tighter than Perl packs the data in memory, and there will be a significant performance hit.

    I spent some time coming up with an alternative that stored the word=>score hash as a string of encoded 32-bit integer pairs (id,score), however, I finally decided that this is probably not a good approach unless one knows exactly what one is doing, and the best way to show that, is to implement it. I leave it to you to decide whether research in this area is desirable. Basically, you flatten the hash of hashes into 3 hashes of strings. The idscores string could encode the pairs in a manner that could be searched using a binary search, or perhaps a linear search is adequate. Ideas, ideas...

      Can you explain to me why the structures that store the data take over 10 times the size of the actual data itself? I believe you, I just don't understand why.
      What does storing the data in this format provide? Speedy STOREs and FETCHes? Is that it?

      I have a friend busily telling me that if this was C he could simply construct a linked-list struct and wander across the structs, which would undoubtedly take up less memory. While I'm sure thats not as magically efficient as he thinks, and I know the lookups would be phenomenally slow, its hard to convince him :)

      Thanks,
      JP
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        The features that Perl data structures provide over minimal C data structures include at least the following:

        1. Reference counting to ensure prompt and reliable garbage collection. Once the reference count hits 0, the structure can be reclaimed.
        2. Data structures provide storage for multiple form values. For instance, if a string is used as an integer, and as a number, further accesses of the string as either a string, an integer or a number do not require for the string to be converted each time.
        3. In-place upgrading of types. For instance, if a type is blessed, or if a hash is tied, or if an integer becomes a string, all references (active references, not the same as Perl scalar references) remain valid. For comparison, consider the C linked list your friend mentions. Now change the list to a b-tree. Will all data structures that referred to the list now be ok referring to the b-tree? What if the b-tree data structure is bigger than the list data structure and requires being moved?

        Another thing to explain to your friend is that the HASH structure is a form of indexed linked list. First, the key is hashed into an integer form, then the integer is used to produce an index into the array of linked lists.

        The overhead for a standard Perl string is at least 24 bytes (on a 32-bit architecture). This is each of: a reference count (4 bytes), flags (4 bytes), a reference to a more specific data structure (4 bytes), a reference to the string content (4 bytes), the length of the string (4 bytes), the space allocated for the string as it is likely less than the length (4 bytes).

        The overhead for a standard Perl hash is at least: (again, for a 32-bit architecture)

        60 bytes + ((20 + length of average key) * number_of_keys)) + (4 * size of array used within hash) + (24 * length of average value, assuming all string values)

        When I referred to the amount you would save for not pre-allocating hash buckets, I was referring to "4 * size of array" number above, which is almost insignificant compared to the rest of the overhead. The above only describes a hash mapping strings to strings. In your case, you are mapping a string to a scalar reference to a hash that maps strings to numbers.

        Your specific case is pushing for title of 'application that best makes Perl look like a pig.' Usually, Perl is able to achieve significant performance gains by using more memory. In your case, the sheer size of the data structures likely results in the opposite effect.

Re: A memory efficient hash, trading off speed - does it already exist?
by Wysardry (Pilgrim) on Feb 04, 2003 at 01:23 UTC

    This may be a really dumb idea, but I've used a similar method before (in BASIC) to save memory space, back when 64Kb was considered huge.

    Couldn't you store the key/value pairs back to back in a single string, and then pull them out with a regex?

    The format could be something like "key1:value1|key2:value2|key3:value3|" allowing you to search for the key, and if found grab it and everything up to the next | character, then split on the : character to get the corresponding value.

    You could use whatever separators you wished, but there would need to be two different ones for this to work and to make it more readable for humans.

    Unfortunately, I'm not confident enough with the syntax of regular expressions (yet) to give an example, but from the sounds of it you know enough to work that out for yourself and to tell if I'm spouting complete nonsense.

    __________
    "Every program has at least one bug and can be shortened by at least one instruction -- from which, by induction, one can deduce that every program can be reduced to one instruction which doesn't work." -- (Author Unknown)

      Well, but then you run into the problem of if you want to use a colon or a pipe in one of your keys or values. I would immagine it would also be quite a bit slower for large amounts of key/value pairs than using a real hash.

        That's why I said that any separating characters could be used, meaning that ones that didn't clash could be picked. As a Perl program would most likely be used to convert the existing hashes, characters not normally directly accessible on the keyboard (above ASCII 127) could be used.

        The original poster also stated that speed wasn't an issue.

        It wouldn't be practical for every application, but might work in this instance. A similar method was used by several text adventure games on machines with limited memory or no array facilities.

      Indeed I am also considering something similar to this, although I'm not entirely sure if it would work quite as I hope.

      The data is constructed like so:

      %hash{Keyword} -> {Symbol_1} = Weighting %hash{Keyword} -> {Symbol_2} = Weighting
      And so on. I think perhaps I could slim down memory considerably by doing something like the following:
      %hash{Keyword} = "Symbol_1 | Weighting, Symbol_2 | Weighting"
      And, as Mr_Person has already noted, no I don't intend to use | and , as separators. Already when writing data to the disk as a "brainsave" I use \x255 as a record separator, just like you were thinking. Perhaps the fact that I started programming in BASIC as well might have something to do with our shared mindset? :)

      JP,
      -- Alexander Widdlemouse undid his bellybutton and his bum dropped off --

        If anything, you should probably switch to a structure like
        $hash{"Keyword\0Symbol_1"} = $Weighting[1]; $hash{"Keyword\0Symbol_2"} = $Weighting[2];

        That way you still get full hash lookup performance on the entire (two dimensional) key. And you avoid creating whole legions of hashes. That should pretty much fix 90% of your problem. And such a flat hash is trivial to tie to a DBM file.

        Don't knock DBMs without benchmarking them. DBM files are specifically laid out to allow for rapid retrieval - paging just indiscriminately swaps out whatever seems least necessary, including other parts of the system besides the data. Drawing any conclusions about one based on the other borders on ridiculous.

        Makeshifts last the longest.

Re: A memory efficient hash, trading off speed - does it already exist?
by toma (Vicar) on Feb 04, 2003 at 06:26 UTC
    Check out the soon-to-be-renamed Table-ParentChild by CPAN author MIKEWONG. It will store the type of relationship that you need in a more compact way. To use it, you would create an index for each string to be stored, and then store the relationships using a pair of indices.

    The 120,000 combinations should take about 16 bytes each, resulting in under 2MB of storage. Additional memory will be required for the word<->index lookup data structure.

    This code is fast, but not quite as fast as a hash. It uses much less memory. It works by using a linked list in C, which is written as an XS module.

    If it looks interesting, you could ask Mike to upload his latest version to CPAN. In the existing version, the relationship is stored as an integer. In the newer version, the relationship can optionally be a perl scalar. If you turn on the scalar option, it uses more more memory, of course.

    Update:Based on measurements from MarkM it looks like I may be optimistic by a factor of two (or more, depending on your data) in my memory estimates.

    It should work perfectly the first time! - toma

Re: A memory efficient hash, trading off speed - does it already exist?
by mce (Curate) on Feb 04, 2003 at 15:33 UTC
    Hi,

    I am not an expert on memory mgmt, but I know there is such a thing as pseudo hashes, which are in fact just arrays, where the first item is a hash that performs the indexing.

    They are not that flexible as normal hashes, but I think they have a large memory win against normal hashes (although I haven't tested this).

    It is worth checking it out, espacially with the fields pragma.
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    IT Masters, Belgium

      Pseudo hashes are, as indicated by the adjective (?) 'pseudo', not really hashes.

      Pseudo hashes are a way of telling the compiler how to map keys to arrays, and translate all references to such keys, hard-coded (i think), into array indices. This allows easier building of staticly defined key sets for objects and configuration structures, etc, via the hash interface. The actual implementation is on an array, yielding lower memory requirements, and a speed increase. It's an array with a compile time overhead.

      -nuffin
      zz zZ Z Z #!perl
Re: A memory efficient hash, trading off speed - does it already exist?
by nothingmuch (Priest) on Feb 05, 2003 at 14:40 UTC
    I'm surprised caching has not been mentioned...

    Vocabularies have high occuring words, and low occuring words. It's safe to say that should you keep the 5,000-10,000 in memory, reading the rest of the words from disk as needed you should barely have a noticeable speed decrease.

    The way to implement such caching is called Least Recently Used expiration. You could use a cache on top of one of the before mentioned DBM modules. DB_File is my personal favourite, as it is flexible. BTrees are a bit more space efficient, but have a search time of around logx * N (N is the number of keys, x is the search time in a node), while hashes, taking up more space at times, have a typical search time of O(1), but may go as far as O(N). I would reccomend useing one of the modules in the Cache:: namespace on cpan, or searching for LRU, and layering a cache on a DB_File BTree.

    I have made a simple and relatively fast caching layer that accept a hash reference (tied hashes, i guess) and a limit for arguments, and maintains an O(1) expiration && storage time, albeit not necessarily filling the whole of the limit set on the memory hash. If you would like to view the source code I would be more than happy to share.

    Update: I forgot to mention that DB_File needs to be layered under MLDBM, and since freezing and thawing, aswell as searching, will cost time, a smaller set of hashes of hashes, cached, will yield reasonable time, with an alternative to paging.

    -nuffin
    zz zZ Z Z #!perl

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://232380]
Approved by traveler
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2014-10-22 01:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (112 votes), past polls