Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Hash key or value string reuse?

by QM (Vicar)
on Jun 15, 2009 at 21:18 UTC ( #771804=perlquestion: print w/ replies, xml ) Need Help??
QM has asked for the wisdom of the Perl Monks concerning the following question:

I've made a quick search, but didn't find what I needed. Please point out relevant references if you got 'em.

I have a very large hash tree structure, with a lot of duplicate strings as either keys or values (depends on the sub tree). I'm using DBM::Deep for persistence, but I also run the script in memory mode, so this issue is equally valid for both.

I'm looking to reduce the overall hash size by, for instance, creating a subhash with the common value strings, and having the values point to the common string sub hash. Does Perl already do something like this for me behind the scenes?

What would I do for common key strings (in subtrees, of course)?

[The hash contains both numeric and string data. String data is sometimes short, like "N/A", or a filename, or 200+ bit bitstring that can't be represented numerically. The bitstrings tend to be similar, and there's a lot of duplicates.]

-QM
--
Quantum Mechanics: The dreams stuff is made of

Comment on Hash key or value string reuse?
Re: Hash key or value string reuse?
by jethro (Monsignor) on Jun 15, 2009 at 23:32 UTC

    A bitstring that can't be represented numerically? Any bitstring can be *represented* numerically. I guess you mean your bitstrings are stored as binary data which can't be printed as is without producing garbage on the screen

    To answer your question, no, perl doesn't do memory optimization for hashes. There might be a module for that, but I doubt the general case is so easy to deal with

    I think an example would have helped. Do the similarities in strings start at the same position in the bitstring? Or do similarities at least start at the same position, but the position varies? If they are not encoded as raw binary data, how are they encoded?

    Generally your problem is to find common subexpressions fast. If a very suboptimal packing is enough you could pack always from the start of a bitstring and always the same length. Then a simple HashOfHashes can use all the common subexpressions as keys and the subtree as values. Even a few such HoHs with different lengths and starting bits would be possible. But then you either have to be careful to always add to and search in them in the same sequence or can't short circuit a search when you found a subexpression and no value.

    You probably found above solution already yourself, but when giving general information you usually get general solutions

      A bitstring that can't be represented numerically?
      I didn't say it well, but I meant that it can't be represented in native machine format, at least without some arbitrary precision extension. I could use vec, but didn't do so as it would require converting from string to vec and back, with the vec version just for storage (no direct bit lookup is needed). So they are stored as strings, which in Perl are arbitrary "precision".

      Perhaps I lead you astray, as I wasn't looking for any encoding or compression-type solution. The number of unique strings isn't so large, but the total number of strings stored in the hash is. So if there's a way to store the actual string once, and either Perl or the code points to it as needed, that would be great.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Hash key or value string reuse?
by Marshall (Prior) on Jun 16, 2009 at 05:02 UTC
    First, "I'm looking to reduce the overall hash size", my question is why? I've worked with some hashes that are like 200MB is size. How big is yours and why do you think you need to reduce its size?

    I will point out that any kind of structure that you'd want, say like a binary-tree or whatever can be implemented in Perl. In no way are you limited to the basic "built-in" data types.

    Of course there will be a trade-off between performance and memory size. In general if you have enough memory, don't make that tradeoff! Of course besides reducing speed, this adds complexity which normally is not good unless you really need it!

    If you have say a whole bunch of objects that you'd like to store more efficiently, then look at ObjectTemplate.

    If you are starting to get "into the guts", then I would recommend "Advanced Perl Programming" by Srinivasan.

    Anyway back to original question: the key and value are both stored. There isn't any super tricky thing that "re-uses" strings via a pointer to that string. If that happened, then it would be necessary to keep track of where that string was used before and then to "split it into 2 strings" if you changed it in one place and not in another. So if you use string "bcd" in one hash and "bcd" in another other hash (maybe a subhash or the first hash) as a key, that string "bcd" is replicated.

    Anyway this is getting into a complex subject and I go back to my question: why does this matter to you?

      How big is yours and why do you think you need to reduce its size?
      I have a hash on disk that exceeds 42GB. I use a hash to avoid duplicates. Walking the hash through DBM::Deep takes about 5x longer than the equivalent memory hash. Of course, I haven't tried it on one this big, so it's much likely worse.

      There isn't any super tricky thing that "re-uses" strings via a pointer to that string.
      Thanks. I was just wondering if Perl already did this for me. Since that's not the case, it might be worth a try.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://771804]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (5)
As of 2014-07-12 08:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (238 votes), past polls