Beefy Boxes and Bandwidth Generously Provided by pair Networks DiBona
No such thing as a small change

memory-efficient hash kind for incremental sort

by iaw4 (Scribe)
on Jan 07, 2009 at 00:57 UTC ( #734532=perlquestion: print w/ replies, xml ) Need Help??
iaw4 has asked for the wisdom of the Perl Monks concerning the following question:

dear monks---I have an odd need. I want to do an incremental search on words that sit in many files. so, first I form a hash, such as
michael => "file1.txt" mike => "file1.txt,file30.txt" ...
now, I would like to see all keys matching a subset, such as
my %mi_results = $myhashlike{ "mi*" };
this is not hard if the hash is small. first, put all the keys into an array, then do a grep-match on the keys, and then extract the results from %myhashlike.

unfortunately, I may have up to 300 million words (keys) from 30,000 files in my hash.

what's a good solution for this sort of problem? are there data bases that allow regex key searches that would be suitable (esp. if they can cache intelligently)? any perl solutions? is there such a thing as a memory-efficient (say, read-only squeezed) hash?

advice appreciated.


Comment on memory-efficient hash kind for incremental sort
Select or Download Code
Re: memory-efficient hash kind for incremental sort
by dragonchild (Archbishop) on Jan 07, 2009 at 01:30 UTC
    Look at CouchDB. It's kinda designed for just this problem.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Do you have a favourite out of the several CouchDB clients on CPAN ?
Re: memory-efficient hash kind for incremental sort
by JavaFan (Canon) on Jan 07, 2009 at 01:30 UTC
    You want a 'trie' datastructure - a well known datastructure from the literature. The Perl regexp engine uses one (in C).

    I'm not aware of a canned CPAN solution, but tries are a pretty simple datastructure.

      A good CPAN solution that I've played with in the past is Tree::Trie. I would discard Data::Trie as it doesn't do prefix lookups.

      There's also Text::Trie written by the illustrious ILYAZ. While I find it's good for seriously complex problems, it doesn't make the simple things easy. Lots of make-work code to set up all the necessary callbacks.

      • another intruder with the mooring in the heart of the Perl

Re: memory-efficient hash kind for incremental sort
by BrowserUk (Pope) on Jan 07, 2009 at 01:48 UTC

      This sort of thing also made me bring up something I learn recently. There are only 1 million words in the english language, even if you are supporting several languages that would not be 300 million. Could there be something you are not considering when you are creating you key index?

        There are only 1 million words in the english language,

        Whilst true for some definitions of the term "english word", using just 6 character alpha "ids", 'AAAAAA' .. 'ZZZZZZ' gives 26**6 = 308,915,776 possibilities.

        And if the key words are (for example) genomic subsequences, the using just ACGT and 14 character subsequences can result in 268,435,456 possibilities.

        That's why I always ask questions when the data examples are so obviously made up. I seriously doubt there are 300 million male first names, even if you take all possible languages into account.

        Well, outside of taking native american names into consideration. They seemed to (according to the movies; I've no idea about the reality of the matter), use multiple words for names with no derivation from previous (parental) names. Then again, it's arguably possible that even if you totalled up every native american (those capable of speech) in history, they wouldn't total 300 million?

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        well, most of the dictionary are names and technical terms. I am building a citation data base, and need authors and titles. by 300 million, I am optimistic---it is a maximum layout for what I need to cover. many are duplicates...think "George" and "Michael" but also "Constantinopoulos" and "CAPM".

        preferably, I want to allocate 2GB of RAM to the task. I now will check into the solutions that were suggested. Thanks a lot to everyone.

        regards, /iaw

Re: memory-efficient hash kind for incremental sort
by derby (Abbot) on Jan 07, 2009 at 11:08 UTC

    What you want is an inverted index engine that allows wildcard querying. I haven't played with any of the ones on cpan but prefer something along the lines of solr or it's base lucene (cpan's Kinosearch is very similar). Those search engines are going to provide with a lot more than what you've actually requested (that may be a good thing ... or maybe not).

Re: memory-efficient hash kind for incremental sort
by ruzam (Curate) on Jan 08, 2009 at 01:25 UTC
    Is there any reason why you haven't moved the data into a (pick your flavour) database? Searching data in efficient ways is what SQL engines are really good at.
Re: memory-efficient hash kind for incremental sort
by sundialsvc4 (Monsignor) on Jan 08, 2009 at 14:39 UTC

    I frankly suggest that you do it the old-fashioned way: use a disk-based sorting engine.

    Sorting can very-easily remove all duplicates in the process of doing the sort... and there you have it, your intended result.

    You see, the problem with “doing it in memory” is that, in the general case, “memory isn't.” In the general case, memory is a disk-drive. Even if you have gobs and gobs of RAM, every one of those page-faults cost money ... and your application also becomes a million-pound elephant that's going to be stepping-on everything nearby.

    I'd start by perusing CPAN's Sort::External::Cookbook and specifically also the paper, A Fresh Look at Perl Sorting, by Uri Guttman and Larry Rosler, that is referred-to therein.

    The advantage of this algorithm (especially when you ask the sort-engine to remove the duplicates for you as it goes), is that it is a generalized solution to the problem, and it is known to scale-up well. A good sorting engine (and CPAN, natcherly, is full of them...) will use internal sorting for small-sized lists and switch to external-sorting at some appropriate time. It usually also does some amount of filtering to remove fairly-adjacent duplicated keys from the input. All of this being done for you, and done opaquely.

    What this does is to completely eliminate “searching.” You get a unique copy of all the keys that are present (and identical keys, if you accept them, are always adjacent). You can instantly detect any gaps in the sequence. And it's much faster than you think.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://734532]
Approved by planetscape
Front-paged by Arunbear
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2014-04-19 13:06 GMT
Find Nodes?
    Voting Booth?

    April first is:

    Results (480 votes), past polls