memory-efficient hash kind for incremental sort

by iaw4 (Monk)
on Jan 07, 2009 at 00:57 UTC

iaw4 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have an odd need. I want to do an incremental search on words that sit in many files. So, first I form a hash, such as

    michael => "file1.txt"
    mike    => "file1.txt,file30.txt"
    ...

Now I would like to see all keys matching a prefix, such as

    my %mi_results = $myhashlike{ "mi*" };   # wishful syntax

This is not hard if the hash is small: first put all the keys into an array, then do a grep-match on the keys, and then extract the results from %myhashlike.
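
For a small hash, a minimal sketch of that approach might look like this (%myhashlike as in the example above, with a couple of made-up entries added):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Small version of the index: word => comma-separated file list.
    my %myhashlike = (
        michael => "file1.txt",
        mike    => "file1.txt,file30.txt",
        miriam  => "file7.txt",
        nora    => "file2.txt",
    );

    # Grep the key list for the prefix, then slice out the matches.
    my $prefix     = "mi";
    my @hits       = grep { /^\Q$prefix\E/ } keys %myhashlike;
    my %mi_results = map  { $_ => $myhashlike{$_} } @hits;

    print "$_ => $mi_results{$_}\n" for sort keys %mi_results;

The catch is that grep walks every key on every query, which is exactly what stops scaling at hundreds of millions of keys.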

Unfortunately, I may have up to 300 million words (keys) from 30,000 files in my hash.

What's a good solution for this sort of problem? Are there databases that allow regex key searches that would be suitable (esp. if they can cache intelligently)? Any Perl solutions? Is there such a thing as a memory-efficient (say, read-only, squeezed) hash?

Advice appreciated.

/iaw

Replies are listed 'Best First'.
Re: memory-efficient hash kind for incremental sort
by dragonchild (Archbishop) on Jan 07, 2009 at 01:30 UTC
    Look at CouchDB. It's kinda designed for just this problem.
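
    For illustration, a rough sketch of a prefix query against CouchDB's HTTP view API. The database name, design document, and view here are all hypothetical, but the startkey/endkey range is the usual way to express a "mi*" search against a view that emits each word as its key:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use JSON qw(decode_json);
        use URI::Escape qw(uri_escape_utf8);

        # Prefix "mi": ask the view for keys from "mi" up to just past
        # the last possible key that starts with "mi".
        my $prefix   = "mi";
        my $startkey = uri_escape_utf8( qq{"$prefix"} );
        my $endkey   = uri_escape_utf8( qq{"$prefix\x{fff0}"} );

        my $url = "http://localhost:5984/words/_design/idx/_view/by_word"
                . "?startkey=$startkey&endkey=$endkey";

        my $res = LWP::UserAgent->new->get($url);
        die $res->status_line unless $res->is_success;

        for my $row ( @{ decode_json( $res->content )->{rows} } ) {
            print "$row->{key} => $row->{value}\n";
        }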

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Do you have a favourite out of the several CouchDB clients on CPAN?
Re: memory-efficient hash kind for incremental sort
by BrowserUk (Patriarch) on Jan 07, 2009 at 01:48 UTC

      This sort of thing also made me bring up something I learned recently: there are only about 1 million words in the English language. Even if you are supporting several languages, that would not come to 300 million. Could there be something you are not considering when you are creating your key index?

        There are only 1 million words in the english language,

        Whilst true for some definitions of the term "English word", using just 6-character alphabetic "ids", 'AAAAAA' .. 'ZZZZZZ', gives 26**6 = 308,915,776 possibilities.

        And if the key words are (for example) genomic subsequences, then using just ACGT and 14-character subsequences can result in 4**14 = 268,435,456 possibilities.
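
        Both counts are a one-liner to check:

            $ perl -le 'print 26**6'    # 308915776  six-letter alpha ids
            $ perl -le 'print 4**14'    # 268435456  14-base ACGT subsequences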

        That's why I always ask questions when the data examples are so obviously made up. I seriously doubt there are 300 million male first names, even if you take all possible languages into account.

        Well, outside of taking Native American names into consideration. They seemed (according to the movies; I've no idea about the reality of the matter) to use multiple words for names, with no derivation from previous (parental) names. Then again, it's arguable that even if you totalled up every Native American (those capable of speech) in history, they wouldn't total 300 million.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        :-)

        Well, most of the dictionary entries are names and technical terms. I am building a citation database, and need authors and titles. By 300 million, I am being optimistic: it is a maximum layout for what I need to cover. Many are duplicates... think "George" and "Michael", but also "Constantinopoulos" and "CAPM".

        Preferably, I want to allocate 2GB of RAM to the task. I will now check into the solutions that were suggested. Thanks a lot to everyone.

        regards, /iaw

Re: memory-efficient hash kind for incremental sort
by JavaFan (Canon) on Jan 07, 2009 at 01:30 UTC
    You want a 'trie' data structure, a well-known structure from the literature. The Perl regexp engine uses one (in C).

    I'm not aware of a canned CPAN solution, but tries are a pretty simple data structure.
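
    For illustration, a minimal nested-hash sketch of a trie with prefix lookup. Mind that plain Perl hashes are anything but compact, so this shows the structure, not the memory efficiency:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %trie;

        # Descend one hash level per character; store the file list under
        # an empty-string key, which no single character can collide with.
        sub trie_add {
            my ( $word, $files ) = @_;
            my $node = \%trie;
            $node = $node->{$_} ||= {} for split //, $word;
            $node->{''} = $files;
        }

        # Walk down to the prefix node, then collect every word below it.
        sub trie_lookup {
            my ($prefix) = @_;
            my $node = \%trie;
            for ( split //, $prefix ) {
                return unless $node = $node->{$_};
            }
            my %found;
            my @todo = ( [ $prefix, $node ] );
            while ( my $item = pop @todo ) {
                my ( $sofar, $n ) = @$item;
                for my $c ( keys %$n ) {
                    if ( $c eq '' ) { $found{$sofar} = $n->{$c} }
                    else            { push @todo, [ $sofar . $c, $n->{$c} ] }
                }
            }
            return %found;
        }

        trie_add( michael => "file1.txt" );
        trie_add( mike    => "file1.txt,file30.txt" );
        trie_add( nora    => "file2.txt" );

        my %mi = trie_lookup("mi");
        print "$_ => $mi{$_}\n" for sort keys %mi;   # michael, mike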

      A good CPAN solution that I've played with in the past is Tree::Trie. I would discard Data::Trie as it doesn't do prefix lookups.
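
      For reference, a minimal sketch of the Tree::Trie interface, with a plain side hash standing in for the file lists:

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Tree::Trie;

          my %files = (
              michael => "file1.txt",
              mike    => "file1.txt,file30.txt",
              nora    => "file2.txt",
          );

          my $trie = Tree::Trie->new;
          $trie->add( keys %files );

          # In list context, lookup() returns every word that begins
          # with the given prefix.
          for my $word ( sort $trie->lookup("mi") ) {
              print "$word => $files{$word}\n";
          }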

      There's also Text::Trie written by the illustrious ILYAZ. While I find it's good for seriously complex problems, it doesn't make the simple things easy. Lots of make-work code to set up all the necessary callbacks.

      • another intruder with the mooring in the heart of the Perl

Re: memory-efficient hash kind for incremental sort
by derby (Abbot) on Jan 07, 2009 at 11:08 UTC

    What you want is an inverted index engine that allows wildcard querying. I haven't played with any of the ones on CPAN, but I prefer something along the lines of Solr or its base, Lucene (CPAN's KinoSearch is very similar). Those search engines are going to provide you with a lot more than what you've actually requested (that may be a good thing ... or maybe not).
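
    For a feel of what such engines do under the hood: Lucene keeps its term dictionary sorted, so a trailing-wildcard query is just a range scan. A toy Perl sketch of the same idea (input files passed on the command line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Toy inverted index: word => { file => 1, ... }.
        my %postings;
        for my $file (@ARGV) {
            open my $fh, '<', $file or die "$file: $!";
            while (<$fh>) {
                $postings{ lc $_ }{$file} = 1 for /(\w+)/g;
            }
        }

        # Keep the term dictionary sorted, as Lucene-style engines do,
        # so a prefix query becomes a range scan.
        my @terms = sort keys %postings;

        sub prefix_query {
            my ($prefix) = @_;
            # Binary-search for the first term ge the prefix ...
            my ( $lo, $hi ) = ( 0, scalar @terms );
            while ( $lo < $hi ) {
                my $mid = int( ( $lo + $hi ) / 2 );
                if ( $terms[$mid] lt $prefix ) { $lo = $mid + 1 }
                else                           { $hi = $mid }
            }
            # ... then scan forward while terms still start with it.
            my %hits;
            for ( my $i = $lo ; $i < @terms ; $i++ ) {
                last unless index( $terms[$i], $prefix ) == 0;
                $hits{ $terms[$i] } =
                    join ",", sort keys %{ $postings{ $terms[$i] } };
            }
            return %hits;
        }

        my %mi = prefix_query("mi");
        print "$_ => $mi{$_}\n" for sort keys %mi;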

    -derby
Re: memory-efficient hash kind for incremental sort
by sundialsvc4 (Abbot) on Jan 08, 2009 at 14:39 UTC

    I frankly suggest that you do it the old-fashioned way: use a disk-based sorting engine.

    Sorting can very easily remove all duplicates in the process of doing the sort... and there you have it, your intended result.

    You see, the problem with “doing it in memory” is that, in the general case, “memory isn't.” In the general case, memory is a disk drive. Even if you have gobs and gobs of RAM, every one of those page faults costs money ... and your application also becomes a million-pound elephant that's going to be stepping on everything nearby.

    I'd start by perusing CPAN's Sort::External::Cookbook, and specifically also the paper A Fresh Look at Perl Sorting, by Uri Guttman and Larry Rosler, that is referred to therein.

    The advantage of this algorithm (especially when you ask the sort engine to remove the duplicates for you as it goes) is that it is a generalized solution to the problem, and it is known to scale up well. A good sorting engine (and CPAN, natcherly, is full of them...) will use internal sorting for small lists and switch to external sorting at some appropriate point. It usually also does some amount of filtering to remove fairly adjacent duplicated keys from the input. All of this is done for you, and done opaquely.

    What this does is to completely eliminate “searching.” You get a unique copy of all the keys that are present (and identical keys, if you accept them, are always adjacent). You can instantly detect any gaps in the sequence. And it's much faster than you think.
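
    A rough sketch of the feed/sort/fetch cycle with Sort::External; it assumes an earlier tokenizing pass has already produced "word<TAB>file" lines on standard input, and the mem_threshold figure is only a guess to tune:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Sort::External;

        # Feed "word<TAB>file" records; after sorting, duplicates come
        # out adjacent, so merging the file lists per word is one pass.
        my $sortex = Sort::External->new( mem_threshold => 64 * 1024 * 1024 );

        while (<>) {                    # lines like "mike\tfile30.txt"
            chomp;
            $sortex->feed($_);
        }
        $sortex->finish;

        my ( $last_word, @files );
        while ( defined( my $rec = $sortex->fetch ) ) {
            my ( $word, $file ) = split /\t/, $rec, 2;
            if ( defined $last_word and $word ne $last_word ) {
                print "$last_word => ", join( ",", @files ), "\n";
                @files = ();
            }
            $last_word = $word;
            push @files, $file unless grep { $_ eq $file } @files;
        }
        print "$last_word => ", join( ",", @files ), "\n" if defined $last_word;

    A nice side effect: in the sorted output all the "mi" words are adjacent, so the incremental search becomes a range scan over a flat file.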

Re: memory-efficient hash kind for incremental sort
by ruzam (Curate) on Jan 08, 2009 at 01:25 UTC
    Is there any reason why you haven't moved the data into a (pick your flavour) database? Searching data in efficient ways is what SQL engines are really good at.
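
    For instance, with DBI and SQLite (table and column names made up; whether the index actually serves the LIKE depends on SQLite's case-sensitivity settings):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Hypothetical schema: one row per (word, file) pair, with an
        # index on word so a prefix LIKE can use it.
        my $dbh = DBI->connect( "dbi:SQLite:dbname=words.db", "", "",
            { RaiseError => 1 } );

        $dbh->do("CREATE TABLE IF NOT EXISTS words (word TEXT, file TEXT)");
        $dbh->do("CREATE INDEX IF NOT EXISTS words_word ON words (word)");

        # A trailing-wildcard LIKE is a prefix search.
        my $sth = $dbh->prepare(
            "SELECT word, group_concat(file) FROM words
             WHERE word LIKE ? GROUP BY word"
        );
        $sth->execute("mi%");
        while ( my ( $word, $files ) = $sth->fetchrow_array ) {
            print "$word => $files\n";
        }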
