in reply to Re^2: memory-efficient hash kind for incremental sort
in thread memory-efficient hash kind for incremental sort

:-)

Well, most of the dictionary entries are names and technical terms. I am building a citation database and need authors and titles. By 300 million I am being optimistic; that is a maximum layout for what I need to cover. Many entries are duplicates... think "George" and "Michael", but also "Constantinopoulos" and "CAPM".

Preferably, I would like to allocate at most 2GB of RAM to the task. I will now look into the solutions that were suggested. Thanks a lot to everyone.

regards, /iaw

Re^4: memory-efficient hash kind for incremental sort
by BrowserUk (Patriarch) on Jan 07, 2009 at 19:07 UTC

    To accumulate 300 million unique terms from 30,000 documents, you'd need to extract (on average) 10,000 from each document.

    Now, I've read a few papers where the citation list was longer than the document itself, but even if every citation contributed ten terms (a title and 9 authors), 10,000 terms per document still means every one of those 30,000 documents would have to reference 1,000 other (unique) documents, in addition to any it had in common with one or more other papers.
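
    For concreteness, here is the back-of-the-envelope arithmetic behind those figures (the 300 million and 30,000 are the numbers from this thread; the ten-terms-per-citation split is the assumption above):

        my $terms_total    = 300_000_000;   # unique terms claimed
        my $documents      = 30_000;        # documents to be scanned
        my $terms_per_doc  = $terms_total / $documents;   # 10,000
        my $terms_per_cite = 1 + 9;         # a title plus nine authors
        printf "%d terms/doc => %d unique citations per document\n",
            $terms_per_doc, $terms_per_doc / $terms_per_cite;   # 1,000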

    I want to allocate 2GB of RAM to the task.

    If every one of those 300 million terms was referenced by just one of the 30,000 papers, you'd need a minimum of 2 bytes (a 16-bit integer) for each value in your hash(-like) data structure. That's 600 MB right there, before considering the keys or data-structure overhead.

    And, in Perl, that integer is going to be stored as at least an SvIV, which is 16 bytes, so your values alone require at least 4.4GB. For a normal Perl hash, you'd need another 40GB for the keys!
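
    You can sanity-check those per-entry costs yourself with Devel::Size from CPAN (a minimal sketch; exact numbers vary with Perl version and 32- vs 64-bit builds):

        use strict;
        use warnings;
        use Devel::Size qw(total_size);

        # Build a sample hash of 100,000 term => count pairs and
        # extrapolate the per-entry cost from its total footprint.
        my %counts;
        $counts{ sprintf 'term%06d', $_ } = 1 for 1 .. 100_000;

        my $bytes = total_size( \%counts );
        printf "total: %d bytes, ~%.0f bytes per entry\n",
            $bytes, $bytes / 100_000;

    On a 64-bit perl this typically comes out well north of 100 bytes per entry, which is how 300 million keys lands in the tens of gigabytes.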

    I think you are way overestimating the number of keys, but even if you reduce that estimate by an order of magnitude, there is no way you will fit the information in a native hash in 2GB (nor in any data structure built on top of one).

    It seems pretty clear that you will need some kind of DB backed by permanent storage.
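
    A tied on-disk hash is probably the path of least resistance. Here is a minimal sketch using DB_File (one option among several; BerkeleyDB or DBM::Deep would work similarly, and the filename here is illustrative):

        use strict;
        use warnings;
        use Fcntl;
        use DB_File;

        # Tie the hash to a B-tree file on disk: only a cache lives in
        # RAM, and DB_BTREE keeps the keys in sorted order.
        tie my %terms, 'DB_File', 'terms.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
            or die "Cannot tie terms.db: $!";

        $terms{'Constantinopoulos'}++;    # counts accumulate on disk

        while ( my ( $term, $count ) = each %terms ) {
            print "$term: $count\n";      # iterated in key order
        }

        untie %terms;

    The trade-off is speed: every access is a disk (or cache) operation rather than a memory lookup, but it stays comfortably inside a 2GB budget.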


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^4: memory-efficient hash kind for incremental sort
by kennethk (Abbot) on Jan 07, 2009 at 17:17 UTC
    Now, citation databases are something I know a bit about. Is there a reason you are not using some of the existing art (e.g. EndNote, Reference Manager, ProCite, ...)? What level of metadata are you trying to maintain? One thing you should certainly be aware of is the problem of author ambiguity: multiple individuals can be cited with the same string, and multiple strings can represent the same individual. That problem is generally not tractable without brute-force effort and significant knowledge of the field. Are you essentially trying to implement automated keyword generation?