Re: memory-efficient hash kind for incremental sort

by BrowserUk (Patriarch)
on Jan 07, 2009 at 01:48 UTC


in reply to memory-efficient hash kind for incremental sort

What do your 300 million keys look like? That is, what character set do they consist of and what range of lengths?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: memory-efficient hash kind for incremental sort
by Herkum (Parson) on Jan 07, 2009 at 03:13 UTC

    This also brings up something I learned recently: there are only about 1 million words in the English language. Even if you are supporting several languages, that would not come to 300 million. Could there be something you are not considering when you are creating your key index?

      There are only about 1 million words in the English language,

      Whilst true for some definitions of the term "English word", using just 6-character alphabetic "ids", 'AAAAAA' .. 'ZZZZZZ', gives 26**6 = 308,915,776 possibilities.

      And if the key words are (for example) genomic subsequences, then using just the four bases ACGT and 14-character subsequences gives 4**14 = 268,435,456 possibilities.
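
      A quick sanity check of both counts (a throwaway snippet of my own, not part of the original reply):

          use strict;
          use warnings;

          printf "6-char A..Z ids:      %d\n", 26 ** 6;   # 308,915,776
          printf "14-base ACGT subseqs: %d\n",  4 ** 14;  # 268,435,456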

      That's why I always ask questions when the data examples are so obviously made up. I seriously doubt there are 300 million male first names, even if you take all possible languages into account.

      Well, outside of taking Native American names into consideration. They seemed (according to the movies; I've no idea about the reality of the matter) to use multiple words for names, with no derivation from previous (parental) names. Then again, it's arguably possible that even if you totalled up every Native American (those capable of speech) in history, they wouldn't total 300 million.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      :-)

      Well, most of the dictionary entries are names and technical terms. I am building a citation database, and need authors and titles. By 300 million, I am being optimistic: it is an upper bound for what I need to cover. Many are duplicates... think "George" and "Michael", but also "Constantinopoulos" and "CAPM".

      Preferably, I want to allocate 2 GB of RAM to the task. I will now look into the solutions that were suggested. Thanks a lot to everyone.

      regards, /iaw

        To accumulate 300 million unique citations from 30,000 documents, you'd need to extract (on average) 10,000 from each document.

        Now I've read a few papers where the citation list was longer than the document, but even if every citation had a title and 9 authors, that would require every one of those 30,000 documents to reference 1,000 other (unique) documents. (In addition to any that it had in common with one or more other papers.)
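
        That back-of-the-envelope figure is easy to verify (a throwaway calculation of my own, not part of the original reply):

            use strict;
            use warnings;

            my $terms_per_doc  = 300_000_000 / 30_000;   # 10,000 unique terms per document
            my $terms_per_cite = 1 + 9;                  # one title plus nine authors
            my $cites_per_doc  = $terms_per_doc / $terms_per_cite;
            print "$terms_per_doc terms/doc => $cites_per_doc unique citations/doc\n";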

        I want to allocate 2GB of RAM to the task.

        If every one of those 300 million terms was referenced by just one of the 30,000 papers, you'd need a minimum of 2 bytes (a 16-bit integer) for each value in your hash(-like) data structure. That's 600 MB right there, without considering the keys or data-structure overhead.

        And, in Perl, that integer is going to mean at least an SvIV, which is 16 bytes, so your values alone require at least 4.4 GB. For a normal Perl hash, you'd need another 40 GB for the keys!
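
        Those per-entry figures are easy to check empirically. Here is a rough sketch of my own (not from the original reply) using the CPAN module Devel::Size; the exact numbers will vary with perl version and build:

            use strict;
            use warnings;
            use Devel::Size qw(total_size);

            # Build a small sample hash, measure it, and scale up to 300 million entries.
            my %h;
            my $sample = 100_000;
            $h{ sprintf 'key%09d', $_ } = $_ for 1 .. $sample;

            my $per_entry = total_size( \%h ) / $sample;   # bytes per key/value pair
            printf "~%.0f bytes/entry => ~%.1f GB for 300 million entries\n",
                $per_entry, $per_entry * 300_000_000 / 2**30;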

        I think that you are way overestimating the number of keys, but even if you reduce that estimate by an order of magnitude, there is no way you will fit the information into a native hash in 2 GB. (Nor into any data structure built on top of one.)

        It seems pretty clear that you will need to use some kind of DB backed by permanent storage.
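
        For example (purely an illustration of the idea; the reply does not prescribe a particular module), you could tie the hash to an on-disk Berkeley DB B-tree via DB_File, which keeps keys and values in a file rather than in RAM and also iterates keys back in sorted order:

            use strict;
            use warnings;
            use DB_File;
            use Fcntl;

            # Keys and values live in citations.db rather than in memory; a B-tree
            # hands keys back in lexical order, which suits an incremental sort.
            tie my %index, 'DB_File', 'citations.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
                or die "Cannot tie citations.db: $!";

            $index{$_}++ for qw( George Michael Constantinopoulos CAPM );

            print "$_ => $index{$_}\n" for keys %index;    # already sorted

            untie %index;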


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Now, citation databases are something I know a bit about. Is there a reason you are not using some of the existing art (e.g. EndNote, Reference Manager, ProCite ...)? What level of metadata are you trying to maintain? One thing you should certainly be aware of is the problem of author ambiguity: multiple individuals can be cited with the same string, and multiple strings can represent the same individual. That problem is not generally tractable without brute-force effort and significant knowledge of the field. Are you essentially trying to implement automated keyword generation?
