PerlMonks
Re^2: Scaling Hash Limits

by Endless (Beadle)
on Sep 19, 2013 at 13:32 UTC


in reply to Re: Scaling Hash Limits
in thread Scaling Hash Limits

I'm going through a stock of twitter messages. There are around 180 million total, and I want to make sure I don't have duplicates (based on 12-digit id numbers).

Replies are listed 'Best First'.
Re^3: Scaling Hash Limits
by BrowserUk (Patriarch) on Sep 19, 2013 at 15:30 UTC
    12-digit id numbers

    Do you know the approximate lowest and highest numbers?

    If so, you can reduce the memory requirement for de-duping your 180 million ids from ~ 4 gigabytes to ~ 8 megabytes.

    And make it much faster also.
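A minimal sketch of the idea being hinted at here: one bit per possible id, held in a Perl string via vec(), assuming the ids fall in a known contiguous range. The range bounds and sample ids below are invented for illustration; in practice $lo and $hi would come from scanning the data for its minimum and maximum id.

```perl
use strict;
use warnings;

# Bit-vector de-duplication, assuming ids fall in a known range
# [$lo .. $hi).  Memory used is ($hi - $lo)/8 bytes, no matter how
# many ids are processed.  The range here is hypothetical.
my ( $lo, $hi ) = ( 0, 1_000_000 );

my $seen = '';    # a plain Perl string used as a bit vector

sub is_duplicate {
    my ($id) = @_;
    my $off = $id - $lo;
    die "id $id out of range" if $off < 0 or $off >= $hi - $lo;
    return 1 if vec( $seen, $off, 1 );    # bit already set => seen before
    vec( $seen, $off, 1 ) = 1;            # mark as seen
    return 0;
}

my @ids  = ( 42, 7, 42, 999_999, 7, 7 );
my @dups = grep { is_duplicate($_) } @ids;
print scalar(@dups), " duplicates\n";     # prints "3 duplicates"
```

Unlike a hash of scalars, the string grows only as far as the highest bit offset assigned, and testing/setting a bit is a constant-time operation with no per-key bookkeeping.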


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUk, you're making a very interesting point, which nobody seems to have picked up so far. I can see two ways of significantly reducing the memory required for the task at hand. First, we probably don't need to worry about the full 12-digit space if we can narrow the range of IDs actually needed down to something on the order of, say, 200 to 250 million numbers, which seems a reasonable hypothesis if the IDs are allocated sequentially by the system. Second, once we have such a narrower range, we need only one bit per number in it: we simply set a bit when a particular number has been seen. These two observations could drive the memory requirement down to perhaps 30 megabytes, if I can still think clearly this late in the evening, but I can't see how to reduce it to only 8 megabytes. I would be grateful if you could enlighten me on this, and I am sure many others would benefit from these ideas.

      Update: corrected my error of inattention: BrowserUk said 8 megabytes, not 4 megabytes.
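The ~30 megabyte estimate above is straightforward arithmetic — one bit per number across a 250-million-wide range (a quick sanity check, not part of the original post):

```perl
use strict;
use warnings;

# One bit per number across a 250-million-wide id range.
my $bits  = 250_000_000;
my $bytes = $bits / 8;          # 31_250_000 bytes
my $mb    = $bytes / 1024**2;
printf "%.1f MB\n", $mb;        # prints "29.8 MB"
```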

        but I can't see how to reduce this memory requirement to only 4 megabytes.

        I said ~ 8 megabytes not 4.

        In Scaling Hash Limits the OP said: "my simple hash of scalars (for duplicate detection) hits a size of over 55 million"

        55 million / 8 / 1024**2 = 6.55651092529296875 MB.

        He also mentions 180 million: 180e6 / 8 / 1024**2 = 21.457672119140625 MB. But that's before de-duping, which is the purpose of the exercise; it's also possible his list contains no duplicates.

        Of course, looking around I see that twitter uses 64-bit numbers for their user ids. And that is 20 digits, not 12. Then again, they are only just now claiming 500 million users, which is: 500e6 / 8 / 1024**2 = 59.604644775390625 MB, which should be handled with ease by any modern machine.

        Of course, it is also possible that they do not use sequential numbers for their IDs, but rather the 64-bit number is a hash of some aspect of the account -- the name or similar -- in which case the idea won't work because 2**64 / 8 / 1024**3 ~= 2 billion GB.

        But if that were the case, the OP probably wouldn't be talking about "12-digit numbers".

        Of course, the OP also doesn't explicitly mention 'user' ids, just "ids", and given the number -- 180 million, roughly the number of twits per day currently -- these could be message ids, which probably are allocated sequentially?

        Had the OP deigned to answer a couple of questions, much of the uncertainty could have been resolved.

Re^3: Scaling Hash Limits
by AnomalousMonk (Archbishop) on Sep 19, 2013 at 13:36 UTC

    Database.
