Re: Bloom::Filter Usage

by demerphq (Chancellor)
on Apr 20, 2004 at 18:31 UTC [id://346743]


in reply to Bloom::Filter Usage

Divide and conquer. You say you have two fields that form the key, one of which is unique. Take the right-hand digit of that number and sort your records into 10 files by that digit. (Insert hand-waving about it probably working out that the output files end up roughly equal in size.) Now do your dupe checks on the resulting files. The thing to remember about Perl hashes is that they grow in powers of two, that is, they double whenever they get too small. So divide your file enough that each chunk stays within reasonable bounds; dividing by 10 has worked for me with equivalent-sized data loads.
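
A rough sketch of what that two-pass split-then-check could look like, assuming tab-separated records with the unique account number in the first field (file names and record layout are illustrative, not from the original post):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pass 1: split records.txt into ten bucket files keyed on the
    # last digit of the (assumed) unique account field.
    my @bucket;
    for my $d (0 .. 9) {
        open $bucket[$d], '>', "bucket$d.txt" or die "bucket$d.txt: $!";
    }
    open my $in, '<', 'records.txt' or die "records.txt: $!";
    while (my $line = <$in>) {
        my ($account) = split /\t/, $line;
        my ($digit) = $account =~ /(\d)\s*$/ or next;
        print { $bucket[$digit] } $line;
    }
    close $_ for $in, @bucket;

    # Pass 2: dupe-check each bucket with an ordinary in-memory hash,
    # which now only has to hold roughly a tenth of the keys at a time.
    for my $d (0 .. 9) {
        my %seen;
        open my $fh, '<', "bucket$d.txt" or die "bucket$d.txt: $!";
        while (my $line = <$fh>) {
            my ($account) = split /\t/, $line;
            print "dupe: $account\n" if $seen{$account}++;
        }
        close $fh;
    }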

There are other approaches to this, like using DB_File or some kind of RDBMS, but I actually think you will have a simpler and probably more efficient system overall if you just use some approach that scales the data down. Splitting data into bite-sized chunks is an ancient and honorable programming tradition. :-)
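
For comparison, the DB_File route amounts to a tied hash living on disk; a minimal sketch, assuming the same made-up record layout as above:

    use strict;
    use warnings;
    use Fcntl;      # for O_CREAT and O_RDWR
    use DB_File;

    # Tie the "seen" hash to an on-disk Berkeley DB file so it is not
    # limited by available memory.
    my %seen;
    tie %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    open my $in, '<', 'records.txt' or die "records.txt: $!";
    while (my $line = <$in>) {
        my ($account) = split /\t/, $line;
        print "dupe: $account\n" if $seen{$account}++;
    }
    close $in;
    untie %seen;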

Oh, another approach is to use a trie of some sort. If your accounts are dense, then overall it can be a big winner in terms of space, and it is very efficient for lookups.
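
A toy illustration of the trie idea, using nested hashes with one level per digit (names and input format are made up; a plain hash-of-hashes like this only demonstrates the lookup, while the space win mentioned above would need a packed representation of a dense key range):

    use strict;
    use warnings;

    # One hash level per digit of the account number; the 'end' slot
    # marks that a complete key terminates at this node.
    my %trie;

    sub seen_before {
        my ($account) = @_;
        my $node = \%trie;
        for my $digit (split //, $account) {
            $node = $node->{$digit} ||= {};
        }
        return $node->{end}++;    # true if this key was already stored
    }

    while (my $line = <STDIN>) {
        my ($account) = split /\t/, $line;
        print "dupe: $account\n" if seen_before($account);
    }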


---
demerphq

    First they ignore you, then they laugh at you, then they fight you, then you win.
    -- Gandhi


Replies are listed 'Best First'.
Re(2): Bloom::Filter Usage
by bart (Canon) on Apr 20, 2004 at 19:04 UTC
    There are other approaches to this, like using DB_File or some kind of RDBMS, but I actually think you will have a simpler and probably more efficient system overall if you just use some approach that scales the data down. Splitting data into bite-sized chunks is an ancient and honorable programming tradition. :-)
    ... and it's exactly how DBM files (and therefore indexes for RDBMSes too) do their job. What else do you think binary search, B-trees, B+ trees etc. are, but ways of splitting up the data into ever smaller chunks?

    I don't think you'll be able to beat these kinds of databases at their own game.

      Er, maybe you mean this in a way I misunderstand, but the algorithms you are talking about don't split the data up. They apply an order to the data, sure, and they deal with records more or less singly, but they don't split the data up.

      As for the other point, well, I guess unless somebody bothers to benchmark it we won't know. :-) Unfortunately, right now I don't have the time.


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi

