PerlMonks  

Re: Re: Bloom::Filter Usage

by jreades (Friar)
on Apr 22, 2004 at 14:44 UTC (#347375)


in reply to Re: Bloom::Filter Usage
in thread Bloom::Filter Usage

This is exactly the approach that I'm going to have to take -- and it fits rather well with the logic of the process (I'm dealing with an ETL (Extract, Transform, Load) system and there are multiple stages to each job).

FamousLongAgo very kindly sent me the updated v0.2 code to try out, and it definitely works. Unfortunately, the way that I'm trying to use it doesn't, because the number of duplicates is very low and the population very large. As a result I end up with a massive bitmap (431329181 bits) and ten hashing functions. If I knew more about hashing functions, I might have been able to speed up the hashing (as was suggested by others in this thread) by optimising it for numeric keys of a preset size (12 bytes).

As it stood, however, the overhead slowed the filter to the point where it took five to ten seconds per key!
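
For what it's worth, the bitmap size and hash count quoted above are consistent with the usual Bloom filter false-positive approximation, if one assumes a capacity of n = 30,000,000 keys and a target error rate of p = 0.001. Neither value is stated in this post; both are assumptions chosen to illustrate the arithmetic, and this is not a claim about how Bloom::Filter sizes its bitmap internally.

```latex
% Assumed for illustration: n = 3 \times 10^7 keys, p = 10^{-3}, k = 10 hash functions.
\begin{align*}
p &\approx \left(1 - e^{-kn/m}\right)^{k}
  \quad\Longrightarrow\quad
  m \approx \frac{-kn}{\ln\!\left(1 - p^{1/k}\right)} \\
p^{1/k} &= 10^{-0.3} \approx 0.5012, \qquad \ln(1 - 0.5012) \approx -0.6955 \\
m &\approx \frac{10 \times 3 \times 10^{7}}{0.6955} \approx 4.31 \times 10^{8}\ \text{bits}
\end{align*}
```

Under those assumptions that is roughly 14.4 bits per key, or about 54 MB of bitmap, and every insert or lookup has to compute ten hashes and touch ten scattered bit positions.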

I really like the approach of using uniq -d and can only wish that it had occurred to me a couple of days ago, since I could have skipped the banging of hand to head that just happened. There's enough memory and swap space to support sorting 30 million records (this machine even has SyncSort installed).
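
A minimal sketch of the sort-then-uniq -d approach, assuming the keys have already been extracted one per line. The sample keys and the temporary file are made up for illustration and stand in for the real 30-million-record extract.

```shell
#!/bin/sh
# Hypothetical sample data: fixed-width 12-digit keys, one per line.
tmp=$(mktemp)
printf '%s\n' \
    000000000042 000000000007 000000000042 \
    000000000099 000000000007 000000000042 > "$tmp"

# sort places identical keys on adjacent lines;
# uniq -d then prints one copy of each line that occurs more than once.
# LC_ALL=C forces plain byte-wise comparison, which is enough for
# fixed-width numeric keys.
dups=$(LC_ALL=C sort "$tmp" | uniq -d)

printf '%s\n' "$dups"
rm -f "$tmp"
```

On the real data the same pipeline would read the extract directly, for example `LC_ALL=C sort keys.txt | uniq -d > dups.txt`, where both file names are placeholders.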

Thank you everyone for your helpful tips and suggestions.

Replies are listed 'Best First'.
Re: Re: Re: Bloom::Filter Usage
by BrowserUk (Pope) on Apr 22, 2004 at 16:51 UTC

    Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn.

    If it would help, here is a badly named and almost undocumented module that may see the light of day sometime.

    It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of RAM.

    The performance isn't too bad at an average of 40 usecs per lookup in my test application below.

    My testcase.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
