http://www.perlmonks.org?node_id=346630


in reply to Re: Bloom::Filter Usage
in thread Bloom::Filter Usage

I agree entirely with your principles; the Bloom filter is pretty much the last resort after the other possibilities have been considered.
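
For reference, a minimal sketch of the Bloom-filter approach, using the CPAN module Bloom::Filter to flag account numbers already seen while streaming the file. The file name, the pipe-delimited layout and the capacity/error-rate figures are assumptions for illustration, and whether the pure-Perl module is fast enough for 30 million keys is a separate question.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bloom::Filter;

    # Sized for roughly 30 million keys; error_rate bounds the chance of a
    # false "already seen" answer (a "not seen" answer is always correct).
    my $filter = Bloom::Filter->new(
        capacity   => 30_000_000,
        error_rate => 0.001,
    );

    open my $fh, '<', 'monthly_extract.txt' or die "open: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ($account) = split /\|/, $line;    # assumed pipe-delimited layout

        if ( $filter->check($account) ) {
            # Candidate duplicate -- verify against the DB before updating,
            # since a Bloom filter can report false positives.
            print "candidate duplicate: $account\n";
        }
        else {
            $filter->add($account);           # definitely new account number
        }
    }
    close $fh;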

Hope that this helps to clarify things a bit.

Update -- sorry, I should have mentioned that this is a cumulative monthly file containing updates to an arbitrary number of records, as well as a set of new records that may or may not begin to contest for account numbers with older accounts. So the sqlldr method won't work (although it's a good one and occurred to me as well -- I might still go this route and, if I get desperate, resort to using a temporary table so that I *can* use sqlldr to do this). Or, to be more precise, it will only work the first time I load the file, since after that the records will need updating from month to month, so I can't just blast away at the Oracle constraints.
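
A rough sketch of the temporary-table route mentioned above: bulk-load each monthly file into a staging table with sqlldr, then fold it into the main table with a single MERGE so that existing accounts are updated and new ones inserted, without fighting the constraints on the main table. The table and column names (accounts, accounts_stage, acct_no, balance) and the connect details are invented for illustration.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'scott', 'tiger',
        { RaiseError => 1, AutoCommit => 0 } );

    # Assumes sqlldr has already loaded this month's file into ACCOUNTS_STAGE;
    # MERGE then upserts those rows into the permanent ACCOUNTS table.
    $dbh->do(q{
        MERGE INTO accounts a
        USING accounts_stage s
           ON (a.acct_no = s.acct_no)
        WHEN MATCHED THEN
            UPDATE SET a.balance = s.balance
        WHEN NOT MATCHED THEN
            INSERT (acct_no, balance)
            VALUES (s.acct_no, s.balance)
    });

    $dbh->commit;
    $dbh->disconnect;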

Re: Re: Re: Bloom::Filter Usage
by pelagic (Priest) on Apr 20, 2004 at 14:08 UTC
    That sounds interesting!

    I see the following possibilities:
    • If your DB happens to be Oracle, you could load your 30 million records with a pretty fast tool (SQL*Loader, i.e. sqlldr) and let it write the duplicate-key records to a specified discard file. In a second step you could walk through only those exceptions and update your DB where needed.
    • Sort your file before you read sequentially through it. Sorting a file that big will take its time, but afterwards all entries for one account are grouped together, which keeps memory consumption low (see the sketch after this list).
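
    As a rough sketch of the sort-first idea (assuming a pipe-delimited file with the account number in the first field): once the file is sorted on that field, duplicates are always adjacent, so a single pass only needs to remember the previous key.

        use strict;
        use warnings;

        # Pipe the file through sort(1) on the first field; adjust -t/-k
        # to match the real record layout.
        open my $sorted, '-|', 'sort', '-t', '|', '-k', '1,1',
            'monthly_extract.txt'
            or die "cannot run sort: $!";

        my $prev = '';
        while ( my $line = <$sorted> ) {
            chomp $line;
            my ($account) = split /\|/, $line;

            print "contested account number: $account\n"
                if $account eq $prev;

            $prev = $account;
        }
        close $sorted or die "sort failed: $?";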

    pelagic