I agree entirely with your principles and the Bloom filter is pretty much the last resort after having considered the other possibilities.
- Use a DB -- I'm assuming that you mean a database instead of a DBM file, and this is ultimately where the data is going. The problem is that I essentially have a contested record against which I need to apply some kind of resolution logic later in the processing before I can determine which one belongs in the warehouse.
- The keys are non-sequential and sparse so an array method won't work. And the generally expected condition is that each key will crop up once, and only once, so the delete method proposed in the other thread is also useless to me.
- Read the file only once -- agreed. This file is 5GB of compressed plain-text. You can bet I'm not going through it a second time. ;)
- Keep the memory allocation small -- again, over here in the cheap seats I completely agree. This is one of the really nice things about the Bloom filter: by doing a set analysis with a one-way key->bitmap conversion, it needs far less memory than a Perl hash holding the same keys. The only prices you pay are: 1) you can't pull the keys out afterwards, and 2) you have to accept the risk of a false-positive match (which in my case is fine because I can do the expensive database work later over 25K keys instead of 25000K keys).
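To make the key->bitmap idea concrete, here is a minimal hand-rolled sketch using only core Perl (vec and Digest::MD5). It is not the Bloom::Filter module's API; the names size_for, positions and seen_before, the sample account numbers, and the 0.1% target error rate (roughly 25K out of 25000K) are all assumptions for illustration. The bit positions come from the usual double-hashing trick of combining two 32-bit words of a single digest.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Standard sizing formulas: bits m = -n*ln(p)/(ln 2)^2, probes k = (m/n)*ln 2.
# $n = expected number of keys, $p = acceptable false-positive rate.
sub size_for {
    my ($n, $p) = @_;
    my $m = int(-$n * log($p) / (log(2) ** 2)) + 1;
    my $k = int($m / $n * log(2) + 0.5);
    return ($m, $k);
}

# Hypothetical demo sizing: 1000 keys at a 0.1% error rate.
my ($BITS, $HASHES) = size_for(1000, 0.001);

my $filter = '';
vec($filter, $BITS - 1, 1) = 0;    # pre-extend the bit string to $BITS bits

# Derive $HASHES bit positions from two 32-bit words of one MD5 digest.
sub positions {
    my ($key) = @_;
    my ($h1, $h2) = unpack 'N2', md5($key);
    return map { ($h1 + $_ * $h2) % $BITS } 0 .. $HASHES - 1;
}

# True if the key has possibly been seen before; records the key either way.
sub seen_before {
    my ($key) = @_;
    my $hit = 1;
    for my $p (positions($key)) {
        $hit = 0 unless vec($filter, $p, 1);
        vec($filter, $p, 1) = 1;
    }
    return $hit;
}

# Single pass over the (made-up) keys: only the suspects need the
# expensive database check later.
my @suspects;
for my $acct (qw(A100 B200 C300 A100 D400 B200)) {
    push @suspects, $acct if seen_before($acct);
}
print "possible duplicates: @suspects\n";
```

A genuine repeat is always flagged, since every bit it sets is still set the second time around; a key that was never seen is flagged only when all of its bits happen to collide with earlier keys. Plugging 25000K keys and a 0.1% rate into the same formulas gives a filter of roughly 45MB.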
Hope that this helps to clarify things a bit.
Update -- sorry, I should have mentioned that this is a cumulative monthly file. It contains updates to an arbitrary number of existing records as well as a set of new records that may or may not begin to contest account numbers with older accounts, so the sqlldr method won't work. It's a good one and occurred to me as well; I might still go this route and, if I get desperate, resort to a temporary table so that I *can* use sqlldr. Or, to be more precise, it will only work the first time I load the file: after that the records will need updating from month to month, so I can't just blast away at the Oracle constraints.