http://www.perlmonks.org?node_id=346630


in reply to Re: Bloom::Filter Usage
in thread Bloom::Filter Usage

I agree entirely with your principles; the Bloom filter is pretty much the last resort after the other possibilities have been considered.
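
For reference, a minimal sketch of the Bloom-filter approach, using the CPAN module Bloom::Filter to flag account numbers already seen while streaming the file. The file name, the pipe-delimited layout and the capacity/error-rate figures are assumptions for illustration, and whether the pure-Perl module is fast enough for 30 million keys is a separate question.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bloom::Filter;

    # Sized for roughly 30 million keys; error_rate bounds the chance of a
    # false "already seen" answer (a "not seen" answer is always correct).
    my $filter = Bloom::Filter->new(
        capacity   => 30_000_000,
        error_rate => 0.001,
    );

    open my $fh, '<', 'monthly_extract.txt' or die "open: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ($account) = split /\|/, $line;    # assumed pipe-delimited layout

        if ( $filter->check($account) ) {
            # Candidate duplicate -- verify against the DB before updating,
            # since a Bloom filter can report false positives.
            print "candidate duplicate: $account\n";
        }
        else {
            $filter->add($account);           # definitely new account number
        }
    }
    close $fh;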

Hope that this helps to clarify things a bit.

Update -- sorry, I should have mentioned that this is a cumulative monthly file containing updates to an arbitrary number of records, as well as a set of new records that may or may not begin to contest for account numbers with older accounts. So the sqlldr method won't work (although it's a good one and occurred to me as well -- I might still go this route and, if I get desperate, resort to using a temporary table so that I *can* use sqlldr to do this). Or, to be more precise, it will only work the first time I load the file, since after that the records will need updating from month to month, so I can't just blast away at the Oracle constraints.
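
A rough sketch of the temporary-table route mentioned above: bulk-load each monthly file into a staging table with sqlldr, then fold it into the main table with a single MERGE so that existing accounts are updated and new ones inserted, without fighting the constraints on the main table. The table and column names (accounts, accounts_stage, acct_no, balance) and the connect details are invented for illustration.

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'scott', 'tiger',
        { RaiseError => 1, AutoCommit => 0 } );

    # Assumes sqlldr has already loaded this month's file into ACCOUNTS_STAGE;
    # MERGE then upserts those rows into the permanent ACCOUNTS table.
    $dbh->do(q{
        MERGE INTO accounts a
        USING accounts_stage s
           ON (a.acct_no = s.acct_no)
        WHEN MATCHED THEN
            UPDATE SET a.balance = s.balance
        WHEN NOT MATCHED THEN
            INSERT (acct_no, balance)
            VALUES (s.acct_no, s.balance)
    });

    $dbh->commit;
    $dbh->disconnect;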

Re: Re: Re: Bloom::Filter Usage
by pelagic (Priest) on Apr 20, 2004 at 14:08 UTC
    That sounds interesting!

    I see the following possibilities:
    • If your DB happens to be Oracle, you could load your 30 million records with a pretty fast tool (SQL*Loader, i.e. sqlldr) and let it write the duplicate-key records to a specified discard file. In a second step you could walk through only those exceptions and update your DB where needed.
    • Sort your file before you read sequentially through it. Sorting a file that big will take its time, but afterwards all entries for one account are grouped together, which keeps memory consumption low (see the sketch after this list).
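
    As a rough sketch of the sort-first idea (assuming a pipe-delimited file with the account number in the first field): once the file is sorted on that field, duplicates are always adjacent, so a single pass only needs to remember the previous key.

        use strict;
        use warnings;

        # Pipe the file through sort(1) on the first field; adjust -t/-k
        # to match the real record layout.
        open my $sorted, '-|', 'sort', '-t', '|', '-k', '1,1',
            'monthly_extract.txt'
            or die "cannot run sort: $!";

        my $prev = '';
        while ( my $line = <$sorted> ) {
            chomp $line;
            my ($account) = split /\|/, $line;

            print "contested account number: $account\n"
                if $account eq $prev;

            $prev = $account;
        }
        close $sorted or die "sort failed: $?";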

    pelagic