PerlMonks  


I agree entirely with your principles, and the Bloom filter is pretty much my last resort, reached only after considering the other possibilities.

  • Use a DB -- I'm assuming that you mean a database instead of a DBM file, and this is ultimately where the data is going. The problem is that I essentially have a contested record against which I need to apply some kind of resolution logic later in the processing before I can determine which one belongs in the warehouse.
  • The keys are non-sequential and sparse so an array method won't work. And the generally expected condition is that each key will crop up once, and only once, so the delete method proposed in the other thread is also useless to me.
  • Read the file only once -- agreed. This file is 5GB of compressed plain-text. You can bet I'm not going through it a second time. ;)
  • Keep the memory allocation small -- again, over here in the cheap seats I completely agree. This is one of the really nice things about the Bloom filter: by basically doing a set analysis using a one-way key->bitmap conversion, it's vastly more memory-efficient than a Perl hash holding the same keys. The only prices you pay are: 1) you can't pull the keys out afterwards, 2) you have to accept the risk of a false-positive match (which in my case is fine because I can do the expensive database work later over 25K keys instead of 25000K keys).
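To make that trade-off concrete, here is a minimal sketch of the single-pass idea. It is written in Python rather than Perl and does not use the Bloom::Filter module's API; the names `bloom_parameters`, `BloomFilter`, and `find_contested` are hypothetical, and the SHA-256 double-hashing scheme is just one reasonable choice. The 0.001 error rate in the usage note is an assumption, picked only because 25K suspects out of 25000K keys would correspond to roughly that rate.

```python
import hashlib
import math


def bloom_parameters(n_keys, error_rate):
    """Return (bits, hash_count) from the standard Bloom filter sizing formulas."""
    bits = math.ceil(-n_keys * math.log(error_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_keys * math.log(2)))
    return bits, hash_count


class BloomFilter:
    """One-way key -> bitmap set: keys can be tested but never read back out."""

    def __init__(self, n_keys, error_rate):
        self.bits, self.hash_count = bloom_parameters(n_keys, error_rate)
        self.bitmap = bytearray((self.bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive every bit position from a single digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.bits for i in range(self.hash_count)]

    def add(self, key):
        for pos in self._positions(key):
            self.bitmap[pos >> 3] |= 1 << (pos & 7)

    def check(self, key):
        # True means "probably seen before"; False means "definitely not seen".
        return all(
            self.bitmap[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )


def find_contested(keys, n_keys, error_rate):
    """Read the keys once and return those that (probably) appeared earlier."""
    seen = BloomFilter(n_keys, error_rate)
    suspects = []
    for key in keys:
        if seen.check(key):
            suspects.append(key)  # real duplicate or false positive
        else:
            seen.add(key)
    return suspects
```

Under those assumptions, `bloom_parameters(25_000_000, 0.001)` asks for about 45 MB of bitmap and 10 hash functions. Everything `find_contested` returns still has to go through the expensive resolution step, since a false positive looks exactly like a real repeat, but a key that really does repeat is never missed.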

Hope that this helps to clarify things a bit.

Update -- sorry, I should have mentioned that this is a cumulative monthly file containing updates to an arbitrary number of records as well as a set of new records that may or may not begin to contest for account numbers with older accounts... so the sqlldr method won't work (although it's a good one and occurred to me as well -- I might still go this route and, if I get desperate, resort to using a temporary table so that I *can* use sqlldr to do this). Or, to be more precise, it will only work the first time I load the file, since after that the records will need updating from month to month, so I can't just blast away at the Oracle constraints.


In reply to Re: Re: Bloom::Filter Usage by jreades
in thread Bloom::Filter Usage by jreades
