Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

This is exactly the approach that I'm going to have to take -- and it fits rather well with the logic of the process (I'm dealing with an ETL (Extract, Transform, Load) system and there are multiple stages to each job).

FamousLongAgo very kindly sent me the updated v 0.2 code to try out and it definitely works, but unfortunately the way that I'm trying to use it doesn't because the number of duplicates is very low and the population very large. This helps me to end up with with a massive bitmap (431329181 bits) and ten hashing functions. If I knew more about hashing functions I might have been able to come up with a way to accelerate the hashing function by (as was suggested by others in this thread) optimising it for numeric keys of a preset size (12 bytes).

As it stood, however, the level of overhead reduced the filter to the point where it took five to ten seconds per key!

I really like the approach of using uniq -d and can only wish that it had occurred to me a couple of days ago since I would have managed to skip the banging of hand to head that just happened. There's enough memory and swap space to support sorting on 30 million records (this machine even has SyncSort installed).

Thank you everyone for your helpful tips and suggestions.

In reply to Re: Re: Bloom::Filter Usage by jreades
in thread Bloom::Filter Usage by jreades

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others rifling through the Monastery: (3)
    As of 2019-02-16 04:58 GMT
    Find Nodes?
      Voting Booth?
      I use postfix dereferencing ...

      Results (95 votes). Check out past polls.