http://www.perlmonks.org?node_id=346619

jreades has asked for the wisdom of the Perl Monks concerning the following question:

I have a large file (30 million records) containing a two-field key that is supposed to be unique. Unfortunately, it isn't necessarily that way... let me try to explain:
  1. The file comes to me with these fields (amongst others):
    • Account Number
    • Account Open Date
    • Account Close Date
  2. Account numbers can be reused, but only if the account has been closed. Unless you have data issues. Which I do. :(
  3. So I need a quick way to check while processing (amongst up to 30 million unique keys) whether I have seen this key before (i.e. whether it's a dupe that needs checking).
  4. The normal hash lookup method $cache{$account} works (sketched just after this list), but with ever-decreasing performance and an ever-increasing memory profile (it got up to 1.4GB of memory usage).
  5. So a Perl monger suggested looking at Bloom filters. More on Bloom filters is available here: Perl.com
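
For reference, point 4 is just the classic seen-hash pattern -- every distinct key stays in memory for the life of the run, which is where the 1.4GB comes from. A minimal sketch (field layout simplified for illustration):

    my %cache;
    while (my $line = <$in>) {
        my ($account) = split /\|/, $line;   # account number assumed to be the first field
        if ($cache{$account}++) {
            # seen before -- a dupe that needs checking
        }
        # every distinct key is now held in %cache until the script exits
    }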

I've found the Bloom::Filter module in CPAN but can't get it to work and am also worried about what level of false-positives I'm facing.

My current code is:

my $bloom_filter = Bloom::Filter->new(error_rate => 0.001, capacity => 30000000);

if ($bloom_filter->check($account_number)) {
    ... do deduping ...
} else {
    $bloom_filter->add($account_number);
    ... do something ...
}

I'm looking for wisdom on two fronts:

  • how to use Bloom::Filter correctly in the first place (in particular, what I'm supposed to pass as salts), and
  • what level of false positives I'm realistically facing at this scale, and whether I can live with it.

Thanks

Re: Bloom::Filter Usage
by pelagic (Priest) on Apr 20, 2004 at 13:39 UTC
    Hi
    I don't know Bloom::Filter.
    Anyway, if you have a really huge file, there are some principles to follow:
    • if possible: use a DB
    • read the file only once
    • keep memory allocation small
    There was a somewhat similar question a couple of days ago here Parsing Huge File
    As mentioned in the above node, you might consider using some sort of "state engine".

    pelagic

      I agree entirely with your principles and the Bloom filter is pretty much the last resort after having considered the other possibilities.

      • Use a DB -- I'm assuming that you mean a database instead of a DBM file, and this is ultimately where the data is going. The problem is that I essentially have a contested record against which I need to apply some kind of resolution logic later in the processing before I can determine which one belongs in the warehouse.
      • The keys are non-sequential and sparse so an array method won't work. And the generally expected condition is that each key will crop up once, and only once, so the delete method proposed in the other thread is also useless to me.
      • Read the file only once -- agreed. This file is 5GB of compressed plain-text. You can bet I'm not going through it a second time. ;)
      • Keep the memory allocation small -- again, over here in the cheap seats I completely agree. This is one of the really nice things about the Bloom filter: by doing set-membership tests via a one-way key->bitmap conversion, it's vastly more memory-efficient than a Perl hash of the same keys. The only prices you pay are: 1) you can't pull the keys back out afterwards, and 2) you have to accept the risk of a false-positive match (which in my case is fine, because I can do the expensive database work later over 25K keys instead of 25,000K keys).

      Hope that this helps to clarify things a bit

      Update -- sorry, I should have mentioned that this is a cumulative monthly file containing updates to an arbitrary number of records, as well as a set of new records that may or may not begin to contest for account numbers with older accounts... so the sqlldr method won't work (although it's a good one and occurred to me as well -- I might still go this route and, if I get desperate, resort to a temporary table so that I *can* use sqlldr to do this). Or, to be more precise, it will only work the first time I load the file, since after that the records will need updating from month to month, so I can't just blast away at the Oracle constraints.

        That sounds interesting!

        I see following possibilities:
        • If your DB happens to be Oracle, you could load your 30 million recs with a pretty fast tool (sqlldr, a.k.a. SQL*Loader) and let the tool write the duplicate-key records to a specified "discarded-records" file. You could then, in a second step, walk through only the exceptions and update your DB as needed.
        • Sort your file before you read sequentially through it. Sorting a file that big will take its time, but afterwards all the entries for one account are grouped together, which cuts your memory consumption right down (see the sketch below).
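
        A rough sketch of that second idea: once the file is sorted on the account-number field, all records for an account are adjacent, so a single "previous key" scalar replaces the big hash (the pipe-delimited layout is an assumption):

            my $prev = '';
            open my $in, '<', 'accounts.sorted' or die "accounts.sorted: $!";
            while (my $line = <$in>) {
                my ($account) = split /\|/, $line;   # account number assumed to be the first field
                if ($account eq $prev) {
                    # duplicate key -- hand it to the resolution logic
                } else {
                    # first (or only) record for this account
                }
                $prev = $account;
            }
            close $in;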

        pelagic
Re: Bloom::Filter Usage
by thpfft (Chaplain) on Apr 20, 2004 at 15:40 UTC

    From the code, I think any string will do as a salt. All it's doing is passing it as the second argument to sha1(), and all that does is append it to the first argument before hashing. The reason it talks about salts plural is that instead of using different hashing algorithms, it gets its vector coordinates by reusing the same algorithm with a different addendum. Throw some arbitrary but consistent four letter words at it and see what happens.

    More generally, it seems like a neat approach but will require at least two passes to really work. Your second pass will have to weed out the false positives by rescanning the data set with a simple hash-based tally restricted to the output of the first pass. In that case you can choose the accuracy based on the number of false positives you can afford to screen for, and even 0.1% would give you a fairly manageable hash of 30,000 keys.
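
    To make the two-pass shape concrete, here's a rough sketch -- it assumes a working check()/add() interface and a pipe-delimited file with the account number in the first field, neither of which I've verified against your data:

        my $filter = Bloom::Filter->new( error_rate => 0.001, capacity => 30_000_000 );

        # Pass 1: collect candidate duplicates (real dupes plus ~0.1% false positives).
        my %candidate;
        open my $in, '<', 'accounts.txt' or die $!;
        while (my $line = <$in>) {
            my ($account) = split /\|/, $line;
            if ($filter->check($account)) {
                $candidate{$account} = 0;
            } else {
                $filter->add($account);
            }
        }
        close $in;

        # Pass 2: exact tally, restricted to the (small) candidate set.
        open $in, '<', 'accounts.txt' or die $!;
        while (my $line = <$in>) {
            my ($account) = split /\|/, $line;
            $candidate{$account}++ if exists $candidate{$account};
        }
        close $in;

        my @real_dupes = grep { $candidate{$_} > 1 } keys %candidate;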

    Incidentally, I'm no pack() leader, but if I were you I would think about rolling your own using only some bits of Bloom::Filter. It's biased towards working with email addresses, which are very heterogeneous, and there may well be regularities in your data set (they are IDs, after all) that make other algorithms much more efficient and discriminatory than just applying and reapplying the same hashing routine.
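
    In case it helps, this is the sort of hand-rolled filter I mean -- Digest::SHA1 reused with different numeric salts, and a vec() bit string for the filter itself. The sizes are placeholders (take them from an optimise()-style calculation like the one below), and it's a sketch rather than tested code:

        use Digest::SHA1 qw(sha1);

        my $FILTER_BITS = 431_329_181;   # placeholder: ~52MB bit string for 30M keys at 0.1%
        my $NUM_HASHES  = 10;            # placeholder: optimal number of hashes for that rate

        my $filter = '';
        vec($filter, $FILTER_BITS - 1, 1) = 0;   # pre-extend the bit string once

        # k positions per key: same digest, different numeric salt prepended.
        sub bit_positions {
            my $key = shift;
            return map { unpack('N', sha1($_, $key)) % $FILTER_BITS } 1 .. $NUM_HASHES;
        }

        # True if the key *may* have been seen before; records it either way.
        sub seen {
            my @pos  = bit_positions(shift);
            my $seen = 1;
            for my $p (@pos) {
                $seen = 0 unless vec($filter, $p, 1);
                vec($filter, $p, 1) = 1;
            }
            return $seen;
        }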

    update: according to the perl.com article, with 30 million keys and a 0.1% error rate you need 10 hash functions to get the most memory-efficient filter. For 0.01% it's 13 functions and about 30% more memory, and so on:

    my ( $m, $k ) = optimise(30000000, 0.0001);
    print <<"";
    memory usage = $m
    hash functions = $k

    sub optimise {
        my ( $num_keys, $error_rate ) = @_;
        my $lowest_m;
        my $best_k = 1;
        foreach my $k ( 1..100 ) {
            my $m = (-1 * $k * $num_keys) /
                    ( log( 1 - ($error_rate ** (1/$k)) ) );
            if ( !defined $lowest_m or ($m < $lowest_m) ) {
                $lowest_m = $m;
                $best_k   = $k;
            }
        }
        return ( $lowest_m, $best_k );
    }

      This is where things get very strange for me

      Here's the code I'm running as a test:

      #!/usr/bin/perl

      use strict;
      use Digest::SHA1 qw(sha1);
      use Bloom::Filter;

      my $filter = Bloom::Filter->new(error_rate => 0.0001, capacity => 100);

      my @salts;

      # None of these work

      # Option 1:
      # push @salts, "msis";

      # Option 2:
      # for my $salt (0..3) {
      #     push @salts, sub { sha1($salt,$_[0]) };
      # }

      # Option 3:
      # for my $salt (0..3) {
      #     push @salts, sha1($salt,$_[0]);
      # }

      $filter->set_salts(@salts);

      $filter->add("Foo");
      $filter->add("Bar");
      $filter->add("Baz");

      print STDOUT ($filter->check("Bar") ? "Found" : "Not Found"), "\n";
      print STDOUT ($filter->check("Bim") ? "Found" : "Not Found"), "\n";

      exit;
      It looks like one of these should have worked, but when I run my test code I always get the following:

      jreades@sort:~>> ./test.pl
      Use of uninitialized value in numeric gt (>) at Bloom/Filter.pm line 126.
      Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 86.
      Argument "@" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 86.
      Found
      Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 86.
      Argument "" isn't numeric in numeric eq (==) at Bloom/Filter.pm line 86.
      Found

      It's just possible that there's something wrong with the module itself, and I've emailed the author asking for any tips or tricks but I haven't heard back from him/her yet.

      And yes, I do think that the approach is pretty neat -- when someone suggested it and I did some reading it leapt out as a very low-cost way to perform a high-cost operation. And, as you said, even taking a fairly high false-positive rate of 0.1% you still end up with a tiny fraction of your original search space.

      I did notice the odd bias towards email addresses but figured it might not affect what I was trying to do. What algorithm would you suggest as an alternative for working with 12-digit numeric keys more efficiently?

      TIA

        On closer inspection, Bloom::Filter is a bit broken, unless I'm badly misreading it:

        * The add() method doesn't actually add anything, because the new() method doesn't initialise the $self->{contents} hashref. It needs to be changed so that either the add() method assigns directly to $self->{contents}, or the new() method uses 'contents' instead of 'keys'. That's probably just a typo.

        * because $self->{contents} is not populated, _calculate_filter_length() always returns undef, so the default filter length is always used.

        * because $self->{contents} is not populated, build_filter doesn't build a filter anyway.

        * The use of == to compare bitstrings in check() is generating warnings, as you've seen. Its purpose is to test that ($a & $b) is the same as $a, i.e. that all the on bits in $a are also on in $b, never mind the off bits. Someone better than me at pack(), i.e. anyone at all, will know how to test equivalence using the bitstrings themselves (one way is sketched after this list); meanwhile, you can make it work by testing with the unpacked versions.

        * And anyway, it's not incremental. Every time you add a new key, the whole filter is rebuilt by way of a loop over $self->{contents}. This makes it more or less useless for your purposes: you presumably want an iterative check($_) || add($_) mechanism that can be placed in the equivalent of a while(<HUGE_FILE>) loop (the shape of that loop is sketched below). As it stands, Bloom::Filter will perform another loop across your whole dataset-so-far for each input line, which might slow things down a bit.
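
        For what it's worth, packed bitstrings can be compared directly as strings with eq, which avoids the numeric warnings entirely -- a suggestion rather than what the module currently does:

            # inside check(): every on bit of $mask must also be on in the filter
            push @result, ( ($mask & $self->{filter}) eq $mask );

        And the loop shape you presumably want, assuming check() and add() were made properly incremental (hypothetical usage, field layout assumed):

            while (my $line = <HUGE_FILE>) {
                my ($account) = split /\|/, $line;
                if ($filter->check($account)) {
                    # possible dupe -- set aside for exact checking later
                } else {
                    $filter->add($account);   # should just set bits, not rebuild the filter
                }
            }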

        You will need to roll your own, I think, unless the author can be persuaded to accommodate both approaches as well as fixing the bugs, but if you patch Bloom::Filter with this, at least your test script should work:

        --- Filter.orig.pm      Tue Apr 20 20:01:37 2004
        +++ Filter.pm   Tue Apr 20 20:04:34 2004
        @@ -25,7 +25,7 @@
                        min_length => 20,
                        filter_length => 20,
                        %params,
        -               keys => {}
        +               contents => {}
                }, $class;
         }
        @@ -52,7 +52,7 @@
                my $list = $self->{contents};
                foreach my $add ( @addresses ) {
                        # convert email to SHA1 hash if necessary
        -               $add = sha1_base64( $add ) if $add =~ /@/o;
        +               #$add = sha1_base64( $add ) if $add =~ /@/o;
                        $list->{$add}++;
                }
                $self->{reindex_flag} = 1;
        @@ -83,7 +83,11 @@
                # A match occurs if every bit we check is on
                foreach my $key ( @keys ) {
                        my $mask = $self->_make_bitmask( $key );
        -               push @result, ($mask == ( $mask & $self->{filter} ));
        +               #push @result, ($mask == ( $mask & $self->{filter} ));
        +
        +               my $m = unpack("b*", $mask);
        +               push @result, ($m == unpack("b*", ($mask & $self->{filter})));
        +
                }
                return ( wantarray() ? @result : $result[0] );
         }
Re: Bloom::Filter Usage
by demerphq (Chancellor) on Apr 20, 2004 at 18:31 UTC

    Divide and conquer. You say you have two fields that are the key, one of which is unique. Take the right-hand digit of the number and sort your records into 10 files by that digit. (Insert hand-waving about it probably working out that this gives you roughly even-sized output files.) Now do your dupe checks on the resulting files. The thing to remember about perl hashes is that they grow in powers of two, that is, they double when they are too small. So divide your file sufficiently that you stay within reasonable bounds. Dividing by 10 has worked for me with equivalent-sized data loads.
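
    A quick sketch of the bucketing pass, assuming a pipe-delimited file with the account number in the first field:

        my @bucket;
        for my $d (0 .. 9) {
            open $bucket[$d], '>', "accounts.bucket.$d" or die "bucket $d: $!";
        }
        while (my $line = <STDIN>) {
            my ($account) = split /\|/, $line;
            my ($digit)   = $account =~ /(\d)\s*$/;   # right-hand digit of the account number
            print { $bucket[$digit] } $line;
        }
        close $_ for @bucket;
        # ...then run the ordinary hash-based dupe check on each bucket in turn.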

    There are other approaches to this like using DB_File or some kind of RDBMS but I actually think overall you will have a simpler and probably more efficient system if you just use some kind of approach to scale the data down. Splitting data into bite sized chunks is an ancient and honorable programming tradition. :-)

    Oh, another approach is to use a Trie of some sort. If your accounts are dense then overall it can be a big winner in terms of space and is very efficient in terms of lookup.
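
    Purely as an illustration of the trie idea -- note that a naive nested-hash trie like this actually costs more memory than a flat hash in Perl, so the space win really needs a packed/compressed representation, and it assumes fixed-width 12-digit keys so that no key is a prefix of another:

        my %trie;

        # True if the account number was already present; inserts it either way.
        sub trie_seen {
            my $node = \%trie;
            my $new  = 0;
            for my $digit (split //, shift) {
                $new = 1 unless exists $node->{$digit};
                $node = $node->{$digit} ||= {};
            }
            return !$new;
        }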


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


      There are other approaches to this like using DB_File or some kind of RDBMS but I actually think overall you will have a simpler and probably more efficient system if you just use some kind of approach to scale the data down. Splitting data into bite sized chunks is an ancient and honorable programming tradition. :-)
      ... and it's exactly how DBM files (and therefore indexes for RDBMSs too) do their jobs. What else do you think binary search, B-trees, B+-trees etc. are, but ways of splitting up the data into ever smaller chunks?

      I don't think you'll be able to beat these kinds of databases at their own game.

        Er, maybe you mean this in a way I misunderstand, but the algorithms that you are talking about don't split the data up. They apply an order to the data, sure, and they deal with records more or less singly, but they don't split the data up.

        As for the other point, well, I guess unless somebody bothers to benchmark it we won't know. :-) Unfortunately right now I don't have the time.


        ---
        demerphq

          First they ignore you, then they laugh at you, then they fight you, then you win.
          -- Gandhi


Re: Bloom::Filter Usage
by Anonymous Monk on Apr 20, 2004 at 23:52 UTC

    If you're running updates in batches, don't forget about quick-ish stuff that might work:

    perl -le 'for(1..30_000_000){$x=int(rand(30_000_000));print $x;}' > /tmp/randnums

    time sort -n /tmp/randnums > /tmp/randnumssorted
    real    2m0.819s
    user    1m52.631s
    sys     0m2.798s
    # used about 200m memory

    time uniq -c /tmp/randnumssorted > /tmp/randuniq
    real    0m11.225s
    user    0m8.520s
    sys     0m1.019s

    time sort -rn /tmp/randuniq > /tmp/randuniqsort
    real    1m0.062s
    user    0m41.569s
    sys     0m3.125s

    head /tmp/randuniqsort
         10 7197909
         10 6080002
         10 2718836
         10 21596579
          9 8257184
          9 8116236
          9 7721800
          9 7706211
          9 7657721
          9 7490738

    Pull out your account numbers, then sort/uniq them to find the duplicates. It takes about 3 minutes and roughly 200MB of memory.

    There's nothing wrong with going over the file twice if it makes it easier to process.

Re: Bloom::Filter Usage
by eXile (Priest) on Apr 20, 2004 at 20:26 UTC
      4. The normal hash lookup method $cache{$account} works, but with ever-decreasing performance and an ever-increasing memory profile (it got up to 1.4GB of memory usage).
    If you've got plenty of disk space and fast disks, Cache::FileCache might be a solution as well. It should keep your memory usage lower and doesn't have that much overhead.
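    A minimal sketch of how that might look -- the cache_root and the field layout are just placeholders:

        use Cache::FileCache;

        my $cache = Cache::FileCache->new({
            namespace  => 'seen_accounts',
            cache_root => '/var/tmp/dedupe_cache',   # placeholder path
        });

        while (my $line = <$in>) {
            my ($account) = split /\|/, $line;
            if (defined $cache->get($account)) {
                # possible duplicate -- handle separately
            } else {
                $cache->set($account, 1);
            }
        }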
Re: Bloom::Filter Usage
by periapt (Hermit) on Apr 21, 2004 at 17:15 UTC
    You might consider attacking the problem from the other end. That is, develop a list of possible duplicate accounts first, and then check the complete file against it. I assume that you need to perform some processing on each record before inserting it even if it is not a duplicate, and probably some other processing if it might be a duplicate. Thus, most of this program's time will be spent processing records. Using some additional parsing tools is a small cost in time that can simplify the program considerably.

    For example, you could create a new file, possibledups.txt, by parsing the original file with awk to get the account number and the open and close dates. Pipe this result through sort and uniq to get a (much smaller) list of possible duplicate accounts. Something like ...

     gawk [some parse code] | sort | uniq -d > possibledups.txt

    The processing script can then read this duplicates file into a hash first. Then, as it reads each record from the master file, it can check that record against the hash of possible dups (a sketch follows). That way, your code spends most of its processing effort on known unique records (which is probably simpler and faster than working on dups). In my experience, this approach can often simplify coding, since the exception condition (duplicates) can be moved off to a separate subroutine.
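
    Roughly like this -- the file names and the pipe-delimited layout are assumptions:

        # load the (much smaller) list of suspect account numbers
        my %maybe_dup;
        open my $dups, '<', 'possibledups.txt' or die "possibledups.txt: $!";
        while (<$dups>) {
            chomp;
            $maybe_dup{$_} = 1;
        }
        close $dups;

        # stream the master file once; only suspects get the expensive treatment
        open my $master, '<', 'master.txt' or die "master.txt: $!";
        while (my $line = <$master>) {
            my ($account) = split /\|/, $line;
            if ($maybe_dup{$account}) {
                # resolve_duplicate($line);
            } else {
                # process_unique($line);
            }
        }
        close $master;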

    ps. As a test, I generated a random file containing 30 million records and processed it using this pipe set in about 9 minutes (mileage may vary).

    PJ

      This is exactly the approach that I'm going to have to take -- and it fits rather well with the logic of the process (I'm dealing with an ETL (Extract, Transform, Load) system and there are multiple stages to each job).

      FamousLongAgo very kindly sent me the updated v0.2 code to try out, and it definitely works, but unfortunately the way I'm trying to use it doesn't work well, because the number of duplicates is very low and the population very large. That leaves me with a massive bitmap (431,329,181 bits) and ten hashing functions. If I knew more about hashing functions I might have been able to come up with a way to accelerate them by optimising for numeric keys of a preset size (12 bytes), as others in this thread suggested.

      As it stood, however, the overhead slowed the filter down to the point where it took five to ten seconds per key!

      I really like the approach of using uniq -d and can only wish it had occurred to me a couple of days ago, since I would have skipped the hand-to-head banging that just happened. There's enough memory and swap space to support sorting 30 million records (this machine even has SyncSort installed).

      Thank you everyone for your helpful tips and suggestions.

        Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn.

        If it would help, here is a badly named and almost undocumented module that may see the light of day sometime.

        It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of RAM.

        The performance isn't too bad at an average of 40 usecs per lookup in my test application below.

        My testcase.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail