Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Re: Re: Re: Bloom::Filter Usage

by BrowserUk (Pope)
on Apr 22, 2004 at 16:51 UTC ( #347410=note: print w/replies, xml ) Need Help??

in reply to Re: Re: Bloom::Filter Usage
in thread Bloom::Filter Usage

Update: If anyone would care to explain why this post rates a downvote, I'd be pleased to learn.

If it would help, here is a badly named and almost undocumented module that may see light of day sometime.

It will allow you to create an on-the-fly lookup table of your 30,000,000 12-digit numbers using under 256MB of ram.

The performance isn't too bad at an average of 40 usecs per lookup in my test application below.

package Algorithm::BLT; use strict; use warnings; use List::Util qw[ reduce ]; our $VERSION = '0.01'; sub new{ my( %opts ) = @_; my( $cvtFmt, $keyFmt, $tailFmt ) = delete @opts{ qw[ convert key tail ] }; die "Usage: " . __PACKAGE__ . "::new( convert => fmt, key => fmt, tail => fmt );\n" unless 3 == grep{ defined } $cvtFmt, $keyFmt, $tailFmt; warn "Ignoring extraneous options: ", %opts if %opts; my %lookup; my $seen = sub { use bytes; my $n = pack( $cvtFmt, shift ); my $key = unpack $keyFmt, $n; my $tail = join '', unpack $tailFmt, $n; my $len = length $tail; if( exists $lookup{ $key } ) { no warnings; my $ref = \$lookup{ $key }; return 1 if $$ref =~ m[(?<=^.{$len})*\Q$tail\E]; } $lookup{ $key } .= $tail; return 0; }; return $seen unless wantarray; my $stats = sub{ return sprintf 'keys: %d bytes: %d', scalar keys %lookup, reduce{ $a += length $b } 0, values %lookup; }; my $hRef = sub { return \%lookup }; return $seen, $stats, $hRef; } 1; __END__ =head1 NAME Algorithm::BLT - Perl extension for creating very large, fast lookup t +ables. (BLT derives from Big Lookup Tables.) =head1 SYNOPSIS use Algorithm::BLT; my $seen = Algorith::BLT::new( convert => 'd', key => 'x4 a2', tail => 'a4 x2 a4' ); for my $thingsToLookUp ( ... ) { if( $seen->( $thingsToLookUp ) ) { # Seen it already } else{ # We got a new one } } =head1 DESCRIPTION Typically, perl programs use hashes to effect fast loopup tables. Whilst this is very simply to code and very fast to use, the trade-off + is that they use large volumes of memory. This module attempts to address the case where the program only requir +es the lookup facility without all the other features that a hash provide +s, like values associated with the keys or the ability to iterate the val +ues, and which are therefore just overhead for the pure lookup table usage. =head2 PARAMETERS Currently 3 named parameters are possible and required. Each takes the + form of pack/unpack format strings. =over 4 convert convert - this format is used to convert input values to their binary form for reduce storage footprint. For example, if the values to be st +ored in the lookup are 12 digit numbers, these can be reduce to 8 bytes by converting them to their 'double' binary floating point representation + using 'd' for this parameter. =over 4 key key - this format is used to extract the key from the converted values +. This should be chosen to give the best spread of values across the 'buckets' (hash elements) that are used as the first pass selection mechanism. In the case of double floats, the 5th and 6th bytes do this better than any others bytes, so the format used is 'x4 a2'; =over 4 tail Tail - this format is used to extract the rest of the data (the tail) + from the converted value for storage within the buckets. In the case of the double float, the tail is split into 2 parts, the first 4 bytes and the last two. The output of the format will be concatenated to for +m the secondary key. The format to use in the double float case is therefore 'a4 x2 a2'; =head2 EXPORT None. =head1 AUTHOR BrowserUk =head1 COPYRIGHT AND LICENSE Copyright (C) 2004 by BrowserUk This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.3 or, at your option, any later version of Perl 5 you may have available. =cut

My testcase.

#! perl -slw use strict; use List::Util qw[ first reduce ]; use Math::Random::MT qw[ rand ]; use Benchmark::Timer; use Algorithm::BLT; $| = 1; our $MAX ||= 1000000; my( $new, $exists ) = (0) x 2; my $seen = Algorithm::BLT::new( convert => 'd', key => 'x4 a2', tail => 'a4 x2 a2' ); my $T = new Benchmark::Timer; $T->start( 'test' ); for( 1 .. $MAX ) { my $n = int( rand 1_000_000_000_000 ); $seen->( $n ) ? $exists++ : $new++; printf "\r%10d : %10d", $exists, $new unless $_ %1000; } print $/; $T->stop( 'test' ); $T->report; printf 'Record mem. usage? '; <STDIN>; __END__ P:\test>346619 -MAX=30000000 104190 : 29895810 1 trial of test (1,252s total) Record mem. usage? 215,412 KB

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://347410]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2019-02-19 02:35 GMT
Find Nodes?
    Voting Booth?
    I use postfix dereferencing ...

    Results (101 votes). Check out past polls.