in reply to Bloom::Filter Usage
You might consider attacking the problem from the other end. That is, develop a list of possible duplicate accounts and then check the complete list against it. I assume that you need to perform some processing on each record before inserting it even if it is not a duplicate, and probably some other processing if it might be a duplicate. Thus, most of this program's time will be spent processing the records. Using some additional parsing tools might cost a little time while simplifying the program considerably.
For example, you could create a new file, possibledups.txt, by parsing the original file using awk to get the account number and the open and close dates. Pipe this result through sort and uniq to get a (much smaller) list of possible duplicate accounts. Something like ...
gawk [some parse code] | sort | uniq -d > possibledups.txt
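As a purely hypothetical illustration of the parse step (the delimiter, field positions, and file name are all assumptions about your data layout, so adjust to taste):

    # Assumes a pipe-delimited master.txt with the account number in
    # field 1 and the open/close dates in fields 2 and 3.
    # uniq -d emits only the lines that appear more than once.
    gawk -F'|' '{ print $1 "|" $2 "|" $3 }' master.txt | sort | uniq -d > possibledups.txt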
The processing script can then read this duplicate file into a hash first. Then, as the script reads each record from the master file, it can check that record against the hash of possible dups. That way, your code spends most of its processing effort working on known-unique records (which is probably simpler and faster than working on dups). In my experience, this approach can often simplify coding since the exception condition (duplicates) can be moved off to its own subroutine, as sketched below.
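Here is a minimal sketch of that two-pass idea in Perl. The file names, the pipe-delimited layout, and the two handler subroutines are assumptions for illustration, not a definitive implementation:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pass 1: load the (small) list of possible duplicate keys into a hash.
    my %possible_dup;
    open my $dups, '<', 'possibledups.txt' or die "possibledups.txt: $!";
    while (<$dups>) {
        chomp;
        $possible_dup{$_} = 1;
    }
    close $dups;

    # Pass 2: scan the master file once; most records take the fast path.
    open my $master, '<', 'master.txt' or die "master.txt: $!";
    while (my $record = <$master>) {
        chomp $record;
        # Assumed layout: account number, open date, close date, ...
        my ($acct, $open_date, $close_date) = split /\|/, $record;
        my $key = join '|', $acct, $open_date, $close_date;
        if ($possible_dup{$key}) {
            handle_possible_dup($record);   # rare path, kept in its own sub
        } else {
            process_unique($record);        # common, simple path
        }
    }
    close $master;

    sub process_unique      { my ($rec) = @_; }  # normal processing here
    sub handle_possible_dup { my ($rec) = @_; }  # duplicate resolution here

Since the hash holds only the suspect keys rather than all 30 million records, it should fit comfortably in memory, and the main loop stays a single hash lookup per record.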
ps. As a test, I generated a random file containing 30 million records and processed it using this pipe set in about 9 minutes (mileage may vary).
PJ