in reply to
Using text files to remove duplicates in a web crawler
If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).
A Bloom Filter is a simplified hashtable that can test only presence/absence (unlike a full-blown hashtable, which can associate arbitrary data, such as a scalar, with each entry). It can encode this information in FAR less space than a normal hashtable.
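The mechanics are simple: hash each key k different ways into one shared bit array, set those bits on insert, and report "present" only if all k bits are set on lookup. Here's a minimal sketch in Python to show the idea (the technique is language-independent; the CPAN Bloom::Filter module packages the same thing for Perl). The sizes and the double-hashing scheme below are illustrative choices, not tuned values:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=7):
        self.m = m_bits                    # size of the bit array
        self.k = k_hashes                  # number of hash functions
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def check(self, key):
        # True may (rarely) be a false positive; False is always correct.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Crawler-style usage: only fetch URLs the filter hasn't seen.
seen = BloomFilter()
url = "http://www.example.com/index.html"
if not seen.check(url):
    seen.add(url)        # crawl it, then remember it
```

Note that the filter stores no keys at all, only bits, which is where the space savings come from.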
Be warned! Bloom Filters are not perfect: they have a small but nonzero false-positive rate (they never produce false negatives, though, so a URL the filter reports as unseen really is unseen). By increasing the amount of storage per entry (decreasing the effective compression), however, the false-positive rate can be reduced to quite reasonable proportions.
As this article indicates, you can encode up to 1 *million* URLs with a 1-in-1-million false-positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).
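As a back-of-the-envelope check on that figure, the standard Bloom filter sizing formula says that for n keys and a target false-positive rate p, the optimal bit-array size is m = -n·ln(p)/(ln 2)², with k = (m/n)·ln 2 hash functions. Plugging in the numbers above:

```python
import math

n = 1_000_000   # URLs to store
p = 1e-6        # acceptable false-positive rate

m = -n * math.log(p) / math.log(2) ** 2   # total bits needed
k = (m / n) * math.log(2)                 # optimal number of hashes

bytes_per_key = m / n / 8
print(round(bytes_per_key, 1))  # ~3.6 bytes per key
print(round(k))                 # ~20 hash functions
```

So the theoretical minimum is about 3.6 bytes per key; the article's 6-bytes-per-key figure is comfortably above that.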