Not entirely sure if this is appropriate for your specific problem, but I regularly deal with huge files that are too large to fit in a hash in memory, for removing duplicates, comparing two files, and so on. I have found that the fastest approach is very often to sort the files on the comparison key with the Unix sort utility and then read them sequentially. For what I am doing, this is consistently more than an order of magnitude faster than loading the data into a database.
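
For instance, here is a minimal sketch of the duplicate-removal case, assuming tab-separated records with the comparison key in the first field and a hypothetical input file `records.txt` (adjust the key field and separator to your data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Let the Unix sort utility do the heavy lifting on disk,
# sorting on the first tab-separated field.
open my $sorted, '-|', 'sort', '-t', "\t", '-k1,1', 'records.txt'
    or die "cannot run sort: $!";

my $prev_key;
while ( my $line = <$sorted> ) {
    my ($key) = split /\t/, $line, 2;

    # After sorting, duplicates are adjacent, so comparing against the
    # previous key replaces the %seen hash.
    next if defined $prev_key && $key eq $prev_key;

    print $line;
    $prev_key = $key;
}
close $sorted;
```

Because sort spills to temporary files as needed and the Perl side only keeps the previous key, memory use stays flat no matter how big the input is.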
As the sketch above shows, the algorithm is a bit more involved than just using a %seen hash, but nothing really hard: once the records are sorted, duplicates (or matching keys across two files) end up on adjacent lines, so you only ever need to remember the previous key instead of keeping every key in memory.
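
Comparing two files works the same way: sort both on the key and walk them in parallel, merge-style. A minimal sketch, assuming one key per line and hypothetical file names `a.sorted` and `b.sorted` (it prints keys present in the first file but missing from the second):

```perl
#!/usr/bin/perl
use strict;
use warnings;

open my $fa, '<', 'a.sorted' or die "a.sorted: $!";
open my $fb, '<', 'b.sorted' or die "b.sorted: $!";

my $kb = <$fb>;
chomp $kb if defined $kb;

while ( defined( my $ka = <$fa> ) ) {
    chomp $ka;

    # Advance the second file while its key sorts before the current one.
    while ( defined $kb && $kb lt $ka ) {
        $kb = <$fb>;
        chomp $kb if defined $kb;
    }

    # If the second file never reaches this key, it is missing there.
    print "$ka\n" unless defined $kb && $kb eq $ka;
}
```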