Re: Best way to search file

A lot depends on just how big the two files are. If they are of a size that would comfortably(!) fit in memory, a hash would do nicely. If they are larger, consider either sorting the two files (which reduces the problem to a simple “merge”), or perhaps use an SQLite database file. If you do the latter, your problem becomes an INNER JOIN.

“Two identically sorted files” is the old-school technique ... that’s literally what they were doing with all those tape drives, in the days of yore ... but it is a good one, especially if one or both of the files are already sorted and can stay that way. The entire operation can be performed in one sequential pass through both files, no matter how large they are. The price paid is the cost of sorting. (That cost is amortized if the file, known to be sorted and kept sorted, can then be reused in the future. A sorted sequential file can itself be updated by applying a sorted transaction file to it, and this too takes only one sequential pass.)
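
The one-pass merge described above can be sketched roughly as follows. This is only an illustration, written in Python for brevity. It assumes (none of this comes from the original question) that each record is one line, that the key is the first comma-separated field, that keys are unique within each file, and that both inputs are already sorted ascending by that key. The names `first_field` and `merge_matches` are made up for the example.

```python
def first_field(line):
    """Hypothetical key extractor: the text before the first comma."""
    return line.split(",", 1)[0]


def merge_matches(left_lines, right_lines, key=first_field):
    """Yield (left, right) record pairs whose keys are equal.

    Assumes both inputs are sorted ascending by key and that keys are
    unique within each input. Each input is read exactly once.
    """
    left_iter = (line.rstrip("\n") for line in left_lines)
    right_iter = (line.rstrip("\n") for line in right_lines)
    left = next(left_iter, None)
    right = next(right_iter, None)
    while left is not None and right is not None:
        left_key, right_key = key(left), key(right)
        if left_key < right_key:
            # Left record has no partner; advance the left side only.
            left = next(left_iter, None)
        elif left_key > right_key:
            # Right record has no partner; advance the right side only.
            right = next(right_iter, None)
        else:
            yield left, right
            left = next(left_iter, None)
            right = next(right_iter, None)


if __name__ == "__main__":
    file_a = ["a,1\n", "c,3\n", "d,4\n"]
    file_b = ["b,x\n", "c,y\n", "d,z\n", "e,w\n"]
    for pair in merge_matches(file_a, file_b):
        print(pair)
```

Open file handles can be passed in place of the lists, so memory use stays constant regardless of file size. The sort order of the files has to agree with the comparison the code uses (plain string comparison here), otherwise matches will be silently missed.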

If you use a hash, the operative word is “comfortably.” If the hash is so large that the operating system starts paging, it can perform exceptionally badly, because hash lookups touch memory addresses in an essentially random order. Hashes exhibit the opposite of the “locality of reference” behavior upon which efficient virtual memory depends.
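
For the case where things do fit comfortably, the hash approach might look something like this. It is a sketch under assumptions, written in Python for brevity (a Perl hash plays the same role as the dict): one line per record, key in the first comma-separated field, unique keys in the file that gets loaded. Only one of the two files needs to live in memory, so it makes sense to index the smaller one and stream the other. The names `first_field`, `build_index` and `probe` are invented for the example.

```python
def first_field(line):
    """Hypothetical key extractor: the text before the first comma."""
    return line.split(",", 1)[0]


def build_index(lines, key=first_field):
    """Load one file into a dict keyed by its first field.

    If a key occurs more than once, the last record wins.
    """
    index = {}
    for line in lines:
        record = line.rstrip("\n")
        index[key(record)] = record
    return index


def probe(index, lines, key=first_field):
    """Stream the other file and yield (indexed, streamed) matches."""
    for line in lines:
        record = line.rstrip("\n")
        match = index.get(key(record))
        if match is not None:
            yield match, record


if __name__ == "__main__":
    small_file = ["c,3\n", "a,1\n"]          # need not be sorted
    big_file = ["b,x\n", "c,y\n", "a,z\n"]   # need not be sorted either
    index = build_index(small_file)
    for pair in probe(index, big_file):
        print(pair)
```

Neither file has to be sorted, which is the attraction. The trade-off is the one described above: every lookup lands at an effectively random spot in the table, so performance falls apart once the table no longer fits in physical memory.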

If you use SQLite, then once again you are paying a stiff price up front, this time for copying the files into the database ... unless you can use the database file multiple times in multiple runs ... say, using it as your master file instead of your present flat file #1.
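
To make the INNER JOIN idea concrete, here is a minimal sketch using Python's standard `sqlite3` module (in Perl the same SQL would go through DBI with DBD::SQLite). The table names, the two-column key/value layout and the function name `join_with_sqlite` are all assumptions made for the example, not anything taken from the original question.

```python
import sqlite3


def join_with_sqlite(left_rows, right_rows, db_path=":memory:"):
    """Load two sets of (key, value) rows and return the joined rows.

    Table and column names are hypothetical. The default database lives
    in memory; a file path would let the loaded data persist on disk.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE left_file (k TEXT PRIMARY KEY, v TEXT)")
        conn.execute("CREATE TABLE right_file (k TEXT PRIMARY KEY, v TEXT)")
        # This load step is the up-front copying cost.
        conn.executemany("INSERT INTO left_file VALUES (?, ?)", left_rows)
        conn.executemany("INSERT INTO right_file VALUES (?, ?)", right_rows)
        conn.commit()
        cursor = conn.execute(
            "SELECT l.k, l.v, r.v "
            "FROM left_file AS l "
            "INNER JOIN right_file AS r ON r.k = l.k "
            "ORDER BY l.k"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    rows_a = [("a", "1"), ("c", "3"), ("d", "4")]
    rows_b = [("b", "x"), ("c", "y"), ("d", "z")]
    for row in join_with_sqlite(rows_a, rows_b):
        print(row)
```

As written, the sketch creates and loads both tables on every call, so it pays the copying price each time. The payoff comes only if the database file is kept around and the load step is skipped on later runs.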

Re^2: Best way to search file
by Theodore (Friar) on Apr 16, 2015 at 13:49 UTC

(...) or perhaps use an SQLite database file. If you do the latter, your problem becomes an INNER JOIN.

DBD::RAM
