
Re: Best way to search file

by sundialsvc4 (Abbot)
on Apr 15, 2015 at 17:16 UTC (#1123532)

in reply to Best way to search file

A lot depends on just how big the two files are. If they would comfortably(!) fit in memory, a hash would do nicely. If they are larger, consider either sorting the two files (which reduces the problem to a simple “merge”), or perhaps loading them into an SQLite database file. If you do the latter, your problem becomes an INNER JOIN.

“Two identically sorted files” is the old-school technique ... that’s literally what they were doing with all those tape drives, in the days of yore ... but it is a good one, especially if one or both of the files are already sorted and can stay that way. The entire operation can be performed in one sequential pass through both files, no matter how large they are. The price paid is the cost of sorting. (That cost is amortized if the file, known to be sorted and kept sorted, can be reused in the future. A sorted master file can also be updated by applying a transaction file sorted the same way, and this too takes one sequential pass.)
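To make the one-pass merge concrete, here is a minimal sketch in Python. It rests on assumptions that are not in the original question: each line is a comma-separated record whose first field is the key, both inputs are sorted ascending by that key as plain strings, and a key appears at most once per file. The names `first_field` and `merge_join` are made up for illustration.

```python
def first_field(line):
    # Assumed record layout: "key,rest-of-record". Adjust for the real files.
    return line.rstrip("\n").split(",", 1)[0]


def merge_join(left, right, key=first_field):
    """Yield (left_line, right_line) pairs whose keys match.

    Both inputs must be iterables of lines sorted ascending by key
    (string comparison), with each key appearing at most once per input.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None and r is not None:
        lk, rk = key(l), key(r)
        if lk < rk:
            l = next(left_it, None)   # left is behind: advance it
        elif lk > rk:
            r = next(right_it, None)  # right is behind: advance it
        else:
            yield l, r                # keys match: emit, advance both
            l = next(left_it, None)
            r = next(right_it, None)
```

Because the function only ever holds one line from each input, it can be fed open file handles directly (for example `merge_join(open("a.txt"), open("b.txt"))`, with hypothetical file names), and memory use stays flat regardless of file size.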

If you use a hash, the operative word is “comfortably.” If the hash is so large that the operating system starts paging, a hash can perform exceptionally badly because it makes fairly random references to memory addresses. Hashes exhibit the opposite of the “locality of reference” behavior upon which efficient virtual memory depends.
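For the in-memory case, a rough sketch of the hash approach in Python might look like the following. It assumes the same hypothetical record layout (comma-separated, key in the first field) and that only the smaller file's keys need to be held in memory while the larger file is streamed past them. `first_field`, `load_keys`, and `matching_lines` are illustrative names, not anything from the original question.

```python
def first_field(line):
    # Assumed record layout: "key,rest-of-record". Adjust for the real files.
    return line.rstrip("\n").split(",", 1)[0]


def load_keys(lines, key=first_field):
    """Build the in-memory lookup (a hash-based set) from the smaller file."""
    return {key(line) for line in lines}


def matching_lines(lines, wanted, key=first_field):
    """Stream the larger file, yielding only lines whose key is in `wanted`."""
    for line in lines:
        if key(line) in wanted:
            yield line
```

Only the key set lives in memory here, so the "comfortably" test applies to the smaller file's keys rather than to both files in full.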

If you use SQLite, then once again you are paying a stiff price up front, this time for copying the files into the database ... unless you can use the database file multiple times in multiple runs ... say, using it as your master file instead of your present flat file #1.
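As an illustration of the INNER JOIN route, here is a small sketch using Python's standard `sqlite3` module. The table names (`master`, `lookup`), the two-column key/value layout, and the function name `inner_join` are all assumptions made for the example. It uses an in-memory database to stay self-contained, which sidesteps exactly the reuse benefit described above.

```python
import sqlite3


def inner_join(master_rows, lookup_rows):
    """Load two lists of (key, value) tuples and return the joined rows.

    Returns a list of (key, master_value, lookup_value) tuples for keys
    present in both inputs, ordered by key.
    """
    conn = sqlite3.connect(":memory:")  # a file path here would persist the data
    try:
        conn.execute("CREATE TABLE master (key TEXT PRIMARY KEY, value TEXT)")
        conn.execute("CREATE TABLE lookup (key TEXT PRIMARY KEY, value TEXT)")
        # This bulk load is the up-front copying cost.
        conn.executemany("INSERT INTO master VALUES (?, ?)", master_rows)
        conn.executemany("INSERT INTO lookup VALUES (?, ?)", lookup_rows)
        cur = conn.execute(
            "SELECT m.key, m.value, l.value "
            "FROM master AS m "
            "INNER JOIN lookup AS l ON l.key = m.key "
            "ORDER BY m.key"
        )
        return cur.fetchall()
    finally:
        conn.close()
```

If the master data were kept in a database file between runs, only the second table would need loading each time, which is where the up-front cost starts to pay for itself.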

Re^2: Best way to search file
by Theodore (Friar) on Apr 16, 2015 at 13:49 UTC
    (...) or perhaps use an SQLite database file. If you do the latter, your problem becomes an INNER JOIN.
    Or even easier, if the files are small enough to fit in RAM, use DBD::RAM, which supports both file formats.
