
Re^4: Command Line Hash to print things in common between two files

by ZWcarp (Beadle)
on Jan 10, 2012 at 23:27 UTC ( #947255 )

in reply to Re^3: Command Line Hash to print things in common between two files
in thread Command Line Hash to print things in common between two files

I apologize that my question is confusing and that I haven't followed proper update protocol. Please don't take these mistakes as carelessness about posting things the proper way on this forum. I reworked the question several times to try to make it clearer and to address the earlier answers. The script I posted doesn't compile because I renamed all the variables to make them generic (and thus clearer), and I missed one by mistake.

All I want to do is learn how to use Perl to accomplish what the Unix join command does in a bash environment. It would make my life much easier if I didn't have to write a new script every time I need to analyze a particular overlap between two files. join works great, but I would like finer control than it allows.


Replies are listed 'Best First'.
Re^5: Command Line Hash to print things in common between two files
by graff (Chancellor) on Jan 11, 2012 at 06:51 UTC
    As hinted at in one of the earlier replies, sometimes it's worth the effort to create a suitable utility to make a "simple" operation even simpler. It also allows you to add in some useful flexibility that will help to make your command line usage more effective with less typing.

    I have to do a lot of "join"-like operations (actually, things like intersections, unions, and xors) on pairs of arbitrary lists or tables that vary as to delimiters and locations of key fields, so I wrote this "general purpose" tool: cmpcol. You haven't shown any samples of your data yet, so I don't know whether this tool might be useful to you, but I've had occasion to use it (and be glad to have it) just about every day since I wrote it.

Re^5: Command Line Hash to print things in common between two files
by Marshall (Abbot) on Jan 11, 2012 at 16:41 UTC
    First, the assumption that a "more compact" Perl program will execute faster is not true. In fact the opposite is often true! The algorithm used will typically make far, far more difference.

    Also aside from execution speed, Perl compiles at lightning speed and whether you have a "one liner" or 1,000 lines usually makes no real difference at all.

    graff's cmpcol utility looks to be pretty flexible. If that critter does all you need, then I think we're done.

    I see that the content of the OP (original post) has been restored. A few general comments on it related to performance:

    1) In general, reading a line at a time and processing it right then works out better than slurping all the data into an array which is then later processed line by line anyway. You start out by essentially making a verbatim memory resident copy of both files. If they are big files, this alone will take noticeable time. Aside from the file I/O time, the construction (memory allocation) and copying of the data into the array takes time.

    2) For every line in the first file, you cycle through all of the lines in the second file. This can be very expensive execution time-wise! This is a #lines(file1) * #lines(file2) situation.

    3) Going back to re-process the same data again and again is "expensive". Perl's split() is a nice critter, but it is not a "cheap" function. Every trip through the file2 data (of possibly many trips) requires calling it on every line.

    4) To make your code faster, the general idea would be to "do something very significant" with each line read and, to the extent possible, not process the same data twice.

    5) I would be thinking of making a data structure, an AoA or a hash table for the first file (not a simple "verbatim" copy of that file) which contains the "search or join term" and the complete line (for output). Cycle through file2 just once. At each line, decide if there is a match or not with some term in the file1 data structure. That way file2 is only processed one time.

    6) One technique that is sometimes overlooked is that with Perl you can build dynamic regexes on the fly! You could build a single regex that describes all of the terms in file1 and run it against each line of file2: my @terms_found = $line =~ m/...huge regex.../g; Use the "quote regex" qr// syntax so the pattern is compiled only once.

    7) Another technique that is sometimes overlooked is the use of the system sort to simplify the processing. If these are really big files, this idea may work out also.
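    For example, with plain one-column lists, the system sort plus comm(1) gives the intersection directly (the file names and data are illustrative):

```shell
# Hypothetical unsorted one-column lists.
printf 'c\nb\na\n' > list1.txt
printf 'd\nb\nc\n' > list2.txt

# comm(1) requires sorted input; -12 suppresses the lines unique to
# each file, leaving only the lines common to both.
sort list1.txt > s1.txt
sort list2.txt > s2.txt
comm -12 s1.txt s2.txt
```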

    The possibilities for fine-tuning the performance are many, though not endless. Some examples of your files, as well as typical sizes, would be very appropriate. I think if you implement step 5 above, the performance increase will be noticeable. Again, split() is great, but it is not a "cheap" function in terms of CPU. If you just put file2 into a better structure and didn't run split() so often, that alone would increase performance.
