This is on the right track but I need to be able to tell it which columns to check because the whole lines are never going to be in common?
Re^3: Command Line Hash to print things in common between two files by CountZero (Chancellor) on Jan 10, 2012 at 21:22 UTC
You will get more useful answers if you show a few lines of the file(s) to be analysed.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^3: Command Line Hash to print things in common between two files by ww (Chancellor) on Jan 10, 2012 at 22:28 UTC
Despite your requirement that this be "on the command line," you might solve this yourself by understanding and then extending the example offered or an answer which can be found in one of the many other SOPW's asking about essentially the same chore.
And, yes, it's more likely the latter, since you want to test the content of a specific column (you didn't say which one) in each line in file 1 against the content of a specific column in any line in a second file...
... or is that not what you meant? The phrase "where the column from file 1 matches somewhere in file 2" makes me wonder if you're looking for any column in a given (same line number) line in file 2 that matches the content of the specified column in a particular line in file 1. Your reply to the first answer would appear to rule that out were it not for the terminal punctuation -- a question mark!
The first step to solving your problem is probably re-stating it to yourself, in a clear, precise and unambiguous manner.
Update: Upon posting this reply, discovered that ZWcarp had made major, un-acknowledged revisions to the OP. meh!
Added: (and his code doesn't compile under strict. At line 15, Global symbol "@file2" requires explicit package name.)
Re-updated. (Yech): OP's first update (prior to adding the reference to "a gene identifier number or a CG number. These are always numbers and letter delimited somehow.") left the requirement ambiguous (at least to me) so I prepped this, seeking clarification. Clearly, it's not characteristic of the new spec, but, FTR:
File 1                      File 2
Col1  Col2  Col3  Col4      Col1  Col2  Col3  Col4
1     2     3     4         4     3     2     1
4     3     2     1         a     b     c     d
10    11    12    13        12    11    13    10
a1    b     c     d         a4    b4    c     d4

Line 1: no matches
Line 2: # F1, L2 matches F2, L1
Line 3: # F1, L3, Col2 matches F2, L3, Col2
Line 4: # F1, L4, Cols 2, 3 & 4 match F2, L2, Cols 2, 3 & 4
        # and also matches contents of F2, L4, Col3
        # Do both satisfy your criteria?
Where "F1" (in the data sample) means File1, "L2" means Line 2 and "Col" and "Cols" are -- I hope -- self explanatory.
As hinted at in one of the earlier replies, sometimes it's worth the effort to create a suitable utility to make a "simple" operation even simpler. It also allows you to add in some useful flexibility that will help to make your command line usage more effective with less typing.
I have to do a lot of "join"-like operations (actually, things like intersections, unions, and xors) on pairs of arbitrary lists or tables that vary as to delimiters and locations of key fields, so I wrote this "general purpose" tool: cmpcol. You haven't shown any samples of your data yet, so I don't know whether this tool might be useful to you, but I've had occasion to use it (and be glad to have it) just about every day since I wrote it.
First, the assumption that a "more compact" Perl program will
execute faster is not true. In fact the opposite is often true!
The algorithm used will typically make far, far more difference.
Also aside from execution speed, Perl compiles at lightning speed
and whether you have a "one liner" or 1,000 lines usually makes
no real difference at all.
graff's cmpcol utility looks to be pretty flexible. If that
critter does all you need, then I think we're done.
I see that the content of the OP (original post) has been restored.
A few general comments on it related to performance:
1) In general, reading a line at a time and processing it right then
works out better than
slurping all the data into an array which is then later processed
line by line anyway. You start out by essentially
making a verbatim memory resident copy of both files. If they
are big files, this alone will take noticeable time. Aside from the
file I/O time, the construction (memory allocation) and copying of
the data into the array takes time.
2) For every line in the first file, you cycle through all of the lines
in the second file. This can be very expensive execution time-wise!
This is a #lines(file 1) * #lines(file 2) situation.
3) Going back to re-process the same data again and again is "expensive".
Perl split() is a nice critter, but this is not a "cheap" function. Every
trip through the file2 data (of possibly many trips) requires this
at each line.
4) To make your code faster, the general idea would be to "do something
very significant" with each line read and, to the extent possible, not
process the same data twice.
5) I would be thinking of making a data structure, an AoA or a hash table
for the first file (not a simple "verbatim" copy of that file) which
contains the "search or join term" and the complete line (for output).
Cycle through file2 just once. At each line, decide if there is a match or not
with some term in the file1 data structure. That way file2 is only processed
one time.
6) One technique that is sometimes overlooked is that with Perl you can
build dynamic regexes on the fly! You could build a single regex that
describes all of the terms in file1 and run that regex against each
sequential line of file2:
my @terms_found = $line =~ m/...huge regex.../g;
Use the "quote regex" (qr//) syntax to build it.
7) Another technique that is sometimes overlooked is the use of system
sort to simplify the processing. If these are really big files, this idea
may work out also.
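To make point 6 concrete, here is a minimal sketch of building one regex from a list of terms and running it against each line. The term list, the sample lines and the names (@terms, $term_re, %found_on) are made up for illustration, and the \b anchors assume every term begins and ends with a word character; adjust for your real identifiers.

```perl
use strict;
use warnings;

# Hypothetical search terms, standing in for the key column of file 1.
my @terms = ('CG100', 'CG200', 'CG20');

# quotemeta escapes any regex metacharacters in a term.
# The \b anchors keep 'CG20' from matching inside 'CG200'.
my $alternation = join '|', map { quotemeta($_) } @terms;
my $term_re     = qr/\b(?:$alternation)\b/;

# Hypothetical lines, standing in for file 2.
my @file2_lines = ('a b CG200', 'c d CG999', 'CG100 e CG20');

# For each line, collect every term that occurs in it.
my %found_on;
for my $line (@file2_lines) {
    my @terms_found = $line =~ /$term_re/g;
    $found_on{$line} = \@terms_found if @terms_found;
}

for my $line (sort keys %found_on) {
    print "$line: @{ $found_on{$line} }\n";
}
```

Note that this matches a term anywhere in the line, not in one particular column, so it only fits if that is really what you want.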
The possibilities to fine tune the performance are not endless, but many.
Some examples of your files as well as typical sizes would be very
helpful. I think if you implement up to step 5 of the above, the performance increase will be noticeable. Again, split() is great, but it is not a "cheap" function in terms of CPU. If you just put file2 into a better structure and didn't run split() so often, that alone would increase performance.
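Since no sample data has been posted, the following is only a rough sketch of step 5 under assumptions of my own: whitespace-delimited columns, the key in column 2 of file 1 and column 3 of file 2, and in-memory strings standing in for the real files. The names (%line_for, $key_col1, $key_col2) are hypothetical.

```perl
use strict;
use warnings;

# Assumed key positions (0-based): column 2 of file 1, column 3 of file 2.
my ($key_col1, $key_col2) = (1, 2);

# In-memory stand-ins so the sketch runs as-is; with real data,
# open the actual file names instead of these string references.
my $file1 = "r1 CG100 x\nr2 CG200 y\nr3 CG300 z\n";
my $file2 = "a b CG200\nc d CG999\ne f CG100\n";

# Pass 1: index file 1 by its key column, keeping the whole line for output.
# If a key repeats in file 1, the last line with that key wins.
my %line_for;
open(my $in1, '<', \$file1) or die "Cannot read file 1 data: $!";
while (my $line = <$in1>) {
    chomp $line;
    my @cols = split ' ', $line;
    next unless defined $cols[$key_col1];
    $line_for{ $cols[$key_col1] } = $line;
}
close $in1;

# Pass 2: read file 2 exactly once; each line is split once,
# then its key is looked up in the hash.
my @matches;
open(my $in2, '<', \$file2) or die "Cannot read file 2 data: $!";
while (my $line = <$in2>) {
    chomp $line;
    my @cols = split ' ', $line;
    my $key  = $cols[$key_col2];
    next unless defined $key && exists $line_for{$key};
    push @matches, "$line_for{$key}\t$line";
}
close $in2;

print "$_\n" for @matches;
```

The work here is roughly #lines(file 1) + #lines(file 2) instead of the product of the two, and split() runs once per line rather than once per pair of lines.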