Command Line Hash to print things in common between two files

by ZWcarp (Beadle)
on Jan 10, 2012 at 20:08 UTC (#947231=perlquestion)
ZWcarp has asked for the wisdom of the Perl Monks concerning the following question:

Hello all. I would like to do something akin to the join command, using Perl at the command line to build a hash and print the lines where a value is common to two files. The two files might have in common, let's say, a gene identifier number or a CG number. These are always numbers and letters, delimited somehow. What would be the best way to do this at the command line?

Original content restored above by GrandFather

I would like to accomplish this
    use strict;

    open (FILEHANDLE, "$ARGV[0]") || die("Could not open file 1 input file");
    my @file1 = <FILEHANDLE> ;
    close (FILEHANDLE);

    open (FILE2, "$ARGV[1]") || die ("Could not open file 2 input file");
    my @SAVI = <FILE2>;
    close (FILE2);

    foreach my $line1 (@file1) {
        chomp ($line1);
        (my $var1, my $var2) = split(/\t/,$line1);

        foreach my $line2 (@file2) {
            chomp($line2);
            (my $Var1, my $Var2)= split(/\t/,$line2);
            if ($var1=~m/$Var1/) {
                print $line1 ."\t" . $line2 . "\n";
            }
        }
    }
from the command line, using maybe a hash or something so that it's faster. Does anyone know how to do this sort of operation in a nice compact form? Basically, I just want a way to see whether a value in a column of one file appears in some form somewhere in a second file, and to print the lines that satisfy this.

Re: Command Line Hash to print things in common between two files
by umasuresh (Hermit) on Jan 10, 2012 at 20:34 UTC
      This is on the right track but I need to be able to tell it which columns to check because the whole lines are never going to be in common?
        You will get more useful answers if you show a few lines of the file(s) to be analysed.


        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Despite your requirement that this be "on the command line," you might solve this yourself by understanding and then extending the example offered or an answer which can be found in one of the many other SOPW's asking about essentially the same chore.

        And, yes, it's more likely the latter, since you want to test the content of a specific column (you didn't say which one) in each line of one file against the content of a specific column in any line of a second file...

        ... or is that not what you meant? The phrase "where the column from file 1 matches somewhere in file 2" makes me wonder if you're looking for any column in a given (same line number) line in file 2 that matches the content of the specified column in a particular line in file 1. Your reply to the first answer would appear to rule that out were it not for the terminal punctuation -- a question mark!

        The first step to solving your problem is probably re-stating it to yourself, in a clear, precise and unambiguous manner.
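One such restatement, offered only as a sketch: "print every pair of lines whose chosen key columns are exactly equal." Under that reading a hash replaces the nested loop. The helper names (`index_by_column`, `join_on_column`, `read_lines`), the zero-based column arguments and the tab separator are all assumptions for illustration, not anything given in the thread. Note also that the posted code tests `$var1 =~ m/$Var1/`, which is a pattern (substring) match; a hash lookup only finds exact matches, so this is a deliberate narrowing of the requirement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: index lines by the value found in one column.
# $col is zero-based; $sep is the field separator (a tab is assumed below).
sub index_by_column {
    my ($lines, $col, $sep) = @_;
    my %index;
    for my $line (@$lines) {
        my @fields = split /\Q$sep\E/, $line, -1;
        next unless defined $fields[$col] && length $fields[$col];
        push @{ $index{ $fields[$col] } }, $line;   # keep duplicates
    }
    return \%index;
}

# Hypothetical helper: pair each line of the second list with every line
# of the first list that holds the same key. Output follows file-2 order.
sub join_on_column {
    my ($lines1, $col1, $lines2, $col2, $sep) = @_;
    my $index = index_by_column($lines1, $col1, $sep);
    my @joined;
    for my $line2 (@$lines2) {
        my @fields = split /\Q$sep\E/, $line2, -1;
        my $key = $fields[$col2];
        next unless defined $key && exists $index->{$key};
        push @joined, join($sep, $_, $line2) for @{ $index->{$key} };
    }
    return @joined;
}

# Read a whole file into an array reference of chomped lines.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Could not open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# Usage (assumed): perl join_cols.pl file1 file2 [col1 [col2]]
if (@ARGV >= 2) {
    my ($file1, $file2, $col1, $col2) = @ARGV;
    $col1 //= 0;
    $col2 //= $col1;
    print "$_\n"
        for join_on_column(read_lines($file1), $col1,
                           read_lines($file2), $col2, "\t");
}

# The same idea as a command-line sketch (assumes exactly two file arguments
# and the key in the first column of both; @ARGV is still non-empty while
# the first file is being read):
#   perl -F'\t' -lane 'if (@ARGV) { push @{ $h{$F[0]} }, $_ } else { $l = $_; print "$_\t$l" for @{ $h{$F[0]} || [] } }' file1 file2
```

Each file is read once, so the work grows with the combined size of the two files rather than with their product, as it does in the nested-loop version.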

        Update: Upon posting this reply, discovered that ZWcarp had made major, un-acknowledged revisions to the OP. meh!
        Added: (and his code doesn't compile under strict: at line 15, Global symbol "@file2" requires explicit package name.)

        Re-updated. (Yech): OP's first update (prior to adding the reference to "a gene identifier number or a CG number. These are always numbers and letter delimited somehow.") left the requirement ambiguous (at least to me) so I prepped this, seeking clarification. Clearly, it's not characteristic of the new spec, but, FTR:

                File 1                          File2
        Col 1   Col2    Col3    Col4    Col 1   Col2    Col3    Col4
        1       2       3       4       4       3       2       1
        4       3       2       1       a       b       c       d
        10      11      12      13      12      11      13      10
        a1      b       c       d       a4      b4      c       d4

        Line 1: no matches
        Line 2:    # F1, L2 matches F2, L1
        Line 3:    # F1, L3,Col2 matches F2, L3, Col2
        Line 4:    # F1, L4,Cols 2, 3 & 4 match F2, L2, Cols 2, 3 & 4
                   #    and also matches contents of F2, L4, Col3
                   #    Do both satisfy your criteria?

        Where "F1" (in the data sample) means File1, "L2" means Line 2 and "Col" and "Cols" are -- I hope -- self explanatory.
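For that looser reading, here is a rough sketch that reports every File 1 cell whose value also appears in the same column of any File 2 line. The sample rows are copied from the table above; the helper names (`index_cells`, `same_column_matches`) and the whitespace separator are assumptions made for illustration. This is only one possible interpretation of the ambiguous spec.

```perl
use strict;
use warnings;

# Sample rows copied from the table above, whitespace-separated.
my @file1 = ('1 2 3 4', '4 3 2 1', '10 11 12 13', 'a1 b c d');
my @file2 = ('4 3 2 1', 'a b c d', '12 11 13 10', 'a4 b4 c d4');

# Hypothetical helper: map "column:value" to the (1-based) line numbers
# on which that value occurs in that column.
sub index_cells {
    my ($lines) = @_;
    my %where;
    for my $i (0 .. $#$lines) {
        my @f = split ' ', $lines->[$i];
        for my $c (0 .. $#f) {
            push @{ $where{ join ':', $c, $f[$c] } }, $i + 1;
        }
    }
    return \%where;
}

# Hypothetical helper: list every File 1 cell that has an equal value
# in the same column of some File 2 line.
sub same_column_matches {
    my ($lines1, $lines2) = @_;
    my $where = index_cells($lines2);
    my @report;
    for my $i (0 .. $#$lines1) {
        my @f = split ' ', $lines1->[$i];
        for my $c (0 .. $#f) {
            my $hits = $where->{ join ':', $c, $f[$c] } || [];
            for my $j (@$hits) {
                push @report,
                    sprintf('F1 L%d Col%d matches F2 L%d Col%d',
                            $i + 1, $c + 1, $j, $c + 1);
            }
        }
    }
    return @report;
}

print "$_\n" for same_column_matches(\@file1, \@file2);
```

On the sample data this reports nothing for Line 1, all four columns of F1 L2 against F2 L1, Col2 of F1 L3 against F2 L3, and Cols 2, 3 and 4 of F1 L4 against F2 L2 plus Col3 against F2 L4, which agrees with the annotations under the table.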

Re: Command Line Hash to print things in common between two files
by tobyink (Abbot) on Jan 11, 2012 at 16:59 UTC
    perl -E'say "Things in common between the files: ".(join ", ", @ARGV); say " * they evoke a sense of nostalgia for 1960s Paris."; say " * that is all."'
