PerlMonks  

how to compare column 1 to column 2 and vice versa from multiple rows.

by BhariD (Sexton)
on Sep 30, 2009 at 17:33 UTC ( #798411=perlquestion )
BhariD has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am new to Perl and am trying to extract raw data into a more meaningful format. Any kind of help will be appreciated. I have this input file with two columns, each of which is read into an array: NP_041954.1 NP_848263.1 NP_041955.1 NP_041956.1 ----- 1 row NP_041956.1 NP_041955.1 ------ 2 row NP_041957.1 NP_848264.1 I want to get the columns for the condition where column[0] of one row is equal to column[1] of another row and vice versa. For example, this condition holds true for the 1st and 2nd rows marked above. I need the output printed only once, showing column 1 and column 2 (i.e., in this case, NP_041955.1 and NP_041956.1) for which the condition is true. What would be a good approach to get this? Thanks

Re: how to compare column 1 to column 2 and vice versa from multiple rows.
by ccn (Vicar) on Sep 30, 2009 at 17:43 UTC
    Something like this:
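    my %seen;
    while (my $line = <>) {
        chomp $line;
        # Sorting the columns gives "a b" and "b a" the same key;
        # joining with a space keeps distinct pairs from colliding.
        print "$line\n" if $seen{ join ' ', sort split /\s+/, $line }++;
    }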

    Use a hash to check for duplicates. Compose the hash key in such a way that rows containing the same columns, in either order, produce the same key.

    Update: The missing ++ has been added.

      Thank you so much!! Your suggestions really helped. I apologize for the formatting errors. I hope it's not too bad this time.

      Can I ask you one more question? With this input file (below):
      gene_a gene_b
      gene_b gene_a

      I get the following output:
      gene_a gene_b

      If the input file is something like this:
      gene_a gene_b
      gene_b gene_a
      gene_c gene_a
      gene_a gene_c
      gene_c gene_b
      gene_b gene_c

      Then I want the program to output the following:
      gene_a gene_b gene_c

      instead of:
      gene_a gene_b
      gene_b gene_c
      gene_c gene_a

      The thing is, I am looking for pairs for which column[0] is equal to column[1] and vice versa. This can happen for any number of genes (as I showed with the three above: a, b and c). Can you provide a suggestion for this case? I would really appreciate it!

      Thanks

      BH

        It is not too late to insert <code> tags into your original post. You can update it at any time.

        As I understand it, you just want to output the unique gene names instead of the raw rows. Then try this:

        #!/usr/bin/perl -lan
        # Usage: thisscript.pl genes.txt
        if ( $seen{ join ' ', sort @F }++ ) {
            $uniq{$F[0]}++;
            $uniq{$F[1]}++;
        }
        END { print for keys %uniq; }

        And this:

        Linux version:
        perl -lane '@u{@F}=() if $s{join "", sort @F}++ }{ print for keys %u' genes.txt

        Windows version:
        perl -lane "@u{@F}=() if $s{join '', sort @F}++ }{ print for keys %u" genes.txt

        Where genes.txt is a file containing the gene rows.

        Feel free to ask if you need explanations of the algorithm and its implementation.
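
If the goal is the one-line grouping asked for in the follow-up question (gene_a gene_b gene_c), one possible reading is to link genes whose pair shows up in both orders and then print each connected group. The following is only a sketch under that assumption: the sub name reciprocal_groups and the inline sample rows are hypothetical, and, like the scripts above, it treats a pair as reciprocal once its sorted key has been seen twice, so an exact duplicate row would count too.

```perl
use strict;
use warnings;

# Hypothetical helper: group genes that are linked by reciprocal pairs.
sub reciprocal_groups {
    my @lines = @_;
    my (%seen, %adj);
    for my $line (@lines) {
        my @fields = split ' ', $line;
        next unless @fields == 2;              # skip blank or malformed rows
        my ($x, $y) = sort { $a cmp $b } @fields;
        # Link the two genes only when the same pair is seen a second time.
        next unless $seen{"$x $y"}++;
        $adj{$x}{$y} = $adj{$y}{$x} = 1;
    }

    # Breadth-first walk over the links to collect each connected group.
    my (%visited, @groups);
    for my $start (sort keys %adj) {
        next if $visited{$start}++;
        my @queue = ($start);
        my @members;
        while (defined(my $gene = shift @queue)) {
            push @members, $gene;
            push @queue, grep { !$visited{$_}++ } sort keys %{ $adj{$gene} };
        }
        push @groups, join ' ', sort { $a cmp $b } @members;
    }
    return @groups;
}

# Sample rows taken from the follow-up question.
my @gene_rows = (
    'gene_a gene_b', 'gene_b gene_a',
    'gene_c gene_a', 'gene_a gene_c',
    'gene_c gene_b', 'gene_b gene_c',
);
print "$_\n" for reciprocal_groups(@gene_rows);
```

Note that this groups by connectivity: if gene_a/gene_b and gene_b/gene_c are reciprocal but gene_a/gene_c is not, all three still end up in one group. Whether that is wanted depends on the data.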

Re: how to compare column 1 to column 2 and vice versa from multiple rows.
by kennethk (Monsignor) on Sep 30, 2009 at 17:52 UTC
    First, you might do well to read How do I post a question effectively? and Markup in the Monastery to improve the clarity and readability of your question. Well-written questions tend to get a much better response.

    It sounds like you have a CSV file, for which you may want to use a CSV module. One straight-forward-to-use and well-vetted one is Text::CSV. While this may be a simple case, frequently it is the simple stuff that you have to keep returning to and reusing...

    I do not see the matching you claim in the post. Is this simple equality, or is there a more subtle pattern? Assuming simple equality, this is a good opportunity to use hashes to organize your data. If you are unfamiliar with hashes, see Perl variable types. If you are having problems reading in the file in the first place, check out Files and I/O. One implementation that does what I think you want is:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %data = ();

    while (<DATA>) {
        my ($term1, $term2) = split;
        if (exists $data{$term2}) {
            print "$term2 found in both columns\n";
        }
        $data{$term1} = $term2;
    }

    __DATA__
    NP_041954.1 NP_848263.1
    NP_041955.1 NP_041956.1
    I_match Not_here_though
    I_dont_match I_match

    Note I test before I store the value so that the entries must be in different rows (the spec as I understand it).
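
If the requirement is the stricter one from the question, where column[0] of one row equals column[1] of another row and vice versa, the same test-before-store idea can be applied to whole pairs rather than single terms. This is a minimal sketch under that assumption; the sub name reciprocal_pairs and the inline sample rows are hypothetical and not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: return each pair whose swapped counterpart
# appeared in an earlier row, reported once as "first second" (sorted).
sub reciprocal_pairs {
    my @lines = @_;
    my (%partners, %reported, @found);
    for my $line (@lines) {
        my ($term1, $term2) = split ' ', $line;
        next unless defined $term2;            # skip blank or short rows
        # Test before storing, so the match must come from another row.
        if (exists $partners{$term2} && $partners{$term2}{$term1}) {
            my $key = join ' ', sort { $a cmp $b } ($term1, $term2);
            push @found, $key unless $reported{$key}++;
        }
        $partners{$term1}{$term2} = 1;
    }
    return @found;
}

# Sample rows taken from the original question.
my @sample_rows = (
    'NP_041954.1 NP_848263.1',
    'NP_041955.1 NP_041956.1',
    'NP_041956.1 NP_041955.1',
    'NP_041957.1 NP_848264.1',
);
print "$_\n" for reciprocal_pairs(@sample_rows);
```

With these sample rows only the NP_041955.1 / NP_041956.1 pair satisfies the condition, and it is reported once.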

Re: how to compare column 1 to column 2 and vice versa from multiple rows.
by CountZero (Bishop) on Oct 01, 2009 at 06:34 UTC
    Maybe your question gets a little clearer if I reformat the data:
    NP_041954.1 NP_848263.1 NP_041955.1 NP_041956.1
    NP_041956.1 NP_041955.1
    NP_041957.1 NP_848264.1
    Nope, it does not get clearer. The first row (row[0]) has four elements/columns; the second row (row[1]) and third row (row[2]) have two elements each.

    Perhaps you mean the structure to be:

    NP_041954.1 NP_848263.1
    NP_041955.1 NP_041956.1
    NP_041956.1 NP_041955.1
    NP_041957.1 NP_848264.1
    That makes a bit more sense.

    You need to print out the columns where column[0] of one row is equal to column[1] of another row and vice versa. The result should be NP_041955.1 and NP_041956.1, but that doesn't look right as these are not columns but elements. OK, we will assume you meant elements.

    Assuming you are only interested in these elements, then the best and fastest way to do this is to store each column in its own hash and check each element of the first hash against the second hash and vice-versa and print out the matches.
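
The two-hash approach described above could be sketched roughly as follows. The sub name in_both_columns and the inline sample rows are hypothetical, chosen only for illustration. Because an element found in both hashes is by definition a match in either direction, a single pass over the first hash covers the "vice-versa" check as well.

```perl
use strict;
use warnings;

# Hypothetical helper: return the elements that occur in both columns.
sub in_both_columns {
    my @lines = @_;
    my (%col0, %col1);
    for my $line (@lines) {
        my ($first, $second) = split ' ', $line;
        next unless defined $second;           # skip blank or short rows
        $col0{$first}  = 1;                    # one hash per column
        $col1{$second} = 1;
    }
    # Keep the keys of the first hash that also exist in the second.
    my @both   = grep { exists $col1{$_} } keys %col0;
    my @sorted = sort { $a cmp $b } @both;
    return @sorted;
}

# Sample rows taken from the original question.
my @np_rows = (
    'NP_041954.1 NP_848263.1',
    'NP_041955.1 NP_041956.1',
    'NP_041956.1 NP_041955.1',
    'NP_041957.1 NP_848264.1',
);
print "$_\n" for in_both_columns(@np_rows);
```

This reports elements, not pairs: rows such as "a b" and "c a" would report a even though no row is the mirror image of another, so an extra pair check is needed if the mirrored condition must hold strictly.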

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
