how to compare column 1 to column 2 and vice versa from multiple rows.

BhariD has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: how to compare column 1 to column 2 and vice versa from multiple rows. by kennethk (Abbot) on Sep 30, 2009 at 17:52 UTC
First, you might do well to read How do I post a question effectively? and Markup in the Monastery to improve the clarity and readability of your question. Well-written questions tend to get a much better response.

It sounds like you have a CSV file, for which you may want to use a CSV module. One straightforward and well-vetted option is Text::CSV. While this may be a simple case, it is frequently the simple stuff that you have to keep returning to and reusing...

I do not see the matching you claim in the post. Is this simple equality, or is there a more subtle pattern? Assuming simple equality, this is a good opportunity to use hashes to organize your data. If you are unfamiliar with hashes, see Perl variable types. If you are having problems reading in the file in the first place, check out Files and I/O.

One implementation that does what I think you want is: `#!/usr/bin/perl use strict; use warnings; my %data = (); while (<DATA>) { my ($term1, $term2) = split; if (exists $data{$term2}) { print "$term2 found in both columns\n"; } $data{$term1} = $term2; } __DATA__ NP_041954.1 NP_848263.1 NP_041955.1 NP_041956.1 I_match Not_here_though I_dont_match I_match`

Note that I test before I store the value, so the matching entries must be in different rows (the spec as I understand it).
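For readability, here is a sketch of the same test-then-store logic laid out over several lines. The sub name `later_col2_matches` is made up for illustration, and grouping the sample data into rows of two columns is an assumption based on the two-variable `split` in the reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns every column-2 value that already appeared
# in column 1 of an EARLIER row (same-row matches are excluded).
sub later_col2_matches {
    my @rows = @_;
    my %data;     # column-1 value => column-2 value of the same row
    my @found;
    for my $row (@rows) {
        my ($term1, $term2) = split ' ', $row;
        next unless defined $term2;                   # skip blank or one-column lines
        push @found, $term2 if exists $data{$term2};  # test first ...
        $data{$term1} = $term2;                       # ... then store
    }
    return @found;
}

# Sample rows, assumed to be two columns each.
my @rows = (
    'NP_041954.1 NP_848263.1',
    'NP_041955.1 NP_041956.1',
    'I_match Not_here_though',
    'I_dont_match I_match',
);
print "$_ found in both columns\n" for later_col2_matches(@rows);
```

Because the lookup runs before the store, this only reports a column-2 value whose column-1 twin sits in an earlier row; a twin that appears later in the file is not caught.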
Re: how to compare column 1 to column 2 and vice versa from multiple rows. by CountZero (Bishop) on Oct 01, 2009 at 06:34 UTC
Maybe your question gets a little clearer if I reformat the data:

`NP_041954.1 NP_848263.1 NP_041955.1 NP_041956.1`
`NP_041956.1 NP_041955.1`
`NP_041957.1 NP_848264.1`

Nope, it does not get clearer. The first row (`row[0]`) has four elements/columns; the second row (`row[1]`) and third row (`row[2]`) have two elements each. Perhaps you mean the structure to be:

`NP_041954.1 NP_848263.1`
`NP_041955.1 NP_041956.1`
`NP_041956.1 NP_041955.1`
`NP_041957.1 NP_848264.1`

That makes a bit more sense. You need to print out the columns where column`[0]` of one row is equal to column`[1]` of another row and vice versa. The result should be `NP_041955.1` and `NP_041956.1`, but that doesn't look right, as these are not columns but elements. OK, we will assume you meant elements.

Assuming you are only interested in these elements, the best and fastest way to do this is to store each column in its own hash, check each element of the first hash against the second hash and vice versa, and print out the matches.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
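A minimal sketch of the two-hash idea, assuming whitespace-separated rows of two columns. The sub name `elements_in_both_columns` is hypothetical, and the result is sorted only to make the output order predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one hash per column, then report the elements
# that occur as a key in both hashes.
sub elements_in_both_columns {
    my @rows = @_;
    my (%col0, %col1);
    for my $row (@rows) {
        my @fields = split ' ', $row;
        next unless @fields >= 2;      # ignore short or blank lines
        $col0{ $fields[0] } = 1;
        $col1{ $fields[1] } = 1;
    }
    # Checking %col0 against %col1 finds the same set as the reverse check.
    my @both = sort grep { exists $col1{$_} } keys %col0;
    return @both;
}

my @rows = (
    'NP_041954.1 NP_848263.1',
    'NP_041955.1 NP_041956.1',
    'NP_041956.1 NP_041955.1',
    'NP_041957.1 NP_848264.1',
);
print "$_\n" for elements_in_both_columns(@rows);
```

With the four-row layout above this prints `NP_041955.1` and `NP_041956.1`. Note that this sketch does not check that the two occurrences come from different rows, so a row whose two columns are identical would also be reported.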
Re: how to compare column 1 to column 2 and vice versa from multiple rows. by ccn (Vicar) on Sep 30, 2009 at 17:43 UTC
Something like this: `my %seen; while (my $line = <>) { chomp $line; print "$line\n" if $seen{ join '', sort split /\s+/, $line }++; }`

Use a hash to check for duplicates. Compose the hash key in such a way that rows containing the same columns, in either order, produce the same key.

Update: The missing ++ has been added.
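The key composition can be sketched on its own as follows. The sub names `pair_key` and `reciprocal_rows` are made up for illustration. One deliberate difference from the snippet above: the sorted columns are joined with a tab rather than an empty string, on the assumption that names could otherwise run together (with `join ''`, the rows `ab c` and `a bc` would share the key `abc`).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an order-independent key for one row.
sub pair_key {
    my ($line) = @_;
    my @cols = split ' ', $line;    # ' ' also ignores leading whitespace
    return join "\t", sort @cols;   # the separator keeps names from merging
}

# Hypothetical helper: return each row whose key was already seen,
# i.e. the second (and any later) occurrence of the same unordered pair.
sub reciprocal_rows {
    my @lines = @_;
    my %seen;
    my @repeats = grep { $seen{ pair_key($_) }++ } @lines;
    return @repeats;
}

print "$_\n" for reciprocal_rows(
    'gene_a gene_b',
    'gene_b gene_a',
    'gene_c gene_d',
);
```

As in the original snippet, it is the later row of a pair that gets reported, so this prints `gene_b gene_a`.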
Re^2: how to compare column 1 to column 2 and vice versa from multiple rows. by BhariD (Sexton) on Oct 01, 2009 at 23:47 UTC
Thank you so much! Your suggestions really helped. I apologize for the formatting errors; I hope it's not too bad this time. Can I ask you one more question?

With this input file:

`gene_a gene_b`
`gene_b gene_a`

I get the following output:

`gene_a gene_b`

If the input file is something like this:

`gene_a gene_b`
`gene_b gene_a`
`gene_c gene_a`
`gene_a gene_c`
`gene_c gene_b`
`gene_b gene_c`

then I want the program to output the following:

`gene_a gene_b gene_c`

instead of:

`gene_a gene_b`
`gene_b gene_c`
`gene_c gene_a`

The thing is, I am looking for pairs for which column[0] is equal to column[1] and vice versa. This can happen for any combination of numbers (as I showed with the three above: a, b and c). Can you provide a suggestion for this case? I would really appreciate it!

Thanks,
BH
Re^3: how to compare column 1 to column 2 and vice versa from multiple rows. by ccn (Vicar) on Oct 02, 2009 at 06:13 UTC
It is not too late to insert `<code>` tags into your original post. You are able to update it at any time.

As I understand it, you just want to output the unique names of the genes instead of the raw rows. Then try this: `#!/usr/bin/perl -lan # Usage: thisscript.pl genes.txt if ( $seen{ join ' ', sort @F }++ ) { $uniq{$F[0]}++; $uniq{$F[1]}++; } END { print for keys %uniq; }`

And this:

Linux version: `perl -lane '@u{@F}=() if $s{join "", sort @F}++ }{ print for keys %u' genes.txt`

Windows version: `perl -lane "@u{@F}=() if $s{join '', sort @F}++ }{ print for keys %u" genes.txt`

where `genes.txt` is a file containing the gene rows.

Feel free to ask if you need explanations of the algorithm and its implementation.
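For anyone who finds the one-liners hard to read, here is a rough long-hand sketch of the same idea under `use strict`. The sub name `genes_in_reciprocal_pairs` is hypothetical, and the sorting is added only so the output order is stable (`keys` returns names in no particular order).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect every name that belongs to a pair
# which shows up a second time, in either column order.
sub genes_in_reciprocal_pairs {
    my @lines = @_;
    my (%seen, %uniq);
    for my $line (@lines) {
        my @cols = split ' ', $line;   # roughly what -a does into @F
        next unless @cols == 2;
        # True from the second sighting of the same unordered pair onward.
        if ( $seen{ join ' ', sort @cols }++ ) {
            @uniq{@cols} = ();         # hash slice: remember both names once
        }
    }
    my @names = sort keys %uniq;
    return @names;
}

my @input = (
    'gene_a gene_b', 'gene_b gene_a',
    'gene_c gene_a', 'gene_a gene_c',
    'gene_c gene_b', 'gene_b gene_c',
);
print "$_\n" for genes_in_reciprocal_pairs(@input);
```

For the six-row input from the question this prints `gene_a`, `gene_b` and `gene_c`, one per line. One caveat that applies to the key-based approach in general: an exact duplicate row (for example `gene_a gene_b` listed twice) produces the same key as a reversed row, so it is counted as a match too.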
Re^4: how to compare column 1 to column 2 and vice versa from multiple rows. by BhariD (Sexton) on Oct 02, 2009 at 18:46 UTC
Re^5: how to compare column 1 to column 2 and vice versa from multiple rows. by ccn (Vicar) on Oct 02, 2009 at 20:44 UTC

