http://www.perlmonks.org?node_id=962047


in reply to xtracting unique lines

Using grep, you can filter out duplicate fields by testing whether they have already been seen.
#!/usr/bin/perl

use strict;
use warnings;

my %seen;

{
    local $\ = "\n";    # call to print() ends in newline
    while ( <DATA> ) {
        chomp;
        print unless grep $seen{$_}++, split /\s+\+\s+/;
    }
}

Chris

Update: I misread the question and missed that the fields can occur in reversed order.

This should produce the desired results.

#!/usr/bin/perl

use strict;
use warnings;

my %seen;

{
    local $\ = "\n";    # call to print() ends in newline
    while ( <DATA> ) {
        chomp;
        # Sort the fields so reversed pairs produce the same key.
        my $sorted = join "", sort split /\s\+\s/;
        print unless $seen{$sorted}++;
    }
}

__DATA__
d_145_1_2- + c_3_1_8-
e_74_1_1-
a_100_1_6-
c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-
d_162_1_1-
c_2_1_2- + a_123_1_1-
a_123_1_1- + c_2_1_2-

Re^2: xtracting unique lines
by anasuya (Novice) on Mar 28, 2012 at 11:07 UTC

    Hi. I tried out what you said above, and it worked. Thanks. What I need to do next is count the occurrences of each of these lines. As you can see in <DATA>, the string "c_2_1_2- + a_123_1_1-" occurs 2 times and its reverse, "a_123_1_1- + c_2_1_2-", occurs once. I need a cumulative count for this pair regardless of the order in which it occurs (i.e. as "a_123_1_1- + c_2_1_2-" or as "c_2_1_2- + a_123_1_1-"), so that the total count for this entry is 3 as in <DATA>. The actual file I am working on is similar but larger, around 8000 lines. What is the solution to this problem? awk hasn't helped me so far.
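One possible sketch for this follow-up (not from the thread itself): keep the same order-insensitive key from the script above, but tally a count per key instead of filtering, and remember the first spelling of each line for display. The `__DATA__` block here is an illustrative subset, not the full file.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my %count;      # canonical key => cumulative count
my %display;    # canonical key => the line as it first appeared

while ( my $line = <DATA> ) {
    chomp $line;
    next unless $line =~ /\S/;    # skip blank lines

    # Canonical form: sort the fields so "a + b" and "b + a" share one key.
    my $key = join ' + ', sort split /\s+\+\s+/, $line;

    $display{$key} = $line unless exists $display{$key};
    $count{$key}++;
}

# Report each unique pair with its cumulative, order-insensitive count.
for my $key ( sort keys %count ) {
    print "$display{$key}\t$count{$key}\n";
}

__DATA__
c_2_1_2- + a_123_1_1-
a_123_1_1- + c_2_1_2-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-
```

On this sample input the first three lines collapse to one key, so the report shows "c_2_1_2- + a_123_1_1-" with a count of 3 and "d_83_1_1- + c_2_1_5-" with a count of 1. A hash keyed on the sorted fields stays O(n) in the number of lines, so 8000 lines is no problem.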