PerlMonks  

xtracting unique lines

by anasuya (Novice)
on Mar 27, 2012 at 18:02 UTC ( [id://961989]=perlquestion )

anasuya has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I have a file which looks like this. It has two fields which are separated by a '+' sign.

    d_145_1_2- + c_3_1_8-e_74_1_1-
    a_100_1_6-c_2_1_6- + b_50_1_2-
    c_69_1_17- + b_61_6_1-
    c_2_1_2- + a_123_1_1-
    d_83_1_1- + c_2_1_5-d_162_1_1-
    c_2_1_2- + a_123_1_1-
    a_123_1_1- + c_2_1_2-

What I need to do is extract the lines which are unique in this file. For example, from the snippet of the file above, the following lines are unique:

    d_145_1_2- + c_3_1_8-e_74_1_1-
    a_100_1_6-c_2_1_6- + b_50_1_2-
    c_69_1_17- + b_61_6_1-
    c_2_1_2- + a_123_1_1-
    d_83_1_1- + c_2_1_5-d_162_1_1-

Note that the fields a_123_1_1- and c_2_1_2- occur as a pair more than once, but with their relative order reversed. Is there any way I can extract the unique lines, keeping only one occurrence of such pairs (i.e. a_123_1_1- and c_2_1_2-)? So far I have tried awk. There, I was unable to retrieve the unique lines with uniq, as it does not treat the same combination of fields in reverse order as a duplicate. I also tried merging the two fields together and then carrying out awk operations, but to no avail. Is there a way Perl makes this job easier?

Replies are listed 'Best First'.
Re: xtracting unique lines
by Happy-the-monk (Canon) on Mar 27, 2012 at 18:22 UTC

    I'd split each line on the " + " string and sort {$a cmp $b} the pair into a temporary array. As the hash key, use that array joined with the " + " string again, and as the hash value the original line. When done, print out all the values.

    To shorten it, make the split, sort and join in one go and you get rid of the temporary array.
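The split, sort and join approach described above could be sketched as follows. This is only an illustration, not the poster's own code: the helper name `unique_pairs` and the `@order` array are assumptions added here. The array is there because Perl does not return hash values in input order, so printing the values directly would shuffle the lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the unique lines, treating
# "x + y" and "y + x" as the same pair.
sub unique_pairs {
    my @lines = @_;
    my ( %first_seen, @order );
    for my $line (@lines) {
        # Canonical key: split on " + ", sort the parts, join them again.
        my $key = join ' + ', sort { $a cmp $b } split / \+ /, $line;
        next if exists $first_seen{$key};    # pair already seen, in either order
        $first_seen{$key} = $line;           # original line as the hash value
        push @order, $key;                   # remember the order of first appearance
    }
    return map { $first_seen{$_} } @order;
}

my @sample = (
    'd_145_1_2- + c_3_1_8-e_74_1_1-',
    'a_100_1_6-c_2_1_6- + b_50_1_2-',
    'c_69_1_17- + b_61_6_1-',
    'c_2_1_2- + a_123_1_1-',
    'd_83_1_1- + c_2_1_5-d_162_1_1-',
    'c_2_1_2- + a_123_1_1-',
    'a_123_1_1- + c_2_1_2-',
);
print "$_\n" for unique_pairs(@sample);
```

For the sample above this keeps five lines, with the first spelling of the repeated pair (c_2_1_2- + a_123_1_1-) retained.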

    Cheers, Sören

Re: xtracting unique lines
by nemesdani (Friar) on Mar 27, 2012 at 18:19 UTC
    Read the file.
    Make a hash.
    Split the line.
    Check each part, if it exists in the hash.
    If not, fill the fields into the hash.
    Have fun while doing it!
Re: xtracting unique lines
by Cristoforo (Curate) on Mar 28, 2012 at 02:03 UTC
    Using grep, you can filter out duplicate fields by testing to see if they have been seen yet.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    {
        local $\ = "\n";    # call to print() ends in newline
        while (<DATA>) {
            chomp;
            print unless grep $seen{$_}++, split /\s+\+\s+/;
        }
    }

    Chris

    Update: Misread the question, missed that they can occur reversed.

    This should produce the results.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    {
        local $\ = "\n";    # call to print() ends in newline
        while (<DATA>) {
            chomp;
            my $sorted = join "", sort split /\s\+\s/;
            print unless $seen{$sorted}++;
        }
    }
    __DATA__
    d_145_1_2- + c_3_1_8-e_74_1_1-
    a_100_1_6-c_2_1_6- + b_50_1_2-
    c_69_1_17- + b_61_6_1-
    c_2_1_2- + a_123_1_1-
    d_83_1_1- + c_2_1_5-d_162_1_1-
    c_2_1_2- + a_123_1_1-
    a_123_1_1- + c_2_1_2-

      Hi. I tried out what you said above. It worked, thanks. What I need to do next is count the occurrences of each of these lines. As you can see in <DATA>, the string "c_2_1_2- + a_123_1_1-" occurs twice and its reverse, "a_123_1_1- + c_2_1_2-", occurs once. I need a cumulative count for this pair, irrespective of the order in which it occurs (i.e. as "a_123_1_1- + c_2_1_2-" or as "c_2_1_2- + a_123_1_1-"), so that the total count for this entry is 3, as in <DATA>. The actual file I am working on is similar but larger, with around 8000 lines. What is the solution to this problem? awk hasn't helped me so far.
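One possible way to get the cumulative count is to reuse the sorted-pair key from the script above and let the hash value be a counter instead of a seen flag. The following is a minimal sketch under that assumption. The helper name `count_pairs` is made up for this example, and the key is joined with " + " rather than an empty string so that it stays readable when printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns [ key, count ] pairs in order of first
# appearance, where "x + y" and "y + x" are counted under the same key.
sub count_pairs {
    my @lines = @_;
    my ( %count, @order );
    for my $line (@lines) {
        my $key = join ' + ', sort split /\s+\+\s+/, $line;
        # Post-increment returns the old value, so the key is
        # pushed onto @order only the first time it is seen.
        push @order, $key unless $count{$key}++;
    }
    return map { [ $_, $count{$_} ] } @order;
}

my @sample = (
    'd_145_1_2- + c_3_1_8-e_74_1_1-',
    'a_100_1_6-c_2_1_6- + b_50_1_2-',
    'c_69_1_17- + b_61_6_1-',
    'c_2_1_2- + a_123_1_1-',
    'd_83_1_1- + c_2_1_5-d_162_1_1-',
    'c_2_1_2- + a_123_1_1-',
    'a_123_1_1- + c_2_1_2-',
);
printf "%d\t%s\n", $_->[1], $_->[0] for count_pairs(@sample);
```

With the sample data the pair a_123_1_1- + c_2_1_2- is reported with a count of 3 and every other pair with a count of 1. For a real file the lines would be read from a filehandle and chomped first. A few thousand lines is a small input for a hash-based count like this.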

Re: xtracting unique lines
by johngg (Canon) on Mar 28, 2012 at 07:29 UTC

    This is similar to Cristoforo's solution but using a sort of Schwartzian Transform to sort the keys for the %seen hash.

    knoppix@Microknoppix:~$ perl -E '
    > open my $inFH, q{<}, \ <<EOD or die qq{open: <<HEREDOC: $!\n};
    > d_145_1_2- + c_3_1_8-e_74_1_1-
    > a_100_1_6-c_2_1_6- + b_50_1_2-
    > c_69_1_17- + b_61_6_1-
    > c_2_1_2- + a_123_1_1-
    > d_83_1_1- + c_2_1_5-d_162_1_1-
    > c_2_1_2- + a_123_1_1-
    > a_123_1_1- + c_2_1_2-
    > EOD
    >
    > my %seen;
    > print
    >     map  { qq{$_->[ 0 ]\n} }
    >     grep { ! $seen{ $_->[ 1 ] } ++ }
    >     map  { chomp; [ $_, join( q{:}, sort split m{ \+ }, $_ ) ] }
    >     <$inFH>;'
    d_145_1_2- + c_3_1_8-e_74_1_1-
    a_100_1_6-c_2_1_6- + b_50_1_2-
    c_69_1_17- + b_61_6_1-
    c_2_1_2- + a_123_1_1-
    d_83_1_1- + c_2_1_5-d_162_1_1-
    knoppix@Microknoppix:~$
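The one-liner above reads from an in-memory filehandle (a filehandle opened on a reference to a string) and pushes every line through a map, grep, map pipeline. A script-file version of the same idea might look like the sketch below. The helper name `unique_from_fh` and the variable names are illustrative assumptions, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the map/grep/map pipeline.
# Read the stages bottom-up: they run in the numbered order.
sub unique_from_fh {
    my ($fh) = @_;
    my %seen;
    return
        map  { $_->[0] }                  # 3. unwrap the original line
        grep { !$seen{ $_->[1] }++ }      # 2. keep the first line for each key
        map  { chomp; [ $_, join ':', sort split / \+ /, $_ ] }    # 1. pair line with sorted key
        <$fh>;
}

my $data = <<'EOD';
d_145_1_2- + c_3_1_8-e_74_1_1-
a_100_1_6-c_2_1_6- + b_50_1_2-
c_69_1_17- + b_61_6_1-
c_2_1_2- + a_123_1_1-
d_83_1_1- + c_2_1_5-d_162_1_1-
c_2_1_2- + a_123_1_1-
a_123_1_1- + c_2_1_2-
EOD

# Opening a filehandle on a scalar reference reads from the string.
open my $in, '<', \$data or die "open in-memory file: $!";
print "$_\n" for unique_from_fh($in);
close $in;
```

Because the helper only needs a filehandle, the same function should work unchanged on a handle opened on a real file.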

    I hope this is of interest.

    Cheers,

    JohnGG

Node Type: perlquestion [id://961989]
Approved by philipbailey