PerlMonks  

Selective printing of the Duplicates

by Thomas Kennll (Acolyte)
on Jan 30, 2013 at 17:21 UTC ( #1016111=perlquestion )
Thomas Kennll has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have a group of records. Based on a key column, I'm trying to identify the duplicates and then print them selectively. My data is as below:

    DATA
    30380868 N Sep 29 356200 AGEC682569 ATI + S
    30380868 ** N Sep 29 356200 AGEC682569 ATI + S
    71130740 N Sep 7 SM9481 AGEC683966 ATI + S
    71130740 ** N Sep 7 SM9481 AGEC683966 ATI + S
    32450045 N Jul 14 SN9672 AGEC685203 ATI + S
    32450045 ** N Jul 14 SN9672 AGEC685203 ATI + S
    36450223 N Aug 30 SU8329 AGEC685348 ATI + S
    34680135 N Sep 30 349450 AGEC685442 ATI + S

    DESIRED OUTCOME
    30380868 ** N Sep 29 356200 AGEC682569 ATI + S
    71130740 ** N Sep 7 SM9481 AGEC683966 ATI + S
    32450045 ** N Jul 14 SN9672 AGEC685203 ATI + S
    36450223 N Aug 30 SU8329 AGEC685348 ATI + S
    34680135 N Sep 30 349450 AGEC685442 ATI + S

So, I will be using column 1 as the key to filter out duplicates. For example, here the column 1 value 30380868 is repeating. I want to print the record with "**" in the 2nd column and ignore the rest of the duplicates. If a record's key is not repeating, print it as is. I tried this:

    my %seen = ();
    $seen{$_}++;
    next if $seen{$_} > 1;
    print;

But, it doesn't give me the desired result... Can someone please help!

Re: Selective printing of the Duplicates
by Anonymous Monk on Jan 30, 2013 at 17:26 UTC

    But, it doesn't give me the desired result...

    Of course it doesn't, you never read a file

      I only posted the logic. My code is pretty big: it extracts from a DB and then writes to a file. But I'm stuck on this part of the logic: when I try to filter the duplicates, the 1st one is always picked and the remaining ones are removed. I wanted to know how I can ignore the 1st duplicate and pick the 2nd one.
        If they are duplicates, how do you tell the 1st from the 2nd one?

        Anyway, if you use only a part of a line to find the duplicates, it makes sense. You can either reverse the input file and use the old algorithm, or you have to remember the last line seen for each key in a hash. The next problem of the latter is to get the original order of the lines.
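The second option described above (remember the last line seen for each key, and separately keep track of the original order) could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `last_per_key` is made up here, and it assumes the key is the first whitespace-separated field of each line.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the LAST line seen for each key, where the key
# is assumed to be the first whitespace-separated field. An array records
# the order in which keys first appeared, so the output order is stable.
sub last_per_key {
    my @lines = @_;
    my ( %last, @order );
    for my $line (@lines) {
        my ($key) = split ' ', $line;
        next unless defined $key;                    # skip blank lines
        push @order, $key unless exists $last{$key}; # first sighting of this key
        $last{$key} = $line;                         # later lines overwrite earlier ones
    }
    return map { $last{$_} } @order;
}

my @input = (
    "30380868 N Sep 29 356200 AGEC682569 ATI + S",
    "30380868 ** N Sep 29 356200 AGEC682569 ATI + S",
    "36450223 N Aug 30 SU8329 AGEC685348 ATI + S",
);
print "$_\n" for last_per_key(@input);
```

With the three sample lines above, this prints the starred 30380868 record followed by the 36450223 record. It relies on the starred record coming after the unstarred one for each key, as it does in the posted data.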

        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Now, is it the last one or the second one? Are these duplicates or multiplicities?

        So, given that these are repeated several times and you need the second one, I would use:

            my %seen = ();
            my $tmp;
            ...
            $seen{$_}++;
            if ($_ ne $tmp && $seen{$tmp} == 1) {
                print $_;
                $tmp = $_;
            }
            elsif ($_ eq $tmp) {
                print $_ if $seen{$_} == 2;
            }

        However, if it is the last one, then:

            my %seen = ();
            my $tmp;
            my $id;
            ...
            if ($_ ne $tmp) {
                $seen{$tmp} = $id++;
            }

        and then in the second loop:

            print $_ foreach (sort { $seen{$a} <=> $seen{$b} } keys %seen);
        baxy
Re: Selective printing of the Duplicates
by Kenosis (Priest) on Jan 30, 2013 at 17:48 UTC

    Perhaps the following will help:

    use strict;
    use warnings;

    my %seen;

    while (<DATA>) {
        chomp;
        print "$_\n" unless $seen{$_}++;
    }

    __DATA__
    30380868 N Sep 29 356200 AGEC682569 ATI + S
    30380868 N Sep 29 356200 AGEC682569 ATI + S
    71130740 N Sep 7 SM9481 AGEC683966 ATI + S
    71130740 N Sep 7 SM9481 AGEC683966 ATI + S
    32450045 N Jul 14 SN9672 AGEC685203 ATI + S
    32450045 N Jul 14 SN9672 AGEC685203 ATI + S
    36450223 N Aug 30 SU8329 AGEC685348 ATI + S
    34680135 N Sep 30 349450 AGEC685442 ATI + S

    Output:

    30380868 N Sep 29 356200 AGEC682569 ATI + S
    71130740 N Sep 7 SM9481 AGEC683966 ATI + S
    32450045 N Jul 14 SN9672 AGEC685203 ATI + S
    36450223 N Aug 30 SU8329 AGEC685348 ATI + S
    34680135 N Sep 30 349450 AGEC685442 ATI + S

    I want to print the one with "**" as 2nd column and ignore the rest of duplicates..

    Doesn't this raise the issue of the indistinguishability of identicals?

    The script prints only unique records. chomp is used in case the last line of data above doesn't end with a newline.

      Thank you!! If you notice, my data file is a space-delimited file and the records are not exactly duplicates:

          30380868 N Sep 29 356200 AGEC682569 ATI + S
          30380868 ** N Sep 29 356200 AGEC682569 ATI + S

      I'm going to split the record and then look at column 1, which here is 30380868. Then, if you notice, the 2nd column is either ** or empty. All the other columns remain the same. I want to print the record which has ** in the 2nd column. The code you provided above only gives me the 1st duplicate, which is

          30380868 N Sep 29 356200 AGEC682569 ATI + S
            my %seen;

            for ( reverse <DATA> ) {
                ( my $unstarred = $_ ) =~ s/\*\*/ /;
                print unless $seen{ $unstarred }++;
            }

            __DATA__
            30380868 N Sep 29 356200 AGEC682569 ATI + S
            30380868 ** N Sep 29 356200 AGEC682569 ATI + S
            71130740 N Sep 7 SM9481 AGEC683966 ATI + S
            71130740 ** N Sep 7 SM9481 AGEC683966 ATI + S
            32450045 N Jul 14 SN9672 AGEC685203 ATI + S
            32450045 ** N Jul 14 SN9672 AGEC685203 ATI + S
            36450223 N Aug 30 SU8329 AGEC685348 ATI + S
            34680135 N Sep 30 349450 AGEC685442 ATI + S

        The only difference from your 'desired output' is that this code prints it out in reverse order. But I don't see this as a problem, as there seems to be no inherent order in the input/output anyway...

        My apologies, as I misunderstood. Try the following:

            use strict;
            use warnings;

            my %seen;

            while (<DATA>) {
                chomp;
                my ($col1) = /(\d+)/;
                $seen{$col1} = $_ if /\*\*/;
                $seen{$col1} = $_ if !exists $seen{$col1} or $seen{$col1} !~ /\*\*/;
            }

            print "$seen{$_}\n" for keys %seen;

            __DATA__
            11111111 ** N Sep 29 356200 AGEC682569 ATI + S
            11111111 N Sep 29 356200 AGEC682569 ATI + S
            30380868 N Sep 29 356200 AGEC682569 ATI + S
            30380868 ** N Sep 29 356200 AGEC682569 ATI + S
            71130740 N Sep 7 SM9481 AGEC683966 ATI + S
            71130740 ** N Sep 7 SM9481 AGEC683966 ATI + S
            32450045 N Jul 14 SN9672 AGEC685203 ATI + S
            32450045 ** N Jul 14 SN9672 AGEC685203 ATI + S
            36450223 N Aug 30 SU8329 AGEC685348 ATI + S
            34680135 N Sep 30 349450 AGEC685442 ATI + S

        Output:

            71130740 ** N Sep 7 SM9481 AGEC683966 ATI + S
            34680135 N Sep 30 349450 AGEC685442 ATI + S
            32450045 ** N Jul 14 SN9672 AGEC685203 ATI + S
            11111111 ** N Sep 29 356200 AGEC682569 ATI + S
            36450223 N Aug 30 SU8329 AGEC685348 ATI + S
            30380868 ** N Sep 29 356200 AGEC682569 ATI + S

        The above will preferentially keep starred records, regardless of the order records are processed.
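If the original record order also matters (a hash's keys come back in no particular order), a variant along the following lines might help. It is a sketch only: `prefer_starred` is a hypothetical name, the key is assumed to be the first whitespace-separated field, and a record counts as starred only when its second field is exactly `**`. Unlike the code above, it keeps the first record seen within each class (starred or unstarred) rather than the last.

```perl
use strict;
use warnings;

# Hypothetical helper: for each key (first field), prefer a record whose
# second field is '**'; otherwise keep the first record seen for that key.
# Output follows the order in which keys first appeared in the input.
sub prefer_starred {
    my @lines = @_;
    my ( %keep, @order );
    for my $line (@lines) {
        my ( $key, $flag ) = split ' ', $line;
        next unless defined $key;                    # skip blank lines
        my $starred = ( defined $flag && $flag eq '**' ) ? 1 : 0;
        if ( !exists $keep{$key} ) {
            push @order, $key;                       # remember first-seen order
            $keep{$key} = { line => $line, starred => $starred };
        }
        elsif ( $starred && !$keep{$key}{starred} ) {
            # A starred record replaces an unstarred one already stored.
            $keep{$key} = { line => $line, starred => 1 };
        }
    }
    return map { $keep{$_}{line} } @order;
}

my @input = (
    "11111111 ** N Sep 29 356200 AGEC682569 ATI + S",
    "11111111 N Sep 29 356200 AGEC682569 ATI + S",
    "30380868 N Sep 29 356200 AGEC682569 ATI + S",
    "30380868 ** N Sep 29 356200 AGEC682569 ATI + S",
    "36450223 N Aug 30 SU8329 AGEC685348 ATI + S",
);
print "$_\n" for prefer_starred(@input);
```

For these five lines, the starred 11111111 and 30380868 records are printed, followed by the lone 36450223 record, in that order.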

Re: Selective printing of the Duplicates
by nikosv (Hermit) on Jan 31, 2013 at 12:41 UTC

    Have you tried doing that purely with SQL?

    The having clause filters the duplicates, while the join against the inline view filters on column2:

        select *
        from (
            select column1, count(*)
            from table
            group by column1
            having (count(*) > 1)
        ) a, table b
        where a.column1 = b.column1
          and b.column2 = '**';

Node Type: perlquestion [id://1016111]
Approved by Corion