http://www.perlmonks.org?node_id=721691

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a tab-separated file which may run up to 5000 lines. The file format is something like this:
XXXXXS331632 XXXXXS331632 female 40087 a5
XXXXXS331632 XXXXXS331632 female 47735 a5
XXXXXS331681 XXXXXS331681 male 40087 e6
XXXXXS331681 XXXXXS331681 male 47735 e6
XXXXXS331856 XXXXXS331856 male 40177 d1
XXXXXS331856 XXXXXS331856 male 47737 d1
What I really want to do is delete the row that appears twice, irrespective of the difference (40087, 47735) in the 4th column. I could remove either the first or the second entry. At the end, what I'd like to have is a file with the duplicate(?) entry removed.
Something like this:
XXXXXS331632 XXXXXS331632 female 40087 a5
XXXXXS331681 XXXXXS331681 male 40087 e6
XXXXXS331856 XXXXXS331856 male 40177 d1
Any suggestions, please?
Thanks for your time.

Replies are listed 'Best First'.
Re: Remove duplicate lines in a file
by RhetTbull (Curate) on Nov 05, 2008 at 16:16 UTC
    If the duplicate records will always be grouped together, you could do something like the following to keep track of the last record you've seen. I'm assuming that the first column is the key you care about. If you really care about the first 3 columns, you'll have to modify accordingly.
    use strict;
    use warnings;
    my $last = '';
    while (<>) {
        my @columns = split;
        next if $columns[0] eq $last;
        $last = $columns[0];
        print;
    }
    If the duplicate records don't necessarily follow each other, then use a hash to determine which ones you've already seen.
    use strict;
    use warnings;
    my %seen;
    while (<>) {
        my @columns = split;
        next if exists $seen{$columns[0]};
        $seen{$columns[0]} = 1;
        print;
    }
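    The same hash idea also fits in a one-liner for a quick check from the shell (file.tsv is a placeholder name here):
    perl -ane 'print unless $seen{$F[0]}++' file.tsv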
      Thanks! That was great. But just a quick thought: what if I want to remove the first entry (40087) in some cases? Do I need to sort the file by the 4th column first and then proceed, or is there a better way of doing it? Once again, thanks a lot for your reply.
        I'm not sure I understand the question. I think you're asking how to print the last entry instead of the first. This code should do that:
        use strict;
        use warnings;
        my $last_line = <>;                          # get the first line
        my $last_key  = (split ' ', $last_line)[0];  # key of that first line
        while (<>) {
            my $key = (split)[0];
            if ($key ne $last_key) {
                # new key, print the last line from the old key
                print $last_line;
            }
            $last_line = $_;
            $last_key  = $key;
        }
        print $last_line;   # very last entry won't get printed in the while loop
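        If what you had in mind was your sorting idea, a minimal sketch of that approach might look like the following. It assumes the whole file fits in memory, sorts numerically on the 4th column, and keeps the first line seen for each key, so it emits lines in 4th-column order rather than the original order:
        use strict;
        use warnings;

        # Sort all lines numerically on the 4th column, then keep the
        # first line seen for each first-column key.
        my @lines = <>;
        my %seen;
        for my $line (sort { (split ' ', $a)[3] <=> (split ' ', $b)[3] } @lines) {
            my $key = (split ' ', $line)[0];
            print $line unless $seen{$key}++;
        }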
Re: Remove duplicate lines in a file
by lostjimmy (Chaplain) on Nov 05, 2008 at 16:13 UTC
    This works:
    my %seen;
    my @lines;
    while (<DATA>) {
        my @cols = split /\s+/;
        unless ($seen{$cols[0]}++) {
            push @lines, $_;
        }
    }
    print @lines;

    __DATA__
    XXXXXS331632 XXXXXS331632 female 40087 a5
    XXXXXS331632 XXXXXS331632 female 47735 a5
    XXXXXS331681 XXXXXS331681 male 40087 e6
    XXXXXS331681 XXXXXS331681 male 47735 e6
    XXXXXS331856 XXXXXS331856 male 40177 d1
    XXXXXS331856 XXXXXS331856 male 47737 d1
    Output:
    $ ./721691.pl
    XXXXXS331632 XXXXXS331632 female 40087 a5
    XXXXXS331681 XXXXXS331681 male 40087 e6
    XXXXXS331856 XXXXXS331856 male 40177 d1
    Edit: Misread the question and used the wrong column for the ID.
Re: Remove duplicate lines in a file
by graff (Chancellor) on Nov 05, 2008 at 23:55 UTC
    I needed a general solution to this sort of problem, where I might need to eliminate lines with duplicate values in column 3 of one file, column 8 in some other file, and the combination of columns 2 and 6 in yet another file. I also wanted to choose between keeping only the first line containing duplicated content, vs. keeping only the last line. I just posted it here at the Monastery: col-uniq -- remove lines that match on selected column(s)
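    Purely to illustrate the idea (this is not graff's actual col-uniq script, and the hard-coded settings below are hypothetical stand-ins for its options), a sketch of such a generalized filter might look like this:
    use strict;
    use warnings;

    # Build the dedup key from a chosen set of 0-based column indexes,
    # and either keep the first line per key or overwrite to keep the last.
    my @key_cols  = (1, 5);   # e.g. columns 2 and 6 (hypothetical choice)
    my $keep_last = 0;        # 0 = keep first occurrence, 1 = keep last

    my (%line_for, @order);
    while (<>) {
        my @cols = split;
        my $key  = join "\t", @cols[@key_cols];
        if (!exists $line_for{$key}) {
            push @order, $key;          # remember first-seen order
            $line_for{$key} = $_;
        }
        elsif ($keep_last) {
            $line_for{$key} = $_;       # overwrite: last occurrence wins
        }
    }
    print $line_for{$_} for @order;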