http://www.perlmonks.org?node_id=721691

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a tab-separated file which may run up to 5000 lines. The file format is something like this:
XXXXXS331632 XXXXXS331632 female 40087 a5
XXXXXS331632 XXXXXS331632 female 47735 a5
XXXXXS331681 XXXXXS331681 male 40087 e6
XXXXXS331681 XXXXXS331681 male 47735 e6
XXXXXS331856 XXXXXS331856 male 40177 d1
XXXXXS331856 XXXXXS331856 male 47737 d1
What I really want to do is delete the row that appears twice, irrespective of the difference (40087, 47735) in the 4th column. I could remove either the first or the second entry. At the end, what I'd like to have is a file with the duplicate(?) entry removed.
Something like this:
XXXXXS331632 XXXXXS331632 female 40087 a5
XXXXXS331681 XXXXXS331681 male 40087 e6
XXXXXS331856 XXXXXS331856 male 40177 d1
Any suggestions, please?
Thanks for your time.

Replies are listed 'Best First'.
Re: Remove duplicate lines in a file
by RhetTbull (Curate) on Nov 05, 2008 at 16:16 UTC
    If the duplicate records will always be grouped together, you could do something like the following to keep track of the last record you've seen. I'm assuming that the first column is the key you care about. If you really care about the first 3 columns, you'll have to modify accordingly.
    use strict;
    use warnings;
    my $last = '';
    while (<>) {
        my @columns = split;
        next if $columns[0] eq $last;
        $last = $columns[0];
        print;
    }
    If the duplicate records don't necessarily follow each other, then use a hash to determine which ones you've already seen.
    use strict;
    use warnings;
    my %seen;
    while (<>) {
        my @columns = split;
        next if exists $seen{$columns[0]};
        $seen{$columns[0]} = 1;
        print;
    }
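    The same hash idea also fits in a one-liner for a quick check from the shell (file.tsv is a placeholder name here):
    perl -ane 'print unless $seen{$F[0]}++' file.tsv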
      Thanks! That was great. But just a quick thought: what if I want to remove the first entry (40087) in some cases? Do I need to sort the file by the 4th column first and then proceed, or is there a better way of doing it? Once again, thanks a lot for your reply.
        I'm not sure I understand the question. I think you're asking how to print the last entry instead of the first. This code should do that:
        use strict;
        use warnings;
        my $last_line = <>;                          # get the first line
        my $last_key  = (split ' ', $last_line)[0];  # key of that first line
        while (<>) {
            my $key = (split)[0];
            if ($key ne $last_key) {
                # new key, print the last line from the old key
                print $last_line;
            }
            $last_line = $_;
            $last_key  = $key;
        }
        print $last_line;   # very last entry won't get printed in the while loop
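        If what you had in mind was your sorting idea, a minimal sketch of that approach might look like the following. It assumes the whole file fits in memory, sorts numerically on the 4th column, and keeps the first line seen for each key, so it emits lines in 4th-column order rather than the original order:
        use strict;
        use warnings;

        # Sort all lines numerically on the 4th column, then keep the
        # first line seen for each first-column key.
        my @lines = <>;
        my %seen;
        for my $line (sort { (split ' ', $a)[3] <=> (split ' ', $b)[3] } @lines) {
            my $key = (split ' ', $line)[0];
            print $line unless $seen{$key}++;
        }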
Re: Remove duplicate lines in a file
by lostjimmy (Chaplain) on Nov 05, 2008 at 16:13 UTC
    This works:
    my %seen;
    my @lines;
    while (<DATA>) {
        my @cols = split /\s+/;
        unless ($seen{$cols[0]}++) {
            push @lines, $_;
        }
    }
    print @lines;

    __DATA__
    XXXXXS331632 XXXXXS331632 female 40087 a5
    XXXXXS331632 XXXXXS331632 female 47735 a5
    XXXXXS331681 XXXXXS331681 male 40087 e6
    XXXXXS331681 XXXXXS331681 male 47735 e6
    XXXXXS331856 XXXXXS331856 male 40177 d1
    XXXXXS331856 XXXXXS331856 male 47737 d1
    Output:
    $ ./721691.pl
    XXXXXS331632 XXXXXS331632 female 40087 a5
    XXXXXS331681 XXXXXS331681 male 40087 e6
    XXXXXS331856 XXXXXS331856 male 40177 d1
    Edit: Misread the question and used the wrong column for the ID.
Re: Remove duplicate lines in a file
by graff (Chancellor) on Nov 05, 2008 at 23:55 UTC
    I needed a general solution to this sort of problem, where I might need to eliminate lines with duplicate values in column 3 of one file, column 8 in some other file, and the combination of columns 2 and 6 in yet another file. I also wanted to choose between keeping only the first line containing duplicated content, vs. keeping only the last line. I just posted it here at the Monastery: col-uniq -- remove lines that match on selected column(s)
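    Purely to illustrate the idea (this is not graff's actual col-uniq script, and the hard-coded settings below are hypothetical stand-ins for its options), a sketch of such a generalized filter might look like this:
    use strict;
    use warnings;

    # Build the dedup key from a chosen set of 0-based column indexes,
    # and either keep the first line per key or overwrite to keep the last.
    my @key_cols  = (1, 5);   # e.g. columns 2 and 6 (hypothetical choice)
    my $keep_last = 0;        # 0 = keep first occurrence, 1 = keep last

    my (%line_for, @order);
    while (<>) {
        my @cols = split;
        my $key  = join "\t", @cols[@key_cols];
        if (!exists $line_for{$key}) {
            push @order, $key;          # remember first-seen order
            $line_for{$key} = $_;
        }
        elsif ($keep_last) {
            $line_for{$key} = $_;       # overwrite: last occurrence wins
        }
    }
    print $line_for{$_} for @order;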