http://www.perlmonks.org?node_id=660214

nashkab has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which looks like this:
COMPUTER   DISTRIBUTION_ID      STATUS
30F-WKS    `1781183799.xxxx1'   IC---
30F-WKS    `1781183799.xxx11'   IC---
ADM34A3F9  `1781183799.41455'   IC---

I want to remove duplicate entries for one single instance of COMPUTER column.

So the output file will look like this:

COMPUTER   DISTRIBUTION_ID      STATUS
30F-WKS    `1781183799.xxx11'   IC---
ADM34A3F9  `1781183799.41455'   IC---

I have used the following code:

open(FILE2,">file1.txt") || warn "Could not open\n";
open(FILE3,"file2.txt")  || warn "Could not open\n";
my $Previous = "";
my @data = <FILE3>;
$index = 0;
foreach $_data (@data) {
    $index++;
    chomp ($_data);
    @Current  = split(/\t/, $_data);
    @Previous = split(/\t/, $Previous);
    if (@Current[0] ne @Previous[0]) {
        if ($index == 1) {
            # do nothing.
        }
        else {
            print FILE2 $Previous;
        }
    }
    else {}
    $Previous = $_data;
}
close(FILE2);
close(FILE3);

However, the duplicates are not getting removed. Please help me with the correct code.

UPDATE: when there are duplicates, I need the final entry to be kept.

Replies are listed 'Best First'.
Re: Remove Duplicates!! Please Help
by kirillm (Friar) on Jan 03, 2008 at 15:37 UTC

    A one-liner:

    perl -lnae 'print unless $seen{$F[0]}++' < file2.txt > file1.txt

    Update: The OP needed the last entry if there are duplicates; the above solution took the first entry. Here's an alternative solution that requires the module Tie::Hash::Indexed to be installed:

    $ perl -MTie::Hash::Indexed -lane '
        sub BEGIN { tie %seen, "Tie::Hash::Indexed" }
        sub END   { print $seen{$_} for keys %seen }
        $seen{$F[0]} = $_' file2.txt > file1.txt
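    If installing Tie::Hash::Indexed isn't an option, a plain-Perl sketch of the same keep-the-last-entry idea (filenames as in the OP's code, tab-delimited input assumed) could look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Keep the LAST line seen for each value of the first (COMPUTER) column,
    # printing the kept lines in the order the keys first appeared.
    my (%last, @order);

    open my $in,  '<', 'file2.txt' or die "Could not open file2.txt: $!";
    open my $out, '>', 'file1.txt' or die "Could not open file1.txt: $!";

    while (my $line = <$in>) {
        my ($key) = split /\t/, $line;
        push @order, $key unless exists $last{$key};  # remember first-seen order
        $last{$key} = $line;                          # later lines overwrite earlier ones
    }
    print {$out} $last{$_} for @order;

    close $in;
    close $out;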
      Marginally less overhead if the data is sorted:
      perl -lane 'print unless $F[0] eq $prev; $prev = $F[0]' < file2.txt > file1.txt
      Also, the OP's code handles the boundary condition for the first record, but not the last.
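      A minimal fix for that, keeping the OP's variable names, is to flush the pending line once the loop finishes -- an untested sketch:

      # after the foreach loop ends, the final group's line
      # is still sitting in $Previous and must be written out:
      print FILE2 $Previous, "\n" if $Previous ne "";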

           "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

      This one-liner would not give the same results as requested: it prints the first entry found for a given hostname, though per the example it should be printing the last one (unless there's a typo and the 1781183799.xxx11 should be 1781183799.xxxx1 as the first entry).
Re: Remove Duplicates!! Please Help
by davidrw (Prior) on Jan 03, 2008 at 15:46 UTC
    Your solution works, except that the newlines have been stripped (by the chomp). If you just change the print FILE2 line to:
    print FILE2 $Previous, "\n";
    Then it will do what you want.

    Here's an alternate command-line solution:
    perl -ane 'print unless $seen{$F[0]}++' /tmp/data
    See perlrun for the -ane options. Here's what it looks like in use:
    [me@host me]$ cat /tmp/data
    COMPUTER   DISTRIBUTION_ID      STATUS
    30F-WKS    `1781183799.xxxx1'   IC---
    30F-WKS    `1781183799.xxx11'   IC---
    ADM34A3F9  `1781183799.41455'   IC---
    [me@host me]$ perl -ane 'print unless $seen{$F[0]}++' /tmp/data
    COMPUTER   DISTRIBUTION_ID      STATUS
    30F-WKS    `1781183799.xxxx1'   IC---
    ADM34A3F9  `1781183799.41455'   IC---
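    Per perlrun, the -a and -n switches make that one-liner behave roughly like this explicit program (a sketch of the equivalence, not the exact code perl generates):

    my %seen;
    while (<>) {                       # -n: loop over input lines
        my @F = split ' ', $_;         # -a: autosplit each line into @F
        print unless $seen{$F[0]}++;   # keep only the first line per first field
    }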
Re: Remove Duplicates!! Please Help
by cdarke (Prior) on Jan 03, 2008 at 16:00 UTC
    Or try this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    open(FILE2,">file1.txt") || die "Could not open: $!\n";
    open(FILE3,"file2.txt")  || die "Could not open: $!\n";
    my @data = <FILE3>;
    close(FILE3);
    my %hash;
    foreach my $_data (@data) {
        my $computer_name = (split(/\t/, $_data))[0];
        if (! exists($hash{$computer_name})) {
            $hash{$computer_name} = undef;
            print FILE2 $_data;
        }
    }
    close(FILE2);
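    Note this keeps the first entry per computer; per the OP's update, the last one should win. One way to adapt the loop above -- a sketch assuming the same tab-delimited layout -- is to walk the data in reverse and restore the original order afterwards:

    my (%seen, @keep);
    foreach my $line (reverse @data) {
        my $computer_name = (split /\t/, $line)[0];
        # reading the file backwards, the first hit per key is the file's last
        unshift @keep, $line unless $seen{$computer_name}++;
    }
    print FILE2 @keep;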
Re: Remove Duplicates!! Please Help
by ww (Archbishop) on Jan 03, 2008 at 15:40 UTC
    You probably should do this by using a hash, but since the two items to which you refer are NOT "duplicates," perhaps you could clarify a bit.

    Update: lagged again. s/using a hash,/using a hash, perhaps as above,/

      OP said "I want to remove duplicate entries for one single instance of COMPUTER column." So, for OP's purpose's, the two items are dups -- both have the first (Computer) column value of '30F-WKS' .. which leads right into to your comment of needing a hash (e.g. see the one-liner solutions)
Re: Remove Duplicates!! Please Help
by jbert (Priest) on Jan 03, 2008 at 15:42 UTC
    If the lines are adjacent, the Unix tool 'uniq' does this job:
    uniq input_file > output_file
    If they're not adjacent, you can sort the file first (unless, as the line endings suggest, there is other structure to the file such as an HTML or XML header). This is so useful that sort has it as an option (-u), so you don't need to pipe to uniq:
    sort -u input_file > output_file
      While I love uniq, it's not a solution here. The OP said "I want to remove duplicate entries for one single instance of COMPUTER column", not eliminate duplicate lines, which is what your uniq examples do.

      My first action after reading the OP was to man uniq -- there are options to "avoid comparing the first N fields" and "avoid comparing the first N characters", but unfortunately neither of those works for comparing just the first column (of a tab-delimited file).
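      That said, sort itself can restrict the comparison to a key: POSIX sort -u suppresses all but one of each set of lines with equal keys, so limiting the key to the first field dedupes on the COMPUTER column alone (though which line survives per key is unspecified -- not necessarily the last one the OP wants):

      sort -k1,1 -u input_file > output_file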
        Fair point, sorry. I misread the sample data and gave an alternative to the (also incorrect) perl version above. Thanks for the catch.