http://www.perlmonks.org?node_id=660214

nashkab has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which looks like this:
COMPUTER   DISTRIBUTION_ID      STATUS
30F-WKS    `1781183799.xxxx1'   IC---
30F-WKS    `1781183799.xxx11'   IC---
ADM34A3F9  `1781183799.41455'   IC---

I want to remove duplicate entries for one single instance of COMPUTER column.

So the output file will look like this:

COMPUTER   DISTRIBUTION_ID      STATUS
30F-WKS    `1781183799.xxx11'   IC---
ADM34A3F9  `1781183799.41455'   IC---

I have used the following code:

open(FILE2,">file1.txt") || warn "Could not open\n";
open(FILE3,"file2.txt")  || warn "Could not open\n";
my $Previous = "";
my @data = <FILE3>;
$index = 0;
foreach $_data (@data) {
    $index++;
    chomp ($_data);
    @Current  = split(/\t/, $_data);
    @Previous = split(/\t/, $Previous);
    if (@Current[0] ne @Previous[0]) {
        if ($index == 1) {
            # do nothing.
        }
        else {
            print FILE2 $Previous;
        }
    }
    else {}
    $Previous = $_data;
}
close(FILE2);
close(FILE3);

However, the duplicates are not getting removed. Please help me with the correct code.

UPDATE: when there are duplicates, I need the final entry to be kept.

Replies are listed 'Best First'.
Re: Remove Duplicates!! Please Help
by kirillm (Friar) on Jan 03, 2008 at 15:37 UTC

    A one-liner:

    perl -lnae 'print unless $seen{$F[0]}++' < file2.txt > file1.txt

    Update: The OP needed the last entry if there are duplicates; the above solution took the first entry. Here's an alternative solution that requires the module Tie::Hash::Indexed to be installed:

    $ perl -MTie::Hash::Indexed -lane '
        sub BEGIN { tie %seen, "Tie::Hash::Indexed" }
        sub END   { print $seen{$_} for keys %seen }
        $seen{$F[0]} = $_' file2.txt > file1.txt
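    If installing Tie::Hash::Indexed isn't an option, a plain-Perl sketch of the same keep-the-last-entry idea (filenames as in the OP's code, tab-delimited input assumed) could look like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Keep the LAST line seen for each value of the first (COMPUTER) column,
    # printing the kept lines in the order the keys first appeared.
    my (%last, @order);

    open my $in,  '<', 'file2.txt' or die "Could not open file2.txt: $!";
    open my $out, '>', 'file1.txt' or die "Could not open file1.txt: $!";

    while (my $line = <$in>) {
        my ($key) = split /\t/, $line;
        push @order, $key unless exists $last{$key};  # remember first-seen order
        $last{$key} = $line;                          # later lines overwrite earlier ones
    }
    print {$out} $last{$_} for @order;

    close $in;
    close $out;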
      Marginally less overhead if the data is sorted:
      perl -lane 'print unless $F[0] eq $prev; $prev = $F[0]' < file2.txt > file1.txt
      Also, the OP's code handles the boundary condition for the first record, but not the last.
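      A minimal fix for that, keeping the OP's variable names, is to flush the pending line once the loop finishes -- an untested sketch:

      # after the foreach loop ends, the final group's line
      # is still sitting in $Previous and must be written out:
      print FILE2 $Previous, "\n" if $Previous ne "";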

           "As you get older three things happen. The first is your memory goes, and I can't remember the other two... " - Sir Norman Wisdom

      This one-liner would not give the same results as requested: it prints the first entry found for a given hostname, though per the example it should be printing the last one (unless there's a typo and the 1781183799.xxx11 should be 1781183799.xxxx1 as the first entry).
Re: Remove Duplicates!! Please Help
by davidrw (Prior) on Jan 03, 2008 at 15:46 UTC
    Your solution works, except that the newlines have been stripped (by the chomp). If you just change the print FILE2 line to:
    print FILE2 $Previous, "\n";
    Then it will do what you want.

    Here's an alternate command-line solution:
    perl -ane 'print unless $seen{$F[0]}++' /tmp/data
    See perlrun for the -ane options. Here's what it looks like in use:
    [me@host me]$ cat /tmp/data
    COMPUTER   DISTRIBUTION_ID      STATUS
    30F-WKS    `1781183799.xxxx1'   IC---
    30F-WKS    `1781183799.xxx11'   IC---
    ADM34A3F9  `1781183799.41455'   IC---
    [me@host me]$ perl -ane 'print unless $seen{$F[0]}++' /tmp/data
    COMPUTER   DISTRIBUTION_ID      STATUS
    30F-WKS    `1781183799.xxxx1'   IC---
    ADM34A3F9  `1781183799.41455'   IC---
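    Per perlrun, the -a and -n switches make that one-liner behave roughly like this explicit program (a sketch of the equivalence, not the exact code perl generates):

    my %seen;
    while (<>) {                       # -n: loop over input lines
        my @F = split ' ', $_;         # -a: autosplit each line into @F
        print unless $seen{$F[0]}++;   # keep only the first line per first field
    }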
Re: Remove Duplicates!! Please Help
by cdarke (Prior) on Jan 03, 2008 at 16:00 UTC
    Or try this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    open(FILE2,">file1.txt") || die "Could not open: $!\n";
    open(FILE3,"file2.txt")  || die "Could not open: $!\n";
    my @data = <FILE3>;
    close(FILE3);
    my %hash;
    foreach my $_data (@data) {
        my $computer_name = (split(/\t/, $_data))[0];
        if (! exists($hash{$computer_name})) {
            $hash{$computer_name} = undef;
            print FILE2 $_data;
        }
    }
    close(FILE2);
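    Note this keeps the first entry per computer; per the OP's update, the last one should win. One way to adapt the loop above -- a sketch assuming the same tab-delimited layout -- is to walk the data in reverse and restore the original order afterwards:

    my (%seen, @keep);
    foreach my $line (reverse @data) {
        my $computer_name = (split /\t/, $line)[0];
        # reading the file backwards, the first hit per key is the file's last
        unshift @keep, $line unless $seen{$computer_name}++;
    }
    print FILE2 @keep;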
Re: Remove Duplicates!! Please Help
by ww (Archbishop) on Jan 03, 2008 at 15:40 UTC
    You probably should do this by using a hash, but since the two items to which you refer are NOT "duplicates," perhaps you could clarify a bit.

    Update: lagged again. s/using a hash,/using a hash, perhaps as above,/

      OP said "I want to remove duplicate entries for one single instance of COMPUTER column." So, for OP's purpose's, the two items are dups -- both have the first (Computer) column value of '30F-WKS' .. which leads right into to your comment of needing a hash (e.g. see the one-liner solutions)
Re: Remove Duplicates!! Please Help
by jbert (Priest) on Jan 03, 2008 at 15:42 UTC
    If the lines are adjacent, the Unix tool 'uniq' does this job:
    uniq input_file > output_file
    If they're not adjacent, you can sort the file first (unless, as the line endings suggest, there is other structure to the file such as an HTML or XML header). This is so useful that sort has it as an option (-u), so you don't need to pipe to uniq:
    sort -u input_file > output_file
      While I love uniq, it's not a solution here. The OP said "I want to remove duplicate entries for one single instance of COMPUTER column", not eliminate duplicate lines, which is what your uniq examples do.

      My first action after reading the OP was to man uniq -- there are options to "avoid comparing the first N fields" and "avoid comparing the first N characters", but unfortunately neither of those works for comparing just the first column (of a tab-delimited file).
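      That said, sort itself can restrict the comparison to a key: POSIX sort -u suppresses all but one of each set of lines with equal keys, so limiting the key to the first field dedupes on the COMPUTER column alone (though which line survives per key is unspecified -- not necessarily the last one the OP wants):

      sort -k1,1 -u input_file > output_file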
        Fair point, sorry. I misread the sample data and gave an alternative to the (also incorrect) perl version above. Thanks for the catch.