move the line if particular column is duplicate or more than 1 entries

by hyans.milis (Novice)
on Mar 17, 2013 at 16:14 UTC ( [id://1023920] )

hyans.milis has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I need your help with how to move a line when the 2nd and 3rd columns are duplicated: compare the value in the 6th column, and if it is less than the previous line's 6th-column value, move that line into a different file.

Input
628801844415 510998000000015 19 22 0 NULL 0
628802944409 510998000000109 4 22 0 NULL 0
628802544405 510998000000205 4 22 0 NULL 0
628802544417 510998000000217 19 22 0 213 0
628802644413 510998000000313 19 22 0 123 0
628802644417 510998000000217 19 22 0 345 0
Output file 1
628801844415 510998000000015 19 22 0 NULL 0
628802944409 510998000000109 4 22 0 NULL 0
628802544405 510998000000205 4 22 0 NULL 0
628802644413 510998000000313 19 22 0 123 0
628802644417 510998000000217 19 22 0 345 0
Output file 2
628802544417 510998000000217 19 22 0 213 0
#!/usr/bin/perl
use strict;
use warnings;

my $file    = $ARGV[0] or die "Need to get CSV file on the command line\n";
my $outfile = $ARGV[1] or die "Need to get output file on the command line\n";

my %seen;    # keys built from the 2nd and 3rd columns

open( my $data, '<', $file ) or die "Could not open '$file' $!\n";
open( OUTFILE,   "> $outfile" )     || die "ERROR: opening $outfile\n";
open( OUTFILE_1, "> dup_$outfile" ) || die "ERROR: opening dup_$outfile\n";

while ( my $line = <$data> ) {
    chomp $line;
    my @fields = split ",", $line, -1;
    if ( !$seen{ $fields[1], $fields[2] }++ ) {
        print OUTFILE join( ",", @fields ) . "\n";      # first time this key is seen
    }
    else {
        print OUTFILE_1 join( ",", @fields ) . "\n";    # repeated key
    }
}
close($data);
close(OUTFILE);
close(OUTFILE_1);
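For reference, if this script were saved as split_dups.pl (a name used here just for illustration), it would be run as:

perl split_dups.pl input.csv output.csv

Rows whose 2nd/3rd-column key has not been seen before land in output.csv and repeats in dup_output.csv. Note that the script splits on commas while the sample input above is space-separated, so the separator has to match the real data.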

Replies are listed 'Best First'.
Re: move the line if particular column is duplicate or more than 1 entries
by Laurent_R (Canon) on Mar 17, 2013 at 21:49 UTC

    To remove duplicates, just store the second and third fields in a hash (as a key) when you read a line. Then print to file 1 or to file 2 depending on whether the key already exists in the hash.

    Something like this (untested):

    # $output1 and $output2 are assumed to be filehandles opened earlier
    my %already_seen;
    while (<>) {
        my @temp = split /\s/, $_;
        my $key  = "$temp[1] $temp[2]";
        if ( defined $already_seen{$key} ) {
            # it is a duplicate
            print $output2 $_;
        }
        else {
            # first time this key is seen
            $already_seen{$key} = 1;
            print $output1 $_;
        }
    }
    The code could be made more concise, and the @temp array could be avoided, for example with a syntax such as:

    $key = join " ", (split / /, $_) [1,2];

    but I preferred to keep it simpler to understand.

    Note that this will not work if your input file is so huge that the hash no longer fits in memory. The limit depends on your system, but in general anything below a million lines should work without problem on most current systems. If your file is larger (especially if it is much larger), you might need a different way of doing it.

    Similarly, for the 6th-column rule, just store that value in a variable when you read a line and compare the current line's 6th column with the variable (i.e. the 6th column of the previous line).

    I leave it to you to put the two rules together.
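    A rough, untested sketch of the two rules put together, line by line (it assumes space-separated fields, opens its own illustrative file names, and glosses over the NULL values that appear in the sample's 6th column):

    use strict;
    use warnings;

    # Illustrative file names only; adjust to the real ones.
    open my $in,   '<', 'input.txt' or die $!;
    open my $out1, '>', 'file1.txt' or die $!;
    open my $out2, '>', 'file2.txt' or die $!;

    my ( %seen, %last6 );
    while ( my $line = <$in> ) {
        my @f   = split /\s/, $line;
        my $key = "$f[1] $f[2]";
        if ( $seen{$key} && $f[5] < $last6{$key} ) {
            print $out2 $line;    # duplicate key with a lower 6th column
        }
        else {
            print $out1 $line;    # first occurrence, or not lower
        }
        $seen{$key}  = 1;
        $last6{$key} = $f[5];     # remember for the next line with this key
    }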

Re: move the line if particular column is duplicate or more than 1 entries
by poj (Abbot) on Mar 17, 2013 at 18:08 UTC

    I see that output 2 holds the duplicate record whose column 6 value, 213, is less than 345, except that 345 appears on a later line, not a previous one. To compare a value with both previous and later lines, you either have to scan the file twice or read all the lines into a structure and then create the output files.

    This example scans the file twice:

    # hash to hold highest values
    my %col6 = ();
    while ( my $line = <$data> ) {
        chomp $line;
        my @fields = split ",", $line, -1;
        my $key = $fields[1] . $fields[2];
        # store max values
        if ( $fields[5] > $col6{$key} ) {
            $col6{$key} = $fields[5];
        }
    }
    # reset to start
    seek $data, 0, 0;
    # read file 2nd time
    while ( my $line = <$data> ) {
        chomp $line;
        my @fields = split ",", $line, -1;
        my $key = $fields[1] . $fields[2];
        # reject lowest duplicate
        if ( $fields[5] < $col6{$key} ) {
            # extra text added for debugging
            print OUTFILE_1 $line . " - duplicate $key $col6{$key}\n";
        }
        else {
            print OUTFILE $line . "\n";
        }
    }

    Update: This simple example assumes column 6 values are never negative.
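    If negative values were possible (or simply to avoid the uninitialized-value warning the first time a key is seen), the max-tracking in the first pass could be guarded, for example:

    if ( !defined $col6{$key} or $fields[5] > $col6{$key} ) {
        $col6{$key} = $fields[5];
    }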

    poj
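    The other option mentioned above, reading everything into a structure in a single pass and then writing both files, could look roughly like this (an untested sketch that reuses the $data, OUTFILE and OUTFILE_1 handles from the original script and, as an assumption, treats the non-numeric NULL values as 0):

    my %max6;    # highest column-6 value seen for each column-2/column-3 key
    my @rows;    # every line, remembered together with its key and value
    while ( my $line = <$data> ) {
        chomp $line;
        my @fields = split ",", $line, -1;
        my $key    = $fields[1] . $fields[2];
        my $val    = $fields[5] =~ /^-?\d+$/ ? $fields[5] : 0;    # NULL -> 0 (assumption)
        push @rows, [ $line, $key, $val ];
        if ( !defined $max6{$key} or $val > $max6{$key} ) {
            $max6{$key} = $val;
        }
    }
    for my $row (@rows) {
        my ( $line, $key, $val ) = @$row;
        if ( $val < $max6{$key} ) {
            print OUTFILE_1 $line . "\n";    # a lower duplicate
        }
        else {
            print OUTFILE $line . "\n";
        }
    }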
