http://www.perlmonks.org?node_id=871866

PyrexKidd has asked for the wisdom of the Perl Monks concerning the following question:

I have a massive CSV file that I am trying to process to remove duplicate entries.
Really I would like to end up with two files: the new duplicate-free list and a list of the removed entries.
The csv file is in this format:

Group One,Captain,Phone Number,League Pos,etc.
Group-One,Captain,Phone Number,League Pos,etc.
GroupOne,Captain,Phone Number,League Pos,etc.
Group Two,Captain,Phone Number,League Pos,etc.
Group Three,Captain,Phone Number,League Pos,etc.
Etc. Etc. Etc.

My thinking is to pull the company name out of each line and then compare it to the rest of the file. I'm still new to regexes and I'm running into multiple problems:
#!/usr/bin/perl
use strict;
use warnings;

open my $FHIN, '<', $ARGV[0];
open my $FHOUT, '>', "$ARGV[0].new";
open my $DELLIST, '>', "ARGV[0].deleted";

foreach (<$FHIN>){
    $_ =~ s/\"//g;
    $_ =~ m/(.*?),/i;
    my $tmp = $1;
    open my $SCRATCH, '<', "./scratch.pad";
    open my $TMPOUT, '>', "./tmp.out";
    foreach (<$SCRATCH>){
        $_ = m/$tmp/ ? print $DELLIST $_ : print $TMPOUT $_;
    }
    close $SCRATCH, $TMPOUT;
    cp ($TMPOUT, $SCRATCH);
}

First, there has to be a better way than iterating through the file for each line of the file.
Second, I'm not sure what I changed, but now I am receiving these two errors:
Filehandle $TMPOUT opened only for output at /usr/share/perl/5.10/File/Copy.pm line 200.
stat() on closed filehandle $SCRATCH at /usr/share/perl/5.10/File/Copy.pm line 117.

And finally, the dup entries with special characters are not counted as dups... Would doing a search-and-replace for every special character, i.e.:
$_ =~ s/[-|\&|_|+|']/ /g;
work to make all the entries similar enough?

Thanks in advance for the assist.

EDIT

In lieu of:
$_ =~ s/[-|\&|_|+|']/ /g;
How about:
$_ =~ s/\W/ /g;

Replies are listed 'Best First'.
Re: Remove duplicate entries
by ikegami (Patriarch) on Nov 16, 2010 at 23:37 UTC
    Normalise the key (remove the junk characters) and use it as a hash key.
    my %seen;
    while (<>) {
        my @fields = ...;
        my $key = normalise($fields[0]);
        print { $seen{$key}++ ? $extras_fh : $unique_fh } $_;
    }
Re: Remove duplicate entries
by kcott (Archbishop) on Nov 17, 2010 at 01:08 UTC

    You're using the 3-argument form of open which is good but you're not checking whether it worked which is less good. Try doing something like the following whenever you open a file:

    open my $file_handle, $mode, $filename or die $!;

    While it's a matter of personal style, I'd use lowercase characters for the file handle variables; leaving uppercase for global file handles such as STDOUT, DATA, etc. and constants.

    Regarding the two file-related errors you questioned:

    • You opened $TMPOUT for writing only but then tried to read from it with cp().
    • You closed $SCRATCH then tried to use it in cp().

    There are two issues with your regular expressions:

    • You don't separate the characters in a character class with pipe characters: [-&_+'] is all you need but do look at the Version 8 Regular Expressions section in perlre regarding the positioning of a hyphen character.
    • By converting those special characters to spaces, you won't differentiate between Group One and GroupOne - you'll need to do something else here.

    Here's a piece of code that demonstrates what you want:

    #!perl

    use strict;
    use warnings;

    my %seen;

    while (my $line = <DATA>) {
        my ($key, $rest) = split /,/, $line, 2;
        $key =~ s{ [-&_+'] }{ }msx;
        $key =~ s{ ( [a-z] ) ( [A-Z] ) }{$1 $2}msx;
        if ($seen{$key}++) {
            print STDERR join '', 'DUP: ', $line;
        }
        else {
            print STDOUT join '', 'NEW: ', $key, ',', $rest;
        }
    }

    __DATA__
    Group One,Captain,Phone Number,League Pos,etc.
    Group-One,Captain,Phone Number,League Pos,etc.
    GroupOne,Captain,Phone Number,League Pos,etc.
    Group Two,Captain,Phone Number,League Pos,etc.
    Group Three,Captain,Phone Number,League Pos,etc.

    Here's the output:

    $ del_dup_prob.pl
    NEW: Group One,Captain,Phone Number,League Pos,etc.
    DUP: Group-One,Captain,Phone Number,League Pos,etc.
    DUP: GroupOne,Captain,Phone Number,League Pos,etc.
    NEW: Group Two,Captain,Phone Number,League Pos,etc.
    NEW: Group Three,Captain,Phone Number,League Pos,etc.

    Commenting out the second regex (which converts GroupOne to Group One in $key) the output becomes:

    $ del_dup_prob.pl
    NEW: Group One,Captain,Phone Number,League Pos,etc.
    DUP: Group-One,Captain,Phone Number,League Pos,etc.
    NEW: GroupOne,Captain,Phone Number,League Pos,etc.
    NEW: Group Two,Captain,Phone Number,League Pos,etc.
    NEW: Group Three,Captain,Phone Number,League Pos,etc.

    Obviously, you'll need to read in your input file and you won't want the NEW: and DUP: strings I added for illustrative purposes but otherwise you could run this code as:

    script_name input_filename > unique_rows_filename 2> duplicate_rows_filename

    Finally, given the list of special characters in your regex, there's more variety to your input data than you've shown here. That's fine, but you may need to tweak the regexes I've provided.

    -- Ken

Re: Remove duplicate entries
by aquarium (Curate) on Nov 16, 2010 at 23:47 UTC
    If you're on unix/linux, or have windows with cygwin installed, a combination of the usual suspects (utilities like sort, uniq, join, diff) in a short pipeline would most likely achieve the desired outcome. It's not a perl solution, but then I don't like to code (and debug the code) when the result is easily achievable without even touching scripting.
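    A sketch of such a pipeline (the filenames and sample data are assumed; note that which of several same-key lines `sort -u` keeps is not specified, so treat this as a starting point, not a finished solution):

```shell
# Sample data standing in for the real CSV (filename assumed)
cat > teams.csv <<'EOF'
Group One,Captain,Phone Number,League Pos,etc.
Group One,Skipper,Other Number,League Pos,etc.
Group Two,Captain,Phone Number,League Pos,etc.
EOF

# Keep one line per first (comma-separated) field
sort -t, -k1,1 -u teams.csv > teams.unique.csv

# The removed lines are whatever is in the original but not among the keepers
sort teams.csv > all.sorted
sort teams.unique.csv > kept.sorted
comm -23 all.sorted kept.sorted > teams.removed.csv
```

    This de-dupes only on exact key matches; normalising variant spellings first (e.g. with sed, as suggested below) would still be needed.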
    the hardest line to type correctly is: stty erase ^H
      And if there's dodgy input as indicated by someone, e.g. Group-One is equivalent to Group One and GroupOne... then just run it through "sed" with a regex to make them all conform to one desired form, before the de-dupe.
Re: Remove duplicate entries
by 7stud (Deacon) on Nov 17, 2010 at 02:06 UTC

    Who's to say what the correct team spelling is? If the data is the same for similarly spelled team names, how about comparing the data rather than the team name:

    my %good_data;
    my @dups;

    while (my $line = <DATA>) {
        my @fields = split /,/, $line, 2;
        my $team_info = $fields[1];
        if (! $good_data{$team_info} ){
            $good_data{$team_info} = $line;
        }
        else {
            push @dups, $line;
        }
    }

    for (values %good_data) { print; }
    print "*" x 20, "\n";
    for (@dups) { print; }

    __DATA__
    Group One,Captain1,Phone Number,League Pos,etc.
    Group-One,Captain1,Phone Number,League Pos,etc.
    GroupOne,Captain1,Phone Number,League Pos,etc.
    Group Two,Captain2,Phone Number,League Pos,etc.
    Group Three,Captain3,Phone Number,League Pos,etc.

    --output:--
    Group Three,Captain3,Phone Number,League Pos,etc.
    Group One,Captain1,Phone Number,League Pos,etc.
    Group Two,Captain2,Phone Number,League Pos,etc.
    ********************
    Group-One,Captain1,Phone Number,League Pos,etc.
    GroupOne,Captain1,Phone Number,League Pos,etc.
Re: Remove duplicate entries
by PyrexKidd (Monk) on Nov 17, 2010 at 06:55 UTC
    hehe... I see now where:
    cp ($FH_A, $FH_B);
    should be:
    cp ($file_name_a, $file_name_b);

    So here is what I came up with.
    #!/usr/bin/perl

    use strict;
    use warnings;
    use autodie;

    my %seen;

    open my $FHIN, '<', $ARGV[0] or die $!;
    open my $FHNEW, '>', "$ARGV[0].new.csv" or die $!;
    open my $FHDEL, '>', "$ARGV[0].deleted.csv" or die $!;

    foreach my $line (<$FHIN>){
        my ($key, $rest) = split /,/, $line, 2;
        $key =~ s/ [-&_+'] / /msx;
        $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx;
        ($seen{$key}++) ? print $FHDEL "DUP, $line"
                        : print $FHNEW "$key,$rest";
    }

    close $FHNEW;
    close $FHDEL;
    This works great if the search key is repeated. But what if I have a key that is misspelled, etc.? i.e.:
    __DATA__
    Group Onne,Captain,Phone Number,League Pos,etc.
    Group Oneffdfadsf,Captain,Phone Number,League Pos,etc.
    GroupOneeroneouskunk,Captain,Phone Number,League Pos,etc.
    Group Two,Captain,Phone Number,League Pos,etc.
    Group Three,Captain,Phone Number,League Pos,etc.
    Where the first part of the name is correct but there is potentially more junk at the end of the name, is there a way to match part of the string and, if part of the string matches, call it a dup?
    something like:
    $seen{$key} =~ m/$key+,/ ? print DUP : print NEW;

    the problem is the incoming data isn't consistent. ie there are ten cols across in the CSV, of the ten cols between 4 and 10 are filled in, so comparing the data is not a viable method for sorting DUP entries.
    Stylistically, I've always used all caps to represent file handles; besides, STDERR and STDOUT are just glorified file handles anyway, and they use full caps. I understand that lexically scoped file handles are not global variables and that that's the differentiation you make -- but some habits die hard.
    Again, thanks for the assistance.

      If your data is really that mangled, taking X characters from the front (e.g. with substr) isn't going to help. If, say, you take the first 8 characters so that "Group On" always goes into "Group One", what will you do with "Group Tw" (Two, Twelve, Twenty?).

      I suspect you're going to have to visually analyse your data and come up with some sort of lookup table. Work through as much of your data as you can programmatically, outputting what can't be processed to a separate file. Then, based on what's left, either extend the lookup table or edit manually.

      -- Ken

        So my thought was to use an array to store the keys, then check whether each new key matches a stored key in part or in whole, and replace the stored key with the longer string. That solves the problem of taking an arbitrary substr, so "group tw" will not replace "group twelve" and "group two"...
        It's a good thought, I think, but there is a flaw in my logic: the loop just keeps repeating and growing. (I'm dealing with about 5k lines, and I've seen $i hit over 20,000...)
        Here's what I cooked up:
        #!/usr/bin/perl

        use strict;
        use warnings;

        my %seen;
        my @gp_name = "INITIALIZING";

        open my $FHIN, '<', $ARGV[0] or die $!;
        open my $FHNEW, '>', "$ARGV[0].newest2.csv" or die $!;
        open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!;

        LABEL: foreach my $line (<$FHIN>){
            $line =~ s/["]//msx;
            my ($key, $rest) = split /,/, $line, 2;
            $key =~ s/ [-&_+'] / /msx;
            $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx;

            for (my $i = 0; $i < @gp_name; $i++) {
                print "enter my \$i = $i\n ";
                if ($key =~ $gp_name[$i]){
                    if (($gp_name[$i] cmp $key) > 0 || ($gp_name[$i] cmp $key) == 0) {
                        next LABEL;
                    }
                    elsif (($gp_name[$i] cmp $key) < 0) {
                        $gp_name[$i] = $key;
                    }
                }
                else {
                    push @gp_name, $key;
                }
            } # end for

            ($seen{$key}++) ? print $FHDEL "DUP, $line" : print $FHNEW "$key,$rest";
        } # end foreach

        close $FHNEW, $FHDEL;
Re: Remove duplicate entries
by chrestomanci (Priest) on Nov 17, 2010 at 09:50 UTC

    I know this is off topic, but I got this exact question as a perl programming test at a job interview for a senior perl developer. (There were other, harder problems as well.)

    I found it easy, and got the job. I found out afterwards that some of the other candidates failed to answer this question, despite claiming many years of perl experience.