Re^2: Remove duplicate entries

If your data is really that mangled, taking X characters from the front (e.g. with substr) isn't going to help. If, say, you take the first 8 characters so that "Group On" always goes into "Group One", what will you do with "Group Tw" (Two, Twelve, Twenty?).

I suspect you're going to have to visually analyse your data and come up some sort of lookup table. Work through as much of your data as you can programmatically; outputting what can't be processed to a separate file. Then, based on what's left, either extend the lookup table or edit manually.

-- Ken

Comment on Re^2: Remove duplicate entries

Replies are listed 'Best First'.
Re^3: Remove duplicate entries by PyrexKidd (Monk) on Nov 17, 2010 at 16:06 UTC
so my thought was to use an array to store the keys, then verify if that key exists in part or whole, then replace the part with the longest string. that solves the problem of taking arbitrary substr so "group tw" will not replace "group twelve" and "group two"... It's a good thought, I think, but there is a flaw in my logic, the loop just keeps repeating and growing. (I'm dealing with about 5k lines, I've seen $i hit over 20,000...) Here's what I cooked up: #!/usr/bin/perl use strict; use warnings; my %seen; my @gp_name = "INITIALIZING"; open my $FHIN, '<', $ARGV[0] or die $!; open my $FHNEW, '>', "$ARGV[0].newest2.csv" or die $!; open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!; LABEL: foreach my $line (<$FHIN>){ $line =~ s/["]//msx; my ($key, $rest) = split/,/, $line, 2; $key =~ s/ [-&_+'] / /msx; $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx; for (my $i = 0; $i < @gp_name; $i++) { print "enter my \$i = $i\n "; if ($key =~ $gp_name[$i]){ if (($gp_name[$i] cmp $key) > 0 \|\| ($gp_name[$i] cmp $key) == 0) { next LABEL; } elsif (($gp_name[$i] cmp $key) < 0) { $gp_name[$i] = $key; } } else { push @gp_name, $key; } }#end for ($seen{$key}++) ? print $FHDEL "DUP, $line" : print $FHNEW "$key,$rest"; }# end foreach close $FHNEW, $FHDEL; [download]	[reply] [d/l]
Re^4: Remove duplicate entries by kcott (Archbishop) on Nov 17, 2010 at 17:11 UTC
Your inner loop is iterating based on the number of elements in `@gp_name` but you are increasing the length of that array with `push @gp_name, $key;`. I suggested a lookup table and envisaged something like: `my %group_table = ( 'On' => 'One', 'Two' => 'Two', 'Twel' => 'Twelve', 'Twen' => 'Twenty', ... );` [download] So, if "Group " is common to all keys, strip that off. Then take increasingly larger substrings from what's left until you get a match. Include some limit so when you've tried X characters and still found no match, give up and put that item in a separate "bucket" for manual intervention. -- Ken	[reply] [d/l] [select]


Just another Perl shrine
	PerlMonks