PerlMonks  

Re^2: Remove duplicate entries

by kcott (Abbot)
on Nov 17, 2010 at 07:52 UTC (#871950)


in reply to Re: Remove duplicate entries
in thread Remove duplicate entries

If your data is really that mangled, taking X characters from the front (e.g. with substr) isn't going to help. If, say, you take the first 8 characters so that "Group On" always goes into "Group One", what will you do with "Group Tw" (Two, Twelve, Twenty)?
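To make the ambiguity concrete, here is a small sketch that groups a few names by their first 8 characters. The sample names and the helper `names_by_prefix` are made up for illustration; they are not taken from the real data.

```perl
use strict;
use warnings;

# Hypothetical sample names; the real data will differ.
my @names = ('Group One', 'Group Two', 'Group Twelve', 'Group Twenty');

# Group full names by a fixed-length prefix (illustrative helper).
sub names_by_prefix {
    my ($len, @full) = @_;
    my %by_prefix;
    push @{ $by_prefix{ substr($_, 0, $len) } }, $_ for @full;
    return \%by_prefix;
}

my $by_prefix = names_by_prefix(8, @names);
for my $prefix (sort keys %$by_prefix) {
    my @candidates = @{ $by_prefix->{$prefix} };
    # More than one candidate means the prefix cannot be resolved on its own.
    printf "%-8s => %s%s\n", $prefix, join(', ', @candidates),
        @candidates > 1 ? '  (ambiguous)' : '';
}
```

With these sample names, "Group On" maps to a single candidate while "Group Tw" maps to three, which is why a fixed-length prefix alone cannot settle it.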

I suspect you're going to have to visually analyse your data and come up with some sort of lookup table. Work through as much of your data as you can programmatically, and output what can't be processed to a separate file. Then, based on what's left, either extend the lookup table or edit manually.

-- Ken


Re^3: Remove duplicate entries
by PyrexKidd (Monk) on Nov 17, 2010 at 16:06 UTC
    My thought was to use an array to store the keys, check whether each new key matches a stored key in part or in whole, and then replace the partial key with the longest string. That avoids taking an arbitrary substr, so "group tw" will not replace both "group twelve" and "group two"...
    It's a good thought, I think, but there is a flaw in my logic: the loop just keeps repeating and growing. (I'm dealing with about 5k lines, and I've seen $i hit over 20,000...)
    Here's what I cooked up:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    my @gp_name = "INITIALIZING";

    open my $FHIN, '<', $ARGV[0] or die $!;
    open my $FHNEW, '>', "$ARGV[0].newest2.csv" or die $!;
    open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!;

    LABEL:
    foreach my $line (<$FHIN>){
        $line =~ s/["]//msx;
        my ($key, $rest) = split/,/, $line, 2;
        $key =~ s/ [-&_+'] / /msx;
        $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx;
        for (my $i = 0; $i < @gp_name; $i++) {
            print "enter my \$i = $i\n ";
            if ($key =~ $gp_name[$i]){
                if (($gp_name[$i] cmp $key) > 0 || ($gp_name[$i] cmp $key) == 0) {
                    next LABEL;
                }
                elsif (($gp_name[$i] cmp $key) < 0) {
                    $gp_name[$i] = $key;
                }
            }
            else {
                push @gp_name, $key;
            }
        }#end for
        ($seen{$key}++) ? print $FHDEL "DUP, $line" : print $FHNEW "$key,$rest";
    }# end foreach
    close $FHNEW, $FHDEL;

      Your inner loop's bound is the number of elements in @gp_name, but the loop body increases the length of that array with push @gp_name, $key; every time an element fails to match, so the bound keeps moving while you iterate.
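One way to keep the array from growing during the scan is to iterate over a range fixed before the loop starts and push at most once, after the loop. The following is only a sketch under assumptions: `register_key` is a hypothetical helper, and it uses `index` for plain substring tests instead of the regex match in the posted code, so that characters in the data are not treated as regex metacharacters. It does not resolve ambiguous fragments; a short fragment simply attaches to the first stored name that contains it.

```perl
use strict;
use warnings;

# Returns the stored name that $key belongs to, adding $key if it is new.
# $known is a reference to the array of names seen so far.
sub register_key {
    my ($known, $key) = @_;

    # The range 0 .. $#$known is computed once, before the loop body runs.
    for my $i (0 .. $#$known) {
        my $name = $known->[$i];

        # $key is a part of (or equal to) a stored name: reuse that name.
        return $name if index($name, $key) >= 0;

        # A stored name is a part of $key: keep the longer string.
        if (index($key, $name) >= 0) {
            $known->[$i] = $key;
            return $key;
        }
    }

    # Nothing matched: add the key once, after the scan has finished.
    push @$known, $key;
    return $key;
}

my @known;
register_key(\@known, $_) for 'Group One', 'Group Tw', 'Group Two', 'Group On';
print scalar(@known), " names: ", join(', ', @known), "\n";
```

With the four sample keys above, the array ends up holding two names, "Group One" and "Group Two", rather than one new entry per failed comparison.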

      I suggested a lookup table and envisaged something like:

      my %group_table = (
          'On'   => 'One',
          'Two'  => 'Two',
          'Twel' => 'Twelve',
          'Twen' => 'Twenty',
          ...
      );

      So, if "Group " is common to all keys, strip that off. Then take increasingly larger substrings from what's left until you get a match. Include some limit so when you've tried X characters and still found no match, give up and put that item in a separate "bucket" for manual intervention.
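The steps above could be sketched roughly as follows. This is a minimal illustration, not a finished solution: the table holds only the four entries shown earlier, and the `lookup_group` helper, the `$MAX_LEN` limit of 6 characters and the sample keys are all assumptions made for the example.

```perl
use strict;
use warnings;

# Lookup table: abbreviated fragment => canonical name (to be extended by hand).
my %group_table = (
    'On'   => 'One',
    'Two'  => 'Two',
    'Twel' => 'Twelve',
    'Twen' => 'Twenty',
);

my $MAX_LEN = 6;    # assumed limit: give up after trying this many characters

# Returns the canonical group name, or undef if no table entry matches.
sub lookup_group {
    my ($key) = @_;
    (my $rest = $key) =~ s/^Group\s+//;    # strip the common prefix
    my $limit = length($rest) < $MAX_LEN ? length($rest) : $MAX_LEN;

    # Try increasingly larger substrings until one is found in the table.
    for my $len (1 .. $limit) {
        my $canon = $group_table{ substr($rest, 0, $len) };
        return "Group $canon" if defined $canon;
    }
    return;    # no match within the limit: leave it for manual handling
}

my (@resolved, @manual);
for my $key ('Group One', 'Group Twelv', 'Group Tw', 'Group Xyz') {
    my $canon = lookup_group($key);
    if (defined $canon) { push @resolved, "$key => $canon" }
    else                { push @manual,   $key }
}
print "resolved: $_\n" for @resolved;
print "manual:   $_\n" for @manual;
```

Here "Group Tw" ends up in the manual bucket because neither "T" nor "Tw" is in the table, which is the desired behaviour for fragments that are too short to disambiguate.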

      -- Ken
