Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re^3: Remove duplicate entries

by PyrexKidd (Monk)
on Nov 17, 2010 at 16:06 UTC ( #872012=note: print w/replies, xml ) Need Help??

in reply to Re^2: Remove duplicate entries
in thread Remove duplicate entries

so my thought was to use an array to store the keys, then verify if that key exists in part or whole, then replace the part with the longest string. that solves the problem of taking arbitrary substr so "group tw" will not replace "group twelve" and "group two"...
It's a good thought, I think, but there is a flaw in my logic, the loop just keeps repeating and growing. (I'm dealing with about 5k lines, I've seen $i hit over 20,000...)
Here's what I cooked up:
#!/usr/bin/perl use strict; use warnings; my %seen; my @gp_name = "INITIALIZING"; open my $FHIN, '<', $ARGV[0] or die $!; open my $FHNEW, '>', "$ARGV[0].newest2.csv" or die $!; open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!; LABEL: foreach my $line (<$FHIN>){ $line =~ s/["]//msx; my ($key, $rest) = split/,/, $line, 2; $key =~ s/ [-&_+'] / /msx; $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx; for (my $i = 0; $i < @gp_name; $i++) { print "enter my \$i = $i\n "; if ($key =~ $gp_name[$i]){ if (($gp_name[$i] cmp $key) > 0 || ($gp_name[$i] cmp $key) == 0) { next LABEL; } elsif (($gp_name[$i] cmp $key) < 0) { $gp_name[$i] = $key; } } else { push @gp_name, $key; } }#end for ($seen{$key}++) ? print $FHDEL "DUP, $line" : print $FHNEW "$key,$rest"; }# end foreach close $FHNEW, $FHDEL;

Replies are listed 'Best First'.
Re^4: Remove duplicate entries
by kcott (Chancellor) on Nov 17, 2010 at 17:11 UTC

    Your inner loop is iterating based on the number of elements in @gp_name but you are increasing the length of that array with push @gp_name, $key;.

    I suggested a lookup table and envisaged something like:

    my %group_table = ( 'On' => 'One', 'Two' => 'Two', 'Twel' => 'Twelve', 'Twen' => 'Twenty', ... );

    So, if "Group " is common to all keys, strip that off. Then take increasingly larger substrings from what's left until you get a match. Include some limit so when you've tried X characters and still found no match, give up and put that item in a separate "bucket" for manual intervention.

    -- Ken

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://872012]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (3)
As of 2017-09-26 01:31 GMT
Find Nodes?
    Voting Booth?
    During the recent solar eclipse, I:

    Results (291 votes). Check out past polls.