http://www.perlmonks.org?node_id=871938


in reply to Remove duplicate entries

hehe... I see now where:
cp ($FH_A, $FH_B);
should be:
cp ($file_name_a, $file_name_b);

So here is what I came up with.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use autodie;

    my %seen;

    open my $FHIN,  '<', $ARGV[0]               or die $!;
    open my $FHNEW, '>', "$ARGV[0].new.csv"     or die $!;
    open my $FHDEL, '>', "$ARGV[0].deleted.csv" or die $!;

    foreach my $line (<$FHIN>) {
        my ($key, $rest) = split /,/, $line, 2;
        $key =~ s/ [-&_+'] / /msx;
        $key =~ s/ ( [a-z] ) ( [A-Z] ) /$1 $2/msx;
        ($seen{$key}++) ? print $FHDEL "DUP, $line"
                        : print $FHNEW "$key,$rest";
    }

    close $FHNEW;
    close $FHDEL;
This works great if the search key is repeated exactly. But what if a key is misspelled, etc.? i.e.:
    __DATA__
    Group Onne,Captain,Phone Number,League Pos,etc.
    Group Oneffdfadsf,Captain,Phone Number,League Pos,etc.
    GroupOneeroneouskunk,Captain,Phone Number,League Pos,etc.
    Group Two,Captain,Phone Number,League Pos,etc.
    Group Three,Captain,Phone Number,League Pos,etc.
where the first part of the name is correct but there is potentially more junk at the end of the name. Is there a way to match only part of the string and, if that part matches, call it a dup?
something like:
$seen{$key} =~ m/$key+,/ ? print DUP : print NEW;

The problem is that the incoming data isn't consistent: i.e., there are ten columns across in the CSV, and of those ten, anywhere between 4 and 10 are filled in, so comparing the full row data is not a viable method for spotting DUP entries.
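One way to treat a key as a duplicate when an already-seen key is a prefix of it (or vice versa) is to scan the seen keys for a prefix relationship with index. This is only a sketch of that idea, using made-up sample names, and with the obvious caveat that a short key like "Group Tw" would match more than one longer group:

```perl
use strict;
use warnings;

my %seen;

# Return the canonical (already-seen) key if $key shares a prefix
# relationship with one we have already seen, or undef if it is new.
sub find_prefix_dup {
    my ($key) = @_;
    for my $old (keys %seen) {
        return $old if index($key, $old) == 0   # $old is a prefix of $key
                    || index($old, $key) == 0;  # $key is a prefix of $old
    }
    return;
}

for my $key ('Group One', 'Group Oneffdfadsf', 'Group Two') {
    if (defined( my $canon = find_prefix_dup($key) )) {
        print "DUP: $key (matches $canon)\n";
    }
    else {
        $seen{$key} = 1;
        print "NEW: $key\n";
    }
}
```

With this, "Group Oneffdfadsf" is flagged as a dup of "Group One", but "GroupOneeroneouskunk" would not match (the missing space breaks the prefix), which is where a cleanup pass or lookup table still matters.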
Stylistically, I've always used all caps for file handles; besides, STDERR and STDOUT are just glorified file handles anyway, and they use full caps. I understand that lexically scoped file handles are not global variables, and that that's the distinction you're making--some habits.
Again, thanks for the assistance.

Re^2: Remove duplicate entries
by kcott (Archbishop) on Nov 17, 2010 at 07:52 UTC

    If your data is really that mangled, taking X characters from the front (e.g. with substr) isn't going to help. If, say, you take the first 8 characters so that "Group On" always goes into "Group One", what will you do with "Group Tw" (Two, Twelve, or Twenty)?

    I suspect you're going to have to visually analyse your data and come up with some sort of lookup table. Work through as much of your data as you can programmatically, outputting what can't be processed to a separate file. Then, based on what's left, either extend the lookup table or edit manually.

    -- Ken

      So my thought was to use an array to store the keys, then check whether a new key matches an existing one in part or in whole, and replace the shorter with the longer string. That solves the problem of taking an arbitrary substr, so "group tw" will not clobber both "group twelve" and "group two"...
      It's a good thought, I think, but there is a flaw in my logic: the loop just keeps repeating and growing. (I'm dealing with about 5k lines, and I've seen $i hit over 20,000...)
      Here's what I cooked up:
          #!/usr/bin/perl
          use strict;
          use warnings;

          my %seen;
          my @gp_name = "INITIALIZING";

          open my $FHIN,  '<', $ARGV[0]                or die $!;
          open my $FHNEW, '>', "$ARGV[0].newest2.csv"  or die $!;
          open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!;

          LABEL: foreach my $line (<$FHIN>) {
              $line =~ s/["]//msx;
              my ($key, $rest) = split /,/, $line, 2;
              $key =~ s/ [-&_+'] / /msx;
              $key =~ s/ ( [a-z] ) ( [A-Z] ) /$1 $2/msx;
              for (my $i = 0; $i < @gp_name; $i++) {
                  print "enter my \$i = $i\n ";
                  if ($key =~ $gp_name[$i]) {
                      if (($gp_name[$i] cmp $key) > 0 || ($gp_name[$i] cmp $key) == 0) {
                          next LABEL;
                      }
                      elsif (($gp_name[$i] cmp $key) < 0) {
                          $gp_name[$i] = $key;
                      }
                  }
                  else {
                      push @gp_name, $key;
                  }
              } # end for
              ($seen{$key}++) ? print $FHDEL "DUP, $line"
                              : print $FHNEW "$key,$rest";
          } # end foreach

          close $FHNEW;
          close $FHDEL;

        Your inner loop is iterating based on the number of elements in @gp_name but you are increasing the length of that array with push @gp_name, $key;.
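        A minimal way to avoid that is to record the outcome of the scan and only push after the loop has finished, so @gp_name never grows while it is being iterated. This sketch is one possible restructuring (it substitutes index-based prefix tests for the original regex match, which is an assumption on my part):

```perl
use strict;
use warnings;

my @gp_name;

# Returns 1 if $key duplicated (or replaced) an existing name,
# 0 if it was genuinely new and has been appended to @gp_name.
sub classify {
    my ($key) = @_;
    for my $i (0 .. $#gp_name) {
        if (index($key, $gp_name[$i]) == 0 || index($gp_name[$i], $key) == 0) {
            # keep the longer spelling as the canonical one
            $gp_name[$i] = $key if length $key > length $gp_name[$i];
            return 1;
        }
    }
    push @gp_name, $key;    # push only after the whole scan
    return 0;
}

for my $key ('Group One', 'Group Oneffdfadsf', 'Group Two') {
    print classify($key) ? "DUP: $key\n" : "NEW: $key\n";
}
```

Because the push happens exactly once per new key, after the scan, $i can never exceed the number of distinct names seen so far.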

        I suggested a lookup table and envisaged something like:

        my %group_table = (
            'On'   => 'One',
            'Two'  => 'Two',
            'Twel' => 'Twelve',
            'Twen' => 'Twenty',
            ...
        );

        So, if "Group " is common to all keys, strip that off. Then take increasingly larger substrings from what's left until you get a match. Include some limit so when you've tried X characters and still found no match, give up and put that item in a separate "bucket" for manual intervention.
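        That increasing-substring idea might be sketched like this (the %group_table entries, the $MAX_TRY limit of 6, and the canonical_group name are all illustrative assumptions, not a finished implementation):

```perl
use strict;
use warnings;

my %group_table = (
    'On'   => 'One',
    'Two'  => 'Two',
    'Twel' => 'Twelve',
    'Twen' => 'Twenty',
);

my $MAX_TRY = 6;    # give up after this many characters

# Return the canonical group name, or undef for the manual bucket.
sub canonical_group {
    my ($name) = @_;
    (my $rest = $name) =~ s/^Group\s*//;    # strip the common prefix
    for my $len (1 .. $MAX_TRY) {
        my $try = substr $rest, 0, $len;
        return "Group $group_table{$try}" if exists $group_table{$try};
    }
    return;    # no match: route to the manual-intervention bucket
}

for my $name ('Group Onne', 'GroupOneeroneouskunk', 'Group Zebra') {
    my $canon = canonical_group($name);
    print defined $canon ? "$name -> $canon\n" : "$name -> (manual bucket)\n";
}
```

A side benefit of stripping with s/^Group\s*// is that run-together names like "GroupOneeroneouskunk" are handled too, since \s* also matches zero spaces.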

        -- Ken