So my thought was to use an array to store the keys, check whether each new key matches a stored one in part or in whole, and then replace the partial key with the longest string. That avoids cutting at an arbitrary substring, so "group tw" will not replace both "group twelve" and "group two".
It's a good thought, I think, but there is a flaw in my logic: the loop just keeps repeating and growing. (I'm dealing with about 5k lines, and I've seen $i hit over 20,000.)
Here's what I cooked up:
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
my @gp_name = "INITIALIZING";

open my $FHIN,  '<', $ARGV[0] or die $!;
open my $FHNEW, '>', "$ARGV[0].newest2.csv" or die $!;
open my $FHDEL, '>', "$ARGV[0].deleted2.csv" or die $!;

LABEL:
foreach my $line (<$FHIN>){
    $line =~ s/["]//msx;
    my ($key, $rest) = split/,/, $line, 2;
    $key =~ s/ [-&_+'] / /msx;
    $key =~ s/ ( [a-z] ) ( [A-Z] )/$1 $2/msx;

    for (my $i = 0; $i < @gp_name; $i++) {
        print "enter my \$i = $i\n ";
        if ($key =~ $gp_name[$i]){
            if (($gp_name[$i] cmp $key) > 0
                || ($gp_name[$i] cmp $key) == 0) {
                next LABEL;
            } elsif (($gp_name[$i] cmp $key) < 0) {
                $gp_name[$i] = $key;
            }
        } else {
            push @gp_name, $key;
        }
    }#end for

    ($seen{$key}++) ?
        print $FHDEL "DUP, $line" :
        print $FHNEW "$key,$rest";
}# end foreach

close $FHNEW, $FHDEL;
Your inner loop is bounded by the number of elements in @gp_name, but inside that same loop you lengthen the array with push @gp_name, $key; every time an element fails to match. The array, and with it the loop, keeps growing.
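One way to keep the array from growing is to decide about the push only after the whole array has been scanned. This is a rough sketch rather than a drop-in fix: remember_key is a made-up helper name, the sample keys are invented, and it uses index() for plain substring tests instead of treating the stored name as a regex, which is an assumption about what you intended.

```perl
use strict;
use warnings;

my @gp_name;

# Hypothetical helper: store $key, or fold it into a related stored name.
# Returns the index where the name now lives.
sub remember_key {
    my ($key) = @_;
    for my $i (0 .. $#gp_name) {
        my $known = $gp_name[$i];

        # The stored name already contains $key: nothing to add.
        return $i if index($known, $key) >= 0;

        # $key contains the stored name: keep the longer string.
        if (index($key, $known) >= 0) {
            $gp_name[$i] = $key;
            return $i;
        }
    }

    # No element matched, so add the key exactly once.
    push @gp_name, $key;
    return $#gp_name;
}

remember_key($_)
    for 'group tw', 'group twelve', 'group two', 'group twelve', 'group one';

print scalar(@gp_name), " names: ", join(', ', @gp_name), "\n";
```

Note that a short key such as "group tw" is still absorbed by whichever longer name arrives first, so this only stops the runaway growth; it does not resolve ambiguous prefixes.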
I suggested a lookup table and envisaged something like:
my %group_table = (
    'On'   => 'One',
    'Two'  => 'Two',
    'Twel' => 'Twelve',
    'Twen' => 'Twenty',
    ...
);
So, if "Group " is common to all keys, strip that off. Then take increasingly larger substrings from what's left until you get a match. Include some limit so when you've tried X characters and still found no match, give up and put that item in a separate "bucket" for manual intervention.
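The steps just described might look something like the sketch below. It is only an illustration under stated assumptions: the table holds just the four entries shown above, canonical_group and $MAX_PREFIX are invented names, the cut-off of 6 characters is arbitrary, the sample keys are made up, and matching is case-sensitive.

```perl
use strict;
use warnings;

# Assumed table: shortest unambiguous prefix => canonical name.
my %group_table = (
    'On'   => 'One',
    'Two'  => 'Two',
    'Twel' => 'Twelve',
    'Twen' => 'Twenty',
);

my $MAX_PREFIX = 6;    # assumed limit: give up after this many characters

# Hypothetical helper: return the canonical name, or undef if no prefix matches.
sub canonical_group {
    my ($key) = @_;

    # Strip the common leading "Group " before looking anything up.
    (my $tail = $key) =~ s/^Group\s+//;

    my $limit = length($tail) < $MAX_PREFIX ? length($tail) : $MAX_PREFIX;

    # Try increasingly longer prefixes until one is in the table.
    for my $len (1 .. $limit) {
        my $prefix = substr($tail, 0, $len);
        return "Group $group_table{$prefix}" if exists $group_table{$prefix};
    }
    return;    # no match within the limit
}

my (@resolved, @manual);
for my $key ('Group Twel', 'Group Two', 'Group Twenty', 'Group Tw', 'Group Xyz') {
    my $name = canonical_group($key);
    if (defined $name) { push @resolved, $name }
    else               { push @manual,   $key  }    # bucket for manual review
}

print "resolved: ", join(', ', @resolved), "\n";
print "manual:   ", join(', ', @manual),   "\n";
```

An ambiguous key like "Group Tw" never matches a table entry, so it lands in the manual bucket instead of being guessed at.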