I sense there is a simpler way...

by HelgeG (Scribe)
on Aug 19, 2004 at 16:05 UTC

HelgeG has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a script that does what it should and runs quickly, yet I have a feeling that I could accomplish what I want more quickly. Bear with me if this seems too simplistic; I have only been using Perl for a couple of weeks, and I have only just started on the path to enlightenment.

I have a large text file that contains several fields in a fixed format. Among the fields are a unique numerical ID and a text ID that should ideally be unique. The text ID also should not start with a capital letter, but sometimes does.

My script reads the file, detects IDs that start with a capital letter, and also detects if any of the textual IDs are duplicated. The numerical IDs are always unique.

I find duplicates by storing a count in a hash where the text id is the key. After I have filled the hash, I then traverse it to find values higher than one. If such a value is found, I run through the entire file again to find the numerical IDs of the duplicates.

This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:

#!perl -w
use strict;

my (@lines, %dup, $id, $key);
chomp(@lines = <STDIN>);

# read file, put count of key in hash
print "Keys with initial caps:\n";
foreach (@lines) {
    if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
        ($id, $key) = ($1, $2);
        $dup{$key} += 1;
        # check if any key has init caps (not allowed)
        if ($key =~ m/^[A-Z]\w*/) {
            print "Id: $id - $key\n";
        }
    }
}

# check hash for duplicates, if found, display positions in file
print "Duplicated keys:\n";
foreach $key (keys %dup) {
    if ($dup{$key} > 1) {
        print "$key ($dup{$key})\n";
        foreach (@lines) {
            if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+($key),/) {
                print "Id: $1\n";
            }
        }
    }
}
A typical line in the data file looks like this:

PROBABLECAUSE(0, probablecauseUndefined, undefined, Unknown, indeterminate, prim, false, "", UNIDENTIFIED, Y)
The values I look at are the first and the third value in the argument list.
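To make that concrete, here is a rough sketch (not my actual script; the variable names are only for illustration) of the two values I mean, pulled out of the sample line above:

my $line = 'PROBABLECAUSE(0, probablecauseUndefined, undefined, Unknown, indeterminate, prim, false, "", UNIDENTIFIED, Y)';

# strip the PROBABLECAUSE(...) wrapper, then split the argument list on commas
if ( $line =~ /^PROBABLECAUSE\w*\((.*)\)\s*$/ ) {
    my @fields = split /,\s*/, $1;
    my ( $id, $key ) = @fields[ 0, 2 ];            # first and third argument
    print "numerical ID: $id, text ID: $key\n";    # prints: numerical ID: 0, text ID: undefined
}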

Thank you for your time.

update

Using the tips I received, the solution is now cleaner and more elegant, and as an added bonus, I have learned about perl references. Thank you, monks!

Replies are listed 'Best First'.
Re: I sense there is a simpler way...
by Random_Walk (Prior) on Aug 19, 2004 at 16:54 UTC
    Helge,

    Here is a beginning; this only parses the input data once. It could be more efficient still: the regexp could possibly be turned into a split on /,\s*|\(/ followed by a test on the first part of the split. I'm not sure what sort of data you are trying not to match, but possibly something like this:

    my ($test, $id, $key) = split /,\s*|\(/, $_, 4;
    next unless $test =~ /PROBABLECAUSE\w*/;
    You may want to normalise keys starting with a cap so they match duplicates without one. I put the processing in the loop reading STDIN so I could test it easily by bashing a few lines in by hand. It may also be a little more efficient, as the script starts working as soon as it has its first line rather than waiting until everything is in.
    #!/use/your/bin/perl -w
    use strict;

    my ($line_count, @lines, %dup, @capkeys) = 0;

    while (<STDIN>) {
        if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
            my ($id, $key) = ($1, $2);

            # check if any key has init caps (not allowed)
            if ($key =~ m/^[A-Z]\w*/) {
                push @capkeys, "$line_count: $id - $key\n";
                # you may want to fix up caps here before you store
                # the key so Keysomething matches keysomething
            }

            if (exists $dup{$key}) {
                print "Duplicates found\n";
                print "$dup{$key}->[0]: $dup{$key}->[1]\n";
                print "$line_count: $_\n";
            }
            else {
                $dup{$key} = [$line_count, $_];    # store line
            }

            $line_count++;
        }
    }

    print "keys with initial caps\n" if @capkeys;
    foreach (@capkeys) { print }

    update
    Got my split suggestion a little wrong; I miscounted the fields. It should be:
    my ($test, $id, undef, $key) = split /,\s*|\(/, $_, 5;
    next unless $test =~ /PROBABLECAUSE\w*/;
    In the initial caps test, is the \w* really required, or would if ($key =~ m/^[A-Z]/) be OK? What if the entire key is in upper case, or will that never happen?
    Cleaner code, now taking liberties!
    Let's just fix those upper-cased keys and not trouble the poor users...
    #!/your/perl -w
    use strict;

    my ($line_count, %dup) = 0;

    while (<STDIN>) {
        my ($test, $id, undef, $key) = split /,\s*|\(/, $_, 5;
        next unless $test =~ /PROBABLECAUSE\w*/;

        # If I may be so bold...
        $key = lc $key;

        if (exists $dup{$key}) {
            print "Duplicates found\n";
            print "$dup{$key}->[0]: $dup{$key}->[1]";
            print "$line_count: $_";
        }
        else {
            $dup{$key} = [$line_count, $_];    # store No and line
        }

        $line_count++;
    }
Re: I sense there is a simpler way...
by calin (Deacon) on Aug 19, 2004 at 17:06 UTC

    I think your two-pass approach is fine in principle. Because you can't know in advance whether a record has duplicates, a single-pass approach has to keep the ID of every record in memory just in case. Whether that is feasible depends on the expected size of the file.

    A suggestion for a single-pass approach would be to make $dup{key} an array-ref:

    while (<STDIN>) {
        chomp;
        if (m/PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/) {
            ($id, $key) = ($1, $2);
            push @{ $dup{$key} }, $id;    # this is modified!
            # check if any key has init caps (not allowed)
            if ($key =~ m/^[A-Z]\w*/) {
                print "Id: $id - $key\n";
            }
        }
    }

    print "Duplicated keys:\n";
    for my $key (keys %dup) {
        my ($ids, $count) = map { $_, scalar @$_ } $dup{$key};
        next unless $count > 1;
        print "$key ($count)\n";
        print "Id: $_\n" for @$ids;
    }

    Update: I failed to see that you read the whole file into @lines to begin with. I have modified the code to avoid this. My comment about single-pass vs. two-pass becomes a bit irrelevant in this new light.

    More

    This means that I first go through the file once to detect duplicates, and then go through the file again once for each duplicate found. I can't help but think that there is a more elegant and efficient way of doing things. My code is shown below:

    This confused me at first, because I didn't read your code carefully. Actually, in your original code, you don't go through the file twice (in I/O terms). You actually read the whole file line by line into an array, then loop over that array twice, populating a hash in the first pass. My solution also goes through the file only once (in a while loop), populating a deep data structure (hash of arrays), then, in a second loop, it goes over the elements of that hash printing those with more than one ID.

    As for writing the whole program in a single loop, that's not possible, because you essentially have to do a group-by: all IDs sharing a key must be collected before they can be reported together. Random_Walk above cheats by assuming there can be at most a single duplicate for any given textual key.

      Random Walk and calin, thank you both for your replies. Creating a deep data structure (hash of arrays) was the step that eluded me.

      Is gobbling an entire file into an array considered bad form? My datafile is roughly half a megabyte in size, so I figured memory was not an issue. I can see, however, that reading the file line by line makes for more scalable code.

      What was bugging me about my own code was that I in fact had an N+1 pass approach, where N was the number of duplicated keys. I was reading the file once, and then cycling several times over the array.

      calin, you are right about the fact that there can be more than a single duplicate for any textual field, so the code needs to account for this.

      Again, thanks for the time the both of you spent looking at my code, it is much appreciated!

        Is gobbling an entire file into an array considered bad form? . . .

        One should always be aware of the efficiency concern. If you're sure the file will never be "too big", slurping (as it's called) shouldn't be a problem. Otherwise, you'd do well to do per-record reading and processing, where practical.
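        For instance, the two styles look roughly like this (just a sketch; the file name and the process() routine are placeholders, not anything from the code in this thread):

        # slurping: the whole file sits in memory at once
        open my $fh, '<', 'data.txt' or die "open: $!";
        my @lines = <$fh>;
        process($_) for @lines;

        # per-record: one line in memory at a time, scales to big files
        open my $in, '<', 'data.txt' or die "open: $!";
        while ( my $line = <$in> ) {
            process($line);
        }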

        Calin's solution is good. If you want a little extra efficiency, you can buy it with memory, i.e. data structures. In the solution below, we maintain a separate hash for those keys which are known to be duplicates. Then, at the end, we iterate only over that hash. This has a pay-off if the number of duplicate keys is significantly smaller than the total number of keys.

        my ( %keys, %dup );

        while (<STDIN>) {
            chomp;
            if ( /PROBABLECAUSE\w*\((\d+),\s*\w*,\s+(\w*)/ ) {
                my ( $id, $key ) = ( $1, $2 );

                if ( exists $dup{$key} ) {        # already found to be a dup
                    push @{ $dup{$key} }, $id;
                }
                elsif ( exists $keys{$key} ) {    # only seen once before
                    push @{ $dup{$key} }, delete( $keys{$key} ), $id;
                }
                else {                            # first time seen
                    $keys{$key} = $id;
                }

                # check if any key has init caps (not allowed)
                if ( $key =~ /^[A-Z]\w*/ ) {
                    print "Id: $id - $key\n";
                }
            }
        }

        print "\nDuplicated keys:\n\n";
        for my $key ( keys %dup ) {
            print "Key: $key\n";
            print "\tId: $_\n" for @{ $dup{$key} };
        }
        (Not tested)
