Re: Counting instances of a string in certain sections of files within a hash

"Any guidance would be very much appreciated."

I'm going to assume that your "corpus" is the same as, or similar to, the data we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":

I said:

"... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."

You replied:

"... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."

In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
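
To give a feel for the size of that change, here is a minimal sketch. The sub names get_corpus_hash and get_corpus_ref and the sample data are invented for the illustration (they are not your &getCorpus); only the return statement differs between the two subs, but every caller then has to dereference.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-ins for &getCorpus; the data is invented.
sub get_corpus_hash {
    my %corpus = (fileA => 'soft', fileB => 'hard');
    return %corpus;     # returns a flattened list: the caller copies every pair
}

sub get_corpus_ref {
    my %corpus = (fileA => 'soft', fileB => 'hard');
    return \%corpus;    # returns a single scalar: a reference to the hash
}

# Caller code before the change:
my %mycorpus = get_corpus_hash();
print "$_: $mycorpus{$_}\n" for sort keys %mycorpus;

# Caller code after the change: dereference with %{ ... } and ->
my $corpus = get_corpus_ref();
print "$_: $corpus->{$_}\n" for sort keys %{$corpus};
```

Both loops print the same two lines; the sort is there only to make the printed order predictable.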

Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:

You have "foreach my $filename (sort keys %mycorpus) { ... }".
- Your code indicates you have no interest in the order of the keys: remove sort.
- Your code indicates you have no interest in the keys themselves. You only use them to access the values ($mycorpus{$filename}) in a couple of places. Skip over all that unnecessary processing by changing keys to values: now you can iterate just the data you want to work with.
You have if conditions using regexes with the 'g' modifier. In this particular context, that modifier is pointless: remove it. See "perlre: Modifiers" for more about that.
You have made liberal use of lookaround assertions (see "perlre: Lookaround Assertions"). These are all extra work for the regex engine and none are actually required here. See the regexes in my example script below, which don't use these assertions.
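
As a rough sketch of the first two suggestions (the hash contents are invented for the illustration): iterating over values avoids both the sort and the key lookups, and dropping 'g' also avoids a side effect. In scalar context, a successful /g match remembers its position, so a second /g test against the same string resumes from where the first one stopped.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Invented sample data standing in for the real corpus.
my %mycorpus = (fileA => 'soft hard', fileB => 'hard');

# Iterate over the values directly: no sort, no key lookups.
my $files_with_soft = 0;
for my $text (values %mycorpus) {
    $files_with_soft++ if $text =~ /soft/;     # plain match, no 'g'
}
print "files containing 'soft': $files_with_soft\n";

# Side effect of 'g' in an if condition: the match position is remembered.
my $text = 'soft hard';
my @with_g;
push @with_g, 'first'  if $text =~ /soft/g;    # matches; pos($text) is now 4
push @with_g, 'second' if $text =~ /soft/g;    # resumes at pos 4: no match
print "with g: @with_g\n";

my @without_g;
push @without_g, 'first'  if $text =~ /soft/;  # matches from the start
push @without_g, 'second' if $text =~ /soft/;  # matches from the start again
print "without g: @without_g\n";
```

With 'g', only the first test succeeds; without it, both do.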

There's another issue of data validation. You have, in multiple places, code like:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}

Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream: possibly difficult to debug.

If &getCorpus performs validation and guarantees the data it returns, you could just write:

my ($var) = $string =~ /$capture_regex/;

If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:

next unless $string =~ /$capture_regex/;
my $var = $1;

You may want to validate, report issues, then skip the remainder of the current iteration:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}
else {
    # ... issue warning, make a log entry, or whatever ...
    next;
}
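
Put together as a runnable sketch, the validate, report and skip pattern might look like the following. The $capture_regex and the sample strings here are hypothetical, chosen only so the loop has one invalid record to skip.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical regex and data, for illustration only.
my $capture_regex = qr/datetime=(\d{4}-\d{2}-\d{2})/;
my @strings = (
    'datetime=2017-09-01 ok',
    'no date here',
    'datetime=2017-09-02 ok',
);

my (@dates, @skipped);
for my $string (@strings) {
    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }
    else {
        # Record the problem, then skip the rest of this iteration.
        push @skipped, $string;
        warn "Skipping invalid record: '$string'\n";
        next;
    }
    # Only reached when $var holds a captured value.
    push @dates, $var;
}

print "dates: @dates\n";
print 'skipped: ', scalar(@skipped), "\n";
```

The second string produces a warning on STDERR and never reaches the code that uses $var.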

"... finds the word soft (or softest, softer etc.) ..."

Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":

$ grep soft /usr/share/dict/words | wc -l
30
$ grep ^soft /usr/share/dict/words | wc -l
22

Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:

/ \b (?: soft | softer | softest ) \b /ix
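
To see the difference between the two approaches, here is a small comparison sketch. The word list is invented for the test, and the variable names are mine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $prefix_re    = qr/\bsoft/i;
my $whitelist_re = qr/ \b (?: soft | softer | softest ) \b /ix;

# Invented word list for the comparison.
my @words = qw(soft Softer softest softball softly semisoft);

my @prefix_hits    = grep { /$prefix_re/ }    @words;
my @whitelist_hits = grep { /$whitelist_re/ } @words;

print "prefix:    @prefix_hits\n";
print "whitelist: @whitelist_hits\n";
```

The prefix regex accepts everything except "semisoft", including "softball" and "softly"; the whitelist accepts only "soft", "Softer" and "softest".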

Here's a minimal script that covers all the points I've raised.

#!/usr/bin/env perl

use strict;
use warnings;

my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
my $soft_re = qr{(?ix: \b ( soft ) )};

my %count_for_date;

for (values %{ get_corpus() }) {
    my ($date) = /$date_re/;
    my ($want) = /$want_re/;
    $count_for_date{$date}++ while $want =~ /$soft_re/g;
}

# For testing only:
use Data::Dump;
dd \%count_for_date;

sub get_corpus {
    my %corpus = (
        fileA => "<time datetime=2017-09-01...
            soft soft
            soft
            ##... soft soft ... soft ...##
            hard
            soft
        ",
        fileB => "<time datetime=2017-09-01...
            soft
            ##... hard ...##
            soft
        ",
        fileC => "<time datetime=2017-09-01...
            ##... softball ...##
        ",
        fileD => "<time datetime=2017-09-02...
            ##... semisoft ...##
        ",
        fileE => "<time datetime=2017-09-03...
            ##... soft softer softest softly soften softner ...##
        ",
        fileF => "<time datetime=2017-09-04...
            ##
                soft softer softest 
                Soft Softer Softest 
                softly soften softner 
                Softly Soften Softner 
            ##
        ",
    );

    return \%corpus;
}

Output:

{ "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }

If you also want dates with zero matches, you can add a line like this after you capture $date:

$count_for_date{$date} ||= 0;

The output then becomes:

{ "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }

— Ken

Re^2: Counting instances of a string in certain sections of files within a hash
by Maire (Scribe) on Nov 01, 2017 at 10:26 UTC

Wow, thank you so much for all of this and especially for your very clear explanations of the changes/improvements that you have made to the original script.

Your assumptions that the data is the same as that from our previous discussion and also that I will be performing numerous tasks using this data are correct. I've now implemented the "return \%corpus;" change across the scripts that use that function, thanks!

Unfortunately, I haven't yet been able to get this improved script to produce any output when I change the hash from the example one to the hash created by my getCorpus function. However, as I speculated in my reply to hippo above, I suspect that this may be something to do with the function itself. I'm going to run a few tests on small folders of data, to see if I can spot the exact problem.

Thanks again for all of this!

UPDATE: Ah, the script wasn't producing any output with my function because I was using incorrect syntax to print! I've now got output but, unfortunately, the count values are still lower than they should be. Interestingly, however, they are reporting the same frequencies as the version of this script that hippo helped me with:

{ "2017-09-04" => 1 }
not ok 1 - 2017-09-04 tally correct
#   Failed test '2017-09-04 tally correct'
#   at C:\Users\lisad\PhD\perl\test11.pl line 62.
#          got: '1'
#     expected: '3'
not ok 2 - 2017-09-30 tally correct
#   Failed test '2017-09-30 tally correct'
#   at C:\Users\lisad\PhD\perl\test11.pl line 63.
#          got: undef
#     expected: '2'
# Looks like you failed 2 tests of 2.