PerlMonks
Re: Counting instances of a string in certain sections of files within a hash
by kcott (Archbishop) on Oct 31, 2017 at 21:33 UTC
G'day Maire,

"Any guidance would be very much appreciated."

I'm going to assume that your "corpus" is the same as, or similar to, the data we discussed in "Re: Storing output of a subroutine into an hash and then printing hash".
In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data (I'm getting that impression but don't really know), multiply "a lot more work" by "number of utilities".

Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:
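As a rough sketch of the hashref change: the getCorpus below is a hypothetical stand-in, with invented file names and contents, and only the "return \%corpus;" line reflects the change described above. Returning the reference avoids flattening and copying the whole hash on every call; callers then dereference it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for getCorpus(): the keys and values are
# invented for illustration and are not the real corpus.
sub getCorpus {
    my %corpus = (
        'file_2017-10-30.txt' => 'title: pillows text: a soft, softer pillow',
        'file_2017-10-31.txt' => 'title: rocks text: nothing relevant here',
    );
    return \%corpus;    # a reference: the hash itself is not copied
}

my $corpus = getCorpus();

# Callers use "%$corpus" and "$corpus->{...}" instead of "%corpus"
# and "$corpus{...}".
for my $file (sort keys %$corpus) {
    printf "%s: %d characters\n", $file, length $corpus->{$file};
}
```

Every utility that calls getCorpus needs the same dereferencing change, which is why it is cheaper to switch before more of them exist.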
There's another issue of data validation. You have, in multiple places, code like:
Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream: possibly difficult to debug. If &getCorpus performs validation and guarantees the data it returns, you could just write:
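The original snippets are not reproduced here, so the following is only an illustration of the two forms being contrasted. The file name and the date regex are assumptions, not taken from your code.

```perl
use strict;
use warnings;

# Hypothetical key: the real corpus keys may look different.
my $file = 'file_2017-10-31.txt';

# The pattern being warned about: $date stays undef if the match
# fails, yet whatever follows the "if" runs regardless.
my $date;
if ($file =~ /(\d{4}-\d\d-\d\d)/) {
    $date = $1;
}

# If the data is guaranteed to be valid, a list-context match
# assigns the capture directly, with no separate declaration.
my ($valid_date) = $file =~ /(\d{4}-\d\d-\d\d)/;

print "$valid_date\n";
```

The parentheses around $valid_date matter: they put the match in list context, so the captured text is assigned rather than a true/false result.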
If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:
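One possible shape for that, again with assumed file names and an assumed date regex: "next unless" abandons the current iteration as soon as the match fails, so $1 is only ever read after a successful match.

```perl
use strict;
use warnings;

# Hypothetical keys; the second has no date and is skipped silently.
my @files = ('file_2017-10-30.txt', 'notes.txt', 'file_2017-10-31.txt');

my @dates;
for my $file (@files) {
    # Skip the rest of this iteration when there is nothing to capture.
    next unless $file =~ /(\d{4}-\d\d-\d\d)/;
    push @dates, $1;
}

print "$_\n" for @dates;
```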
You may want to validate, report issues, then skip the remainder of the current iteration:
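A sketch of the validate, report, skip variant, under the same assumptions (invented file names, assumed date regex). The warning goes to STDERR, so it does not mix with the normal output.

```perl
use strict;
use warnings;

# Hypothetical keys; "notes.txt" has no date in its name.
my @files = ('file_2017-10-30.txt', 'notes.txt');

my @dates;
for my $file (@files) {
    my ($date) = $file =~ /(\d{4}-\d\d-\d\d)/;

    unless (defined $date) {
        # Report the problem, then move on to the next file.
        warn "Skipping '$file': no date found in name\n";
        next;
    }

    push @dates, $date;
}

print "$_\n" for @dates;
```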
"... finds the word soft (or softest, softer etc.) ..." Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":
Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:
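For instance, a whitelist could be turned into a single compiled regex as sketched here. The word list and the sample sentence are made up; adjust the list to whatever you actually want to count.

```perl
use strict;
use warnings;

# Hypothetical whitelist: only these exact words count as matches.
my @wanted = qw{soft softer softest softly};

# Build one alternation, anchored by word boundaries on both sides.
my $alt = join '|', map { quotemeta($_) } @wanted;
my $re  = qr{\b(?:$alt)\b}i;

my $text  = 'A soft pillow, the softest semisoft cheese, and a softball.';
my @found = $text =~ /$re/g;

print "@found\n";
```

With this sample text, only "soft" and "softest" are found: "semisoft" fails the leading \b and "softball" fails the trailing one. A bare /\bsoft/ would have matched "softball" as well.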
Here's a minimal script that covers all the points I've raised.
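The script itself is not reproduced here. The following is a stand-in sketch, not the original, that pulls the same points together: a hashref, validation with a skip, and a whitelist regex. The corpus layout ("title:" and "text:" lines), the file names and the date regex are all assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical corpus: names and the "title:"/"text:" layout are invented.
sub getCorpus {
    return {
        'file_2017-10-29.txt' => "title: soft toys\ntext: nothing to see here",
        'file_2017-10-30.txt' => "title: pillows\ntext: a soft pillow, softer than the softest",
        'notes.txt'           => "title: misc\ntext: soft",
    };
}

my $re     = qr{\b(?:soft|softer|softest)\b}i;
my $corpus = getCorpus();
my %count;

for my $file (sort keys %$corpus) {
    my ($date) = $file =~ /(\d{4}-\d\d-\d\d)/;

    unless (defined $date) {
        warn "Skipping '$file': no date in name\n";
        next;
    }

    # Search only what follows "text:"; skip files without that section.
    next unless $corpus->{$file} =~ /^text:(.*)/ms;
    my $text = $1;

    # One increment per whitelisted word found in the section.
    $count{$date}++ for $text =~ /$re/g;
}

print "$_: $count{$_}\n" for sort keys %count;
```

Note that the "soft" in the title of the first file is ignored, because only the text section is searched, and that dates with no matches never get an entry in %count.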
Output:
If you also wanted dates with zero matches, you can add a line like this after you capture $date:
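The exact line is not reproduced here; one way to do it, assuming Perl 5.10 or later for the "//=" operator, is to create the entry with a zero count before counting. The %text_for hash is a hypothetical shortcut holding text sections that have already been extracted.

```perl
use strict;
use warnings;

# Hypothetical per-date text sections, already extracted.
my %text_for = (
    '2017-10-29' => 'nothing to see here',
    '2017-10-30' => 'a soft pillow, softer than the softest',
);

my %count;
for my $date (sort keys %text_for) {
    # Create the entry up front so dates without matches report 0.
    $count{$date} //= 0;

    $count{$date}++ for $text_for{$date} =~ /\bsoft(?:er|est)?\b/gi;
}

print "$_: $count{$_}\n" for sort keys %count;
```

Using "//=" rather than "=" keeps any count already accumulated for a date that appears in more than one file.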
The output then becomes:
— Ken