G'day Marshall,
Thanks for the positive feedback.
I have some comments on your first three points.
Re "... fastest way to remove leading and trailing white space ...".
I've also seen the documentation about anchors, though I can't remember where; I have an inkling it may have been in a book.
The regex I used was anchored at both ends (/^\s*(.*?)\s*$/).
In terms of two easy vs. one complex regex, that's going to depend on relative complexity and the string operated on.
I wrote this benchmark:
#!/usr/bin/env perl -l
use strict;
use warnings;
use constant STRING => " \t aaa bbb ccc \t \n";
use Benchmark 'cmpthese';
print 'Sanity Tests:';
print 'shoura: >', shoura_code(), '<';
print 'kcott: >', kcott_code(), '<';
print 'marshall: >', marshall_code(), '<';
cmpthese 0 => {
S => \&shoura_code,
K => \&kcott_code,
M => \&marshall_code,
};
sub shoura_code {
local $_ = STRING;
chomp;
s/^\s+|\s+$//g;
return $_;
}
sub kcott_code {
local $_ = STRING;
($_) = /^\s*(.*?)\s*$/;
return $_;
}
sub marshall_code {
local $_ = STRING;
s/^\s+//;
s/\s+$//;
return $_;
}
I ran it five times — that's usual for me — here's the result that was closest to an average:
Sanity Tests:
shoura: >aaa bbb ccc<
kcott: >aaa bbb ccc<
marshall: >aaa bbb ccc<
      Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --
There was quite a lot of variance, although 'K' was always faster than 'M'.
The five K-M percentages were: 9, 7, 2, 14, 7.
Both 'K' and 'M' were always substantially faster than 'S'.
Re "... split your $re statement into two parts ...".
I often use the '@{[...]}' construct when interpolating the results of some processing into a string.
My main intent was to create the regex once,
instead of the (presumably) millions of times in the inner loop of the OP's code.
I also benchmarked this (see the spoiler):
it looks like your total saving would be measured in nanoseconds.
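For anyone unfamiliar with the '@{[...]}' construct, here is a minimal sketch of building the regex once, outside any loop. It is not the code from my earlier post: the %seq data is borrowed from my second benchmark, and the names @alts, $re and replace_all are made up for this illustration.

```perl
use strict;
use warnings;

# Illustrative data (the same shape as the %seq used in the second benchmark).
my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Escape each search term and put the longest first, so that 'X Y Z'
# is tried before its prefix 'X Y'.
my @alts = map { quotemeta($_) }
           sort { length $b <=> length $a || $a cmp $b }
           keys %seq;

# '@{[...]}' interpolates the result of an expression into the pattern.
# qr// compiles it once; it is then reused for every string processed.
my $re = qr/(@{[ join '|', @alts ]})/;

# Hypothetical helper: replace every search term found in $text.
sub replace_all {
    my ($text) = @_;
    $text =~ s/$re/$seq{$1}/g;
    return $text;
}

print replace_all('X Y Z and W X Y'), "\n";   # XbbbYbbbZ and WbbbXbbbY
```

Whether the join is written inline with '@{[...]}' or assigned to a separate variable first is a matter of taste; the part that matters is that the pattern is compiled once rather than on every iteration.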
Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.".
I can understand that from the minimal test data supplied by the OP;
however, the reason is probably to handle sequences with common sections.
Consider the test data I used in the second benchmark:
my %seq = (
'W X Y' => 'WbbbXbbbY',
'X Y' => 'XbbbY',
'X Y Z' => 'XbbbYbbbZ',
);
If the target string was "W X Y Z", the result could be any one of these three:
W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ
Sorting by length would reduce that to two results.
There may well be a requirement to also sort lexically. Perhaps like this:
sort { length $b <=> length $a || $a cmp $b }
But the OP has not given sufficient information.
In fact, as I write this, it's been almost two days since the original posting
and all requests for additional information have been ignored.
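To make the ordering issue concrete, here is a small sketch. It assumes the replacements are applied one search term at a time, in whatever order the keys arrive; that is only my guess at the shape of the OP's loop, and apply_in_order is a hypothetical helper, not anything from the original code.

```perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply each substitution in turn, in the order given.
sub apply_in_order {
    my ($text, @keys) = @_;
    for my $key (@keys) {
        $text =~ s/\Q$key\E/$seq{$key}/g;
    }
    return $text;
}

# Longest first; ties broken lexically, so the order is fully determined.
my @sorted = sort { length $b <=> length $a || $a cmp $b } keys %seq;

print apply_in_order('W X Y Z', @sorted), "\n";                   # WbbbXbbbY Z
print apply_in_order('W X Y Z', 'X Y', 'W X Y', 'X Y Z'), "\n";   # W XbbbY Z
print apply_in_order('W X Y Z', 'X Y Z', 'W X Y', 'X Y'), "\n";   # W XbbbYbbbZ
```

With the length-then-lexical sort, 'W X Y' is always tried first, so only the first of those three results can occur.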