Re^2: script optmization

Ken, I really like your post++.

A few very, very minor nits, some of which I show in the code below:

1. I think the fastest way to remove leading and trailing white space is the approach in the code below: two Perl statements instead of $string =~ s/^\s+|\s+$//g or your my ($trim) = /^\s*(.*?)\s*$/;. The Perl documentation discusses this somewhere in the regex docs, but a quick search didn't turn it up; otherwise I would post a link. The explanation is that the regex engine works best with fixed anchors, and that two very simple regex statements run faster than a single more complex one.
2. I split your $re statement into two parts to simplify the syntax. Creating an intermediate variable is very "cheap". I didn't benchmark, but your code creates an anon array which is then de-referenced. My code only creates a scalar, which in general will be faster.
3. I see no need at all to sort the search terms, so I didn't do that. The regex is going to match any of the 3 or'd "search phrases" no matter what their order in the regex is. Changing the order in the regex will not necessarily result in any performance change at all. The OP's requirement "for a sorted order" makes no sense to me at all.
4. I see some suggestions to use threads or other parallel processing strategies. It appears to me that this will be an I/O bound application, so such complex things won't matter at all.

Having said the above, none of these points makes a darn bit of difference in this case. I made this post because point (1) has relevance beyond this OP's question. For performance: the "setup" won't matter much because it is done once. After that, Read Line, Run Regex, Print Line is about as fast as this usually gets without complicated heroics.

Another Monk asked about the OP's purpose. Sometimes a post is just an academic question, but it sounds like there is some real application here that we don't understand. The only reason to put these "markers" into the text is for later processing. Maybe that processing, whatever it is, can be combined into a single step? That could lead to a big speed increase, since the second processing step will otherwise have to search the entire text to find the bbb markers yet again.

#!/usr/bin/env perl

use strict;
use warnings;

use Inline::Files;

my %seq; # example: 'scooped up again' => 'scoopedbbbupbbbagain',

while (my $line = <SEQ>) 
{
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;

    ($seq{$line} = $line) =~ s/\h+/bbb/g;
}   

my $search_phrases = join '|', keys %seq;
my $re = qr{($search_phrases)};

while (<TXT>) {
    s/$re/$seq{$1}/g;
    print;
}   

__SEQ__
          scooped up by
          social travesty
          without proper sanitation
__TXT__
Many of them are scooped up by chambermaids, thrown into bin bags and sent off to landfill sites, which is a disaster for the environment and a social travesty given that many people around the world are going without proper sanitation.

Re^3: script optmization
by kcott (Archbishop) on May 16, 2017 at 07:08 UTC

G'day Marshall,

Thanks for the positive feedback. I have some comments on your first three points.

Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors. I can't remember where, though I have an inkling it may have been in a book. Note that the regex I used was anchored at both ends (/^\s*(.*?)\s*$/). As for two easy regexes vs. one complex regex, that's going to depend on relative complexity and the string operated on. I wrote this benchmark:

#!/usr/bin/env perl -l

use strict;
use warnings;
use constant STRING => " \t aaa bbb ccc \t \n";

use Benchmark 'cmpthese';

print 'Sanity Tests:';
print 'shoura:    >', shoura_code(),   '<';
print 'kcott:     >', kcott_code(),    '<';
print 'marshall:  >', marshall_code(), '<';

cmpthese 0 => {
    S => \&shoura_code,
    K => \&kcott_code,
    M => \&marshall_code,
};

sub shoura_code {
    local $_ = STRING;

    chomp;
    s/^\s+|\s+$//g;

    return $_;
}

sub kcott_code {
    local $_ = STRING;

    ($_) = /^\s*(.*?)\s*$/;

    return $_;
}

sub marshall_code {
    local $_ = STRING;

    s/^\s+//;
    s/\s+$//;

    return $_;
}

I ran it five times (that's usual for me); here's the result that was closest to an average:

Sanity Tests:
shoura:    >aaa bbb ccc<
kcott:     >aaa bbb ccc<
marshall:  >aaa bbb ccc<
      Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --

There was quite a lot of variance, although 'K' was always faster than 'M'. The five percentages by which 'K' beat 'M' were: 9, 7, 2, 14, 7. Both 'K' and 'M' were always substantially faster than 'S'.

Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
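For anyone unfamiliar with the construct, here is a minimal sketch of the two styles side by side. The hash contents and the variable names ($re_inline, $re_split, $alternation) are made up for illustration; this is not the code from either post, only the general idiom.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Illustrative data only (assumed, not taken from the OP's input).
my %seq = (
    'scooped up by'   => 'scoopedbbbupbbbby',
    'social travesty' => 'socialbbbtravesty',
);

# Sort once so both patterns are built from the same ordering.
my @phrases = sort keys %seq;

# Style 1: '@{[ ... ]}' builds an anonymous array holding the joined
# string and dereferences it so that it interpolates into the pattern.
my $re_inline = qr{(@{[ join '|', @phrases ]})};

# Style 2: an intermediate scalar, then a plain interpolation.
my $alternation = join '|', @phrases;
my $re_split    = qr{($alternation)};

# Both compile to the same pattern text.
print "Same pattern: ", ("$re_inline" eq "$re_split" ? 'yes' : 'no'), "\n";
```

Either way the regex is compiled once, outside the read loop, which is the part that matters for performance.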

Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

If the target string was "W X Y Z", the result could be one of these three:

W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ

Sorting by length would reduce that to two results. There may well be a requirement to also sort lexically. Perhaps like this:

sort { length $b <=> length $a || $a cmp $b }

But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.
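As a rough sketch of how order can matter when the phrases go into a single alternation: Perl tries the alternatives left to right at each position and takes the first one that matches, so a shorter phrase listed first can shadow a longer one that starts at the same place. The helpers build_re() and mark() below are hypothetical names introduced just for this illustration, and the quotemeta call is an extra precaution that is not in either posted script.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: build one capturing alternation from the
# phrases, in exactly the order given.
sub build_re {
    my @phrases = @_;
    my $alternation = join '|', map { quotemeta $_ } @phrases;
    return qr{($alternation)};
}

# Hypothetical helper: replace every matched phrase with its marked form.
sub mark {
    my ($re, $text) = @_;
    $text =~ s/$re/$seq{$1}/g;
    return $text;
}

# Shortest phrase deliberately listed first.
my $short_first = build_re('X Y', 'X Y Z', 'W X Y');

# Longest first, then lexical, using the sort shown above.
my $long_first = build_re(
    sort { length $b <=> length $a || $a cmp $b } keys %seq
);

for my $text ('X Y Z', 'W X Y Z') {
    print "$text\n";
    print "  short first: ", mark($short_first, $text), "\n";
    print "  long first:  ", mark($long_first,  $text), "\n";
}
```

With "X Y Z" the short-first regex produces "XbbbY Z" while the long-first regex produces "XbbbYbbbZ". With "W X Y Z" both produce "WbbbXbbbY Z", because the match starting at the leftmost position wins regardless of the order of the alternatives. The other results listed above would need a different approach, such as a separate substitution per phrase.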

— Ken
