PerlMonks  

Re: script optmization

by kcott (Archbishop)
on May 15, 2017 at 03:53 UTC ( [id://1190286]=note )


in reply to script optmization

G'day shoura,

Here are some techniques for removing redundant and duplicated processing.

  • There's a single pass through the SEQ file. Trimming leading and trailing whitespace, as well as the line terminator, uses a single regex capture. The replacement text, for later use, is calculated once here.
  • The regex for use when processing the TXT file is created once only.
  • There's a single pass through the TXT file; and there's no nesting. The substitution, using values already calculated, is the only processing done here.
  • I've used Inline::Files as the actual opening and closing of your files doesn't appear to need optimising. Having said that, I would urge you to use lexical filehandles throughout, and the 3-argument form of open.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Inline::Files;

    my %seq;

    while (<SEQ>) {
        my ($trim) = /^\s*(.*?)\s*$/;
        ($seq{$trim} = $trim) =~ s/\h+/bbb/g;
    }

    my $re = qr{(@{[join '|', sort { length $b <=> length $a } keys %seq]})};

    while (<TXT>) {
        s/$re/$seq{$1}/g;
        print;
    }

    __SEQ__
    scooped up by
    social travesty
    without proper sanitation
    __TXT__
    Many of them are scooped up by chambermaids, thrown into bin bags and sent off to landfill sites, which is a disaster for the environment and a social travesty given that many people around the world are going without proper sanitation.

Output:

    Many of them are scoopedbbbupbbbby chambermaids, thrown into bin bags and sent off to landfill sites, which is a disaster for the environment and a socialbbbtravesty given that many people around the world are going withoutbbbproperbbbsanitation.
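
The advice about lexical filehandles and the 3-argument form of open could look something like the sketch below. It is not the original script: the sub name mark_phrases is hypothetical, and in-memory filehandles (opened on references to strings) stand in for the real SEQ and TXT files so that the example is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: reads phrases from $seq_fh, then returns the text
# read from $txt_fh with the whitespace inside each phrase replaced by 'bbb'.
sub mark_phrases {
    my ($seq_fh, $txt_fh) = @_;

    # One pass over the phrases: trim, then precompute the replacement.
    my %seq;
    while (my $line = <$seq_fh>) {
        my ($trim) = $line =~ /^\s*(.*?)\s*$/;
        ($seq{$trim} = $trim) =~ s/\h+/bbb/g;
    }

    # Build the alternation once, longest phrases first.
    my $alternation = join '|', sort { length $b <=> length $a } keys %seq;
    my $re = qr{($alternation)};

    # One pass over the text, no nesting.
    my $out = '';
    while (my $line = <$txt_fh>) {
        $line =~ s/$re/$seq{$1}/g;
        $out .= $line;
    }
    return $out;
}

# Made-up sample data; a real script would pass file names to open
# instead of references to in-memory strings.
my $seq_data = "scooped up by\nsocial travesty\n";
my $txt_data = "Many are scooped up by chambermaids; a social travesty.\n";

open my $seq_fh, '<', \$seq_data or die "Cannot open SEQ data: $!";
open my $txt_fh, '<', \$txt_data or die "Cannot open TXT data: $!";

print mark_phrases($seq_fh, $txt_fh);

close $seq_fh;
close $txt_fh;
```

Because the handles are lexical variables, they can be passed to a sub like any other scalar and are closed automatically when they go out of scope.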

— Ken

Re^2: script optmization
by Marshall (Canon) on May 15, 2017 at 23:21 UTC
    Ken, I really like your post++.

    A couple of very, very minor nits, which I address in the code below:

    1. I think the fastest way to remove leading and trailing white space is the one shown in the code below: two Perl statements instead of $string =~ s/^\s+|\s+$//g or your my ($trim) = /^\s*(.*?)\s*$/;. The Perl documentation talks about this somewhere in the regex docs, but a quick search didn't turn it up; otherwise I would post a link. Anyway, the explanation goes that the regex engine works best with fixed anchors, and that two very easy regex statements run faster than a single more complex one.
    2. I split your $re statement into two parts to simplify the syntax. Creating an intermediate variable is very "cheap". I didn't benchmark, but your code creates an anon array which is then de-referenced. My code only creates a scalar, which in general will be faster.
    3. I see no need at all to sort the search terms, so I didn't do that. The regex is going to match any of the 3 or'd "search phrases" no matter what the order in the regex is. Changing the order in the regex will not necessarily result in any performance change at all. The OP's requirement "for a sorted order" makes no sense to me at all.
    4. I see some suggestion to use threads or other parallel processing strategies. It appears to me that this will be an I/O bound application and such complex things won't matter at all.

    Having said the above, none of these points makes a darn bit of difference in this case. I made this post because point (1) has relevance beyond this OP's question. For performance: the "setup" won't matter much because it is done once. After that, "read line, run regex, print line" is about as fast as this usually gets without complicated heroics.

    Another Monk asked about the OP's purpose. Sometimes a post is just an academic question, but it sounds like there is some real application here that we don't understand. The only reason to put these "markers" into the text is for later processing. Maybe that processing, whatever it is, can be combined into a single step? That could lead to a big speed increase, since the second step will otherwise have to search the entire text to find the bbb markers yet again.

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Inline::Files;

        my %seq;  # example: 'scooped up again' => 'scoopedbbbupbbbagain',

        while (my $line = <SEQ>) {
            $line =~ s/^\s+//;
            $line =~ s/\s+$//;
            ($seq{$line} = $line) =~ s/\h+/bbb/g;
        }

        my $search_phrases = join '|', keys %seq;
        my $re = qr{($search_phrases)};

        while (<TXT>) {
            s/$re/$seq{$1}/g;
            print;
        }

        __SEQ__
        scooped up by
        social travesty
        without proper sanitation
        __TXT__
        Many of them are scooped up by chambermaids, thrown into bin bags and sent off to landfill sites, which is a disaster for the environment and a social travesty given that many people around the world are going without proper sanitation.

      G'day Marshall,

      Thanks for the positive feedback. I have some comments on your first three points.

      Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors. I can't remember where, though I have an inkling it may have been in a book. Note that the regex I used was anchored at both ends (/^\s*(.*?)\s*$/). Whether two easy regexes beat one complex regex is going to depend on their relative complexity and on the string operated on. I wrote this benchmark:

          #!/usr/bin/env perl -l

          use strict;
          use warnings;

          use constant STRING => " \t aaa bbb ccc \t \n";

          use Benchmark 'cmpthese';

          print 'Sanity Tests:';
          print 'shoura: >', shoura_code(), '<';
          print 'kcott: >', kcott_code(), '<';
          print 'marshall: >', marshall_code(), '<';

          cmpthese 0 => {
              S => \&shoura_code,
              K => \&kcott_code,
              M => \&marshall_code,
          };

          sub shoura_code {
              local $_ = STRING;
              chomp;
              s/^\s+|\s+$//g;
              return $_;
          }

          sub kcott_code {
              local $_ = STRING;
              ($_) = /^\s*(.*?)\s*$/;
              return $_;
          }

          sub marshall_code {
              local $_ = STRING;
              s/^\s+//;
              s/\s+$//;
              return $_;
          }

      I ran it five times — that's usual for me — here's the result that was closest to an average:

          Sanity Tests:
          shoura: >aaa bbb ccc<
          kcott: >aaa bbb ccc<
          marshall: >aaa bbb ccc<
                Rate    S    M    K
          S 292306/s   -- -32% -37%
          M 432626/s  48%   --  -7%
          K 464863/s  59%   7%   --

      There was quite a lot of variance; although 'K' was always faster than 'M'. The five K-M percentages were: 9, 7, 2, 14, 7. Both 'K' and 'M' were always substantially faster than 'S'.
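
The benchmark exercises a single input, so a quick equivalence check on a few edge cases may be worth running before comparing speeds. The sketch below is an addition, not part of the benchmark: the sub names and the extra inputs are made up for illustration, and every input is a single line, as it would be when read with <SEQ>.

```perl
use strict;
use warnings;

# The three trimming approaches from the benchmark, taking an argument
# instead of working on a constant.
sub trim_alternation { my ($s) = @_; chomp $s; $s =~ s/^\s+|\s+$//g; return $s }
sub trim_capture     { my ($s) = @_; ($s) = $s =~ /^\s*(.*?)\s*$/;    return $s }
sub trim_two_steps   { my ($s) = @_; $s =~ s/^\s+//; $s =~ s/\s+$//;  return $s }

my @trimmers = (\&trim_alternation, \&trim_capture, \&trim_two_steps);

# Padded text, unpadded text, whitespace only, and the empty string.
for my $input (" \t aaa bbb ccc \t \n", "no padding", "   \n", "") {
    my @got  = map { $_->($input) } @trimmers;
    my $same = ($got[0] eq $got[1] && $got[1] eq $got[2]) ? 'same' : 'DIFFERENT';
    printf "%-9s >%s<\n", $same, $got[0];
}
```

All three agree on these inputs. The single-capture form relies on "." not matching a newline, so it should only be given one line at a time.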

      Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
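
For readers unfamiliar with the '@{[...]}' construct, here is a minimal sketch showing that it and the intermediate-scalar form build the same regex. The variable names are illustrative only, and the phrase list is copied from the sample data above.

```perl
use strict;
use warnings;

my @phrases = ('scooped up by', 'social travesty', 'without proper sanitation');

# Form 1: interpolate the result of an expression via an anonymous array
# that is immediately dereferenced inside the pattern.
my $re_inline = qr{(@{[join '|', @phrases]})};

# Form 2: build the alternation in an intermediate scalar first.
my $alternation = join '|', @phrases;
my $re_scalar   = qr{($alternation)};

# A compiled regex stringifies to its pattern, so the two can be compared.
print "$re_inline" eq "$re_scalar" ? "identical\n" : "different\n";   # identical
```

Either way the pattern is compiled once, outside the loop, which is where the real saving comes from.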

      Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

          my %seq = (
              'W X Y' => 'WbbbXbbbY',
              'X Y' => 'XbbbY',
              'X Y Z' => 'XbbbYbbbZ',
          );

      If the target string was "W X Y Z", the result could be one of these three:

          W XbbbY Z
          WbbbXbbbY Z
          W XbbbYbbbZ

      Sorting by length would reduce that to two results. There may well be a requirement to also sort lexically. Perhaps like this:

      sort { length $b <=> length $a || $a cmp $b }

      But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.
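
To make the ordering effect concrete, here is a small sketch that applies each phrase as its own substitution, one after another, in a given order. Sequential application is an assumption made for illustration, and the helper apply_in_order is hypothetical; \Q...\E is added so the phrases are matched literally.

```perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply one substitution per phrase, in the order given.
sub apply_in_order {
    my ($text, @order) = @_;
    for my $phrase (@order) {
        $text =~ s/\Q$phrase\E/$seq{$phrase}/g;
    }
    return $text;
}

# Whichever phrase is applied first "wins", because the markers it inserts
# stop the overlapping phrases from matching afterwards.
print apply_in_order('W X Y Z', 'X Y', 'W X Y', 'X Y Z'), "\n";   # W XbbbY Z
print apply_in_order('W X Y Z', 'X Y Z', 'W X Y', 'X Y'), "\n";   # W XbbbYbbbZ

# Longest first, ties broken lexically: the order no longer depends on
# the hash's internal key order.
my @sorted = sort { length $b <=> length $a || $a cmp $b } keys %seq;
print apply_in_order('W X Y Z', @sorted), "\n";                    # WbbbXbbbY Z
```

With a single combined alternation the regex engine takes the leftmost match first, so for this particular target 'W X Y' is matched whatever the order; there the order decides between phrases that can start at the same position, such as 'X Y' and 'X Y Z'.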

      — Ken
