Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Is there a way to allow consecutive zero-length matches without using pos()?

by jsm (Novice)
on Nov 03, 2017 at 00:40 UTC ( [id://1202659] : perlquestion . print w/replies, xml ) Need Help??

jsm has asked for the wisdom of the Perl Monks concerning the following question:

I have a specialized parser of JavaScript (written in Perl) that often legitimately matches consecutive zero-length strings. As described here, Perl intentionally disallows this, to fix a class of potential infinite-loop bugs.

The normal solution is to set pos() on the input string to reset its zero-length flag. However, in Perls from 5.18 to 5.24, this is very slow when looping through large strings, performing as O(n^2). It may be this bug (my program is a CGI script, thus related to tainting). It's fixed in Perl 5.26, but my program is used in many contexts where the users are not able to upgrade their version of Perl.

My question is: Is there a way other than setting pos() to allow a string to have consecutive zero-length matches? I've tried pos($$in)= pos($$in) and pos($$in)+= 0, but either statement ends up doubling the run time of the whole script. The script is already CPU-intensive, so performance is important here.

(I picture a potential /z regex modifier that allows consecutive zero-length matches.)

Thanks a lot for any suggestions!

UPDATE: It appears to have nothing to do with tainting, but with the UTF-8 flag on the string. Here's a code sample that demonstrates the problem:

#!/usr/bin/perl use strict ; use warnings ; my $st= 'a' x 100000 ; # this program runs as O(n^2) utf8::upgrade($st) ; # ... when operating on a string with the u +tf8 flag set while (1) { $st=~ /\G(?=a)/gc ; pos($st)= pos($st) ; # without this, there's an early exit next +line last unless $st=~ /\G(?=a)/gc ; $st=~ /\Ga/gc ; # increments pos() } print "done\n" ;

Replies are listed 'Best First'.
Re: Is there a way to allow consecutive zero-length matches without using pos()?
by choroba (Cardinal) on Nov 03, 2017 at 09:04 UTC
    Can you show an example of "matching consecutive zero-length strings"?

    I can easily generate a random string with

    my $string = join q(), map +('a' .. 'z')[rand 26], 1 .. 10_000

    but I have no idea what regex(es) to try to match to play with it.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        Thanks for the context. Does rubasov's solution not answer this question? In what way does it fail to meet your criteria?

Re: Is there a way to allow consecutive zero-length matches without using pos()?
by ikegami (Patriarch) on Nov 05, 2017 at 19:46 UTC

    Sounds like you have

    s/\G \s* //xgc; # Skip optional whitespace

    when you should have

    s/\G \s+ //xgc; # Skip optional whitespace

      That's a great guess, but the usual culprit in my code is /\G(?=...)/ . JavaScript has some parsing quirks, and zero-width assertions help to identify the context a token is in. For example, "function" is a keyword but now may also be a property of an object. I'm trying to avoid analyzing the higher-level syntax just to tokenize it.

      I'm working on a simple test case now, which is turning out more complicated than I thought. I'll post it when I get it.

        I'm trying to avoid analyzing the higher-level syntax just to tokenize it.

        Apparently not, since you are trying ascribe meaning to function. That's the job of the parser, not the tokenizer.

Re: Is there a way to allow consecutive zero-length matches without using pos()?
by LanX (Saint) on Nov 03, 2017 at 09:12 UTC
    That's a complicated question and you should really provide a SSCCE to handle it.

    FWIW: my guts say you might use look behind assertions. ...

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Is there a way to allow consecutive zero-length matches without using pos()?
by Anonymous Monk on Nov 03, 2017 at 13:23 UTC
    It would help tremendously if you could concoct a little example program which creates the strings that you speak of, then illustrates the particular approach(es) that you have taken and/or that you are now considering. Especially if the scripts are such that we can, also, profile it on our own machines.

      OK, I'm putting together a test case now. It's turning out more complicated than I thought, but I'll post it here when I get it working.

      I profiled this with NYTProf and it is indeed the pos()= pos() statement that is taking all the extra time. However, the exact conditions that trigger it are currently eluding me; I'll continue to isolate it.