Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: how to speed up that dynamic regex?

by Athanasius (Archbishop)
on Nov 04, 2014 at 13:42 UTC ( [id://1106030]=note: print w/replies, xml ) Need Help??


in reply to how to speed up that dynamic regex?

Hello rsFalse,

First, the code you have does not appear to be producing correct results. For example, you have this line of input:

()(())()

which produces:

8 1

but by inspection we can see that the longest subsequence of validly-nested parentheses is (()) which is only 4 characters long.

Second, as Eily says you can get a significant speedup by using (?PARNO). For example, adapting the code in Can-I-use-Perl-regular-expressions-to-match-balanced-text of perlfaq6, I came up with this:

use strict; use warnings; use List::Util 'max'; srand; my @lines = <DATA>; my $n = 1000; my $long = ''; $long .= qw{( )}[ int(rand() * 2) ] for 1 .. $n; my $regex = qr{ ( # start of capture group 1 \( # match an opening parenthesis (?: [^()]++ # one or more non-parentheses, non backtra +cking | # OR (?1) # recurse to capture group 1 )* \) # match a closing parenthesis ) # end of capture group 1 }x; for my $string (@lines, $long) { my $max = 0; my $occ = 1; my @groups = $string =~ m/$regex/g; if (@groups) { @groups = sort { length $a <=> length $b } @groups; $max = length $groups[-1]; for my $i (reverse 0 .. $#groups - 1) { if (length $groups[$i] == length $groups[-1]) { ++$occ; } else { last; } } } print "$max $occ\n"; } __DATA__ )((())))(()()) ))( ()(())() ((((()((( (()())()(())()()())())()((()(()(())()()())((()(())()(()()()()))()(())( +)(((()())()(()((())()(())(()))

Typical output:

23:28 >perl 1071_SoPW.pl 6 2 0 1 4 1 2 1 20 1 266 1 23:28 >

For a long string this is much faster than using (??{ code }).

Hope that helps,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Replies are listed 'Best First'.
Re^2: how to speed up that dynamic regex?
by AnomalousMonk (Archbishop) on Nov 05, 2014 at 01:59 UTC
    my $regex = qr{ ( # start of capture group 1 ... (?1) # recurse to capture group 1 ... ) # end of capture group 1 }x;

    A neat feature of the  (?PARNO) extended pattern (available with Perl versions 5.10+) is that the numbering of capture groups can be made relative instead of absolute. This shines brightest when defining  qr// regex objects, which are designed to be interpolated into other  qr// m// s/// expressions. The logic of group recursion can be encapsulated and made independent of whatever other expressions go into the final regex.

    In Athanasius's code above, if even one more capturing group sneaks into the  m// ahead of  $regex in the extraction expression
        my @groups = $string =~ m/$regex/g;
    as in
        my @groups = $string =~ m/(x?)$regex/g;
    capture group numbering is thrown off and its function is destroyed. If the absolute  (?1) group recursion in
        my $regex = qr{ (... (?1) ...) }x;
    is made relative with  (?-1) then any number of extra preceding capture groups will make no difference to its function:
        my @groups = $string =~ m/(x?)(x?)(x?)$regex/g;

      It seems that using (?PARNO) in my example speeds up coping against long random line of parentheses at least 10 times on average.
      But still there are some frequent random data, that regex works slow. I changed the length of random line to 50000, and then sometimes regex cope in 0.1 sec, and sometimes it takes >5 sec.
      I only changed regex a little bit:
      my $regex = qr{ # start of full pattern \( # match an opening parenthesis (?: 0 # everytime false, because of such DATA | # OR (?0) # recurse to capture full pattern )* \) # match a closing parenthesis # end of full pattern }x;
      One more question: how to find consecutive balanced parentheses (and their max length with frequency of them)? I used "+" quantifier in: m/$regex+/g (/$m+/g). Now if I use this quantifier, program works with really bad asymptotics, but if I use possesive m/$regex++/g, it run faster but still slow. Why?
        Just solved that way:
        my @groups = (); my $pred = -1; my $j = 0; while ($string =~ m/$regex/g){ $pred == $-[0] or $j++; $groups[$j] += $+[0] - $-[0]; $pred = $+[0]; }
        And reduced regex to:
        qr{ # start of full pattern \( # match an opening parenthesis (?0)* # recurse to capture full pattern \) # match a closing parenthesis # end of full pattern }x;


        UPDATE: seems that using recursive regex is dangerous, because in rare random "unlucky cases" it takes very much time, although in often "lucky cases" it takes little amount of time. I found seemingly better and faster (or say more stable) way - just glue long regex and use+modify through iterations against the target string. Code:
        #use strict; use warnings; use Time::HiRes; $|++; srand; my @lines = <>; my $n = 50000; my $long = ''; $long .= qw{( )}[ int(rand() * 2) ] for 1 .. $n; push @lines, $long; my @lines2 = @lines; my $start = Time::HiRes::time(); my $depth = 10; # some depth parameter my $pre_regex2 = '(?:\((?:\d+,)*(?:\)|' x $depth . '(?!)' . '\)))+' x $depth; # if $depth == 1 then looks: '(?:\((?:\d+,)*(?:\)|x\)))+'; # (?: # \( # (?:\d+,)* (?: # \) # | # (?!)\) # ) # )+ my $regex2 = qr/$pre_regex2/; my $regex3 = qr/(?:\d+,){2,}/; for my $string (@lines2){ while ( $string =~ s/$regex2/ $a=$&, $b=()=$a=~m{[)(]}g, $a=~s{\d+,}{$b+=$&,''}eg, "$b," /eg or $string =~ s/$regex3/ $a=$&, $b=0, $a=~s{\d+,}{$b+=$&,''}eg, "$b," /eg ) {} my @groups = $string =~ m/\d+/g; my $max=0; for my $group (@groups){ $max < $group and $max = $group } my $occ = 0; for my $group (@groups){ $occ += $max == $group } $max or $occ=1; print "$max $occ\n"; } print Time::HiRes::time() - $start, $/; $start = Time::HiRes::time(); my $regex = qr{ \( # match an opening parenthesis (?0)* # recurse to capture full pattern \) # match a closing parenthesis }x; for my $string (@lines) { my $max = 0; my $occ = 1; my @groups = (); my $pred = -1; my $i = 0; while ($string =~ m/$regex/g){ $pred == $-[0] or $i++; $groups[$i] += $+[0] - $-[0]; $pred = $+[0]; } if (@groups) { @groups = sort { $a <=> $b } @groups; $max = $groups[-1]; for my $i (reverse 0 .. $#groups - 1) { if ( $groups[$i] == $groups[-1]) { ++$occ; } else { last; } } } print "$max $occ\n"; } print Time::HiRes::time() - $start, $/;
        ( http://ideone.com/K12225 )
        Usually first block takes ~0,5 sec, when second block (recursive regex) takes from 0.01 sec to more than 5 sec.
Re^2: how to speed up that dynamic regex?
by rsFalse (Chaplain) on Nov 04, 2014 at 23:45 UTC
    Such verbose! :) Thanks! I'll read about (?PARNO).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1106030]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (5)
As of 2024-04-23 09:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found