Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

5.10.0 regex slowdown

by BrowserUk (Pope)
on Feb 21, 2008 at 00:33 UTC ( #669148=perlquestion: print w/ replies, xml ) Need Help??
BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

The following code builds a large alternation regex which on 5.10.0 should cause demerphq's cool tries code to kick in, and it seems to. When I specify 13,104 alternations, it can search a 1MB string in under 3 seconds which is very impressive. However, if I increase it by one alternation to 13,105, it suddenly goes pathelogical and takes 646 seconds to complete the same task.

The alternations I am using are 10-chars giving an array size of 13,105 * 10 + internals ~= 128k. This number rings a bell. I seem to recall that this is the magic number at which memory allocations bypass the normal allocation mechanism and go straight to allocating virtual memory? I've done a cursory grep of the sources trying to confirm this without success. Does anyone know anything about it?

It would also be useful to know if 5.10 builds other than AS1000 are similarly affected. If anyone with such a build has the time to try this out for me, the required command line options are shown after the __END__ token below:

#! perl -slw use strict; $|++; srand 1; sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] } our $N ||= 1000; our $S ||= 30e6; my $chrom = chr(0) x $S; substr( $chrom, $_, 1, ('A', 'C', 'G', 'T' )[ rand 4 ] ) for 0 .. $S-1 +; my @probes = map rndStr( 10, 'A', 'C', 'G', 'T' ), 0 .. $N; warn "Trying $N probes with perl $]"; warn time; my $re = join '|', @probes[ 0 .. $N ]; $re = qr[$re]; printf "$-[ 0 ] : '$1'\n" while $chrom =~ m[($re)]g; print "\n"; warn time; __END__ C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -S=1e6 -N=13104 | wc + -l Trying 13104 probes with perl 5.010000 at 668954.p10 line 18. 1203549864 at 668954.p10 line 20. 1203549867 at 668954.p10 line 27. 11282 C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -S=1e6 -N=13105 | wc + -l Trying 13105 probes with perl 5.010000 at 668954.p10 line 18. 1203549872 at 668954.p10 line 20. 1203550518 at 668954.p10 line 27. 11282

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on 5.10.0 regex slowdown
Download Code
Re: 5.10.0 regex slowdown
by rhesa (Vicar) on Feb 21, 2008 at 00:54 UTC
    I'm not even going to pretend having knowledge of the regex engine, but I'm happy to do the gruntwork :-)

    Confirmed on Debian Etch amd64 with a built-from-source perl 5.10.0:

    rhesa@apollo64:~/devel/test/pm$ perl -s 669148 -S=1e6 -N=13104 | wc -l Trying 13104 probes with perl 5.010000 at 669148 line 18. 1203554358 at 669148 line 20. 1203554362 at 669148 line 27. 11376 rhesa@apollo64:~/devel/test/pm$ perl -s 669148 -S=1e6 -N=13105 | wc -l Trying 13105 probes with perl 5.010000 at 669148 line 18. 1203554370 at 669148 line 20. ^C after about 500 seconds
    And on a 32 bits build under CentOS 4.x:
    [rhesa@mp-cent-07 tmp]$ /opt/perl5.10/bin/perl -s 669148 -S=1e6 -N=131 +04 | wc -l Trying 13104 probes with perl 5.010000 at 669148 line 18. 1203554602 at 669148 line 20. 1203554606 at 669148 line 27. 11376 [rhesa@mp-cent-07 tmp]$ /opt/perl5.10/bin/perl -s 669148 -S=1e6 -N=131 +05 | wc -l Trying 13105 probes with perl 5.010000 at 669148 line 18. 1203554645 at 669148 line 20. ^C after about 400 seconds
Re: 5.10.0 regex slowdown
by Limbic~Region (Chancellor) on Feb 21, 2008 at 01:14 UTC
    BrowserUk,
    Have you read perldelta paying attention to the new variable ${^RE_TRIE_MAXBUF}. Hopefully that's all it is (guessing).

    Cheers - L~R

      There seems to be precious little further info on that var, though google did turn up this

      This value by default is 65536 which corresponds to a 512kB temporary cache. Set this to a higher value to trade memory for speed when matching large alternations.

      But so far, setting it higher (2**17) or lower (256) doesn't seem to change the perform or the break point one iota. Maybe demerphq will happen by and give us the skinny on it.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        Ok, next thing you might want to look at (still guessing here) is the re pragma. Specifically the trie settings in 'Debug' mode (note 'Debug' ne 'debug'). I have sent a message to demerphq outside of PM channels in case he doesn't log in relatively soon as well.

        Cheers - L~R

        There seems to be precious little further info on that var

        Well, it's in in the perlvar manual page in my install ;-)

        But it looks like one has to use English to see its default value:

        qwurx [shmem] ~/perlmonks > perl5.10.0 -le 'print ${^RE_TRIE_MAXBUF}' qwurx [shmem] ~/perlmonks > perl5.10.0 -MEnglish -le 'print ${^RE_TRI +E_MAXBUF}' 65536

        But yes, the number 2**16 rings a bell. It looks like the TRIE-optimization has that limit (of tokens? objects? branches?). Running with -D512 reveals

        13104 probes:

        EXECUTING... Trying 13104 probes with perl 5.010000 at 669148.pl line 18. 1203590260 at 669148.pl line 20. Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC| +CCTCA"... Final program: 1: TRIEC-EXACT[ACGT] (65526) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <AACAGTGAGG> <GAAACTCGCG> <GAGAGATGGA> 65526: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG +TGTTT"... Final program: 1: OPEN1 (3) 3: TRIEC-EXACT[ACGT] (65529) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <AACAGTGAGG> <GAAACTCGCG> <GAGAGATGGA> 65529: CLOSE1 (65531) 65531: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Matching REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGT +GTTT"... against "ACTCGAATTCCGAATAGATAGAAGTCTGCTGATAATATCGCGCCGGT TCTGATGCGCCTC"... Matching stclass AHOCORASICKC-EXACT[ACGT] against "ACTCGAATTCCGAATAGAT +AGAAGTCTGCTGATAATATCGCGCCGGTTCTGATGCGCCTC"... (1000000 chars) 0 <> <ACTCGAATTC> | Charid: 2 CP: 41 State: 1, word=0 +- legal 1 <A> <CTCGAATTCC> | Charid: 4 CP: 43 State: 52, word=0 +- legal 2 <AC> <TCGAATTCCG> | Charid: 1 CP: 54 State: ce, word=0 +- legal 3 <ACT> <CGAATTCCGA> | Charid: 4 CP: 43 State: 1a2, word=0 +- legal 4 <ACTC> <GAATTCCGAA> | Charid: 3 CP: 47 State: 1a3, word=0 +- legal

        13105 probes:

        EXECUTING... Trying 13105 probes with perl 5.010000 at 669148.pl line 18. 1203590218 at 669148.pl line 20. Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC| +CCTCA"... Final program: 1: TRIEC-EXACT[ACGT] (65531) <TATGTTTCGT> <CCGCTTTTTA> <CGAAGATTTC> ... <GAAACTCGCG> <GAGAGATGGA> <CGCCGAGGAT> 65531: END (0) stclass AHOCORASICKC-EXACT[ACGT] minlen 10 Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG +TGTTT"... Final program: 1: OPEN1 (3) 3: BRANCHJ (11) 5: EXACT <TATGTTTCGT> (9) 9: LONGJMP (104850) 11: BRANCHJ (19) 13: EXACT <CCGCTTTTTA> (17) 17: LONGJMP (104850) 19: BRANCHJ (27) 21: EXACT <CGAAGATTTC> (25) 25: LONGJMP (104850) ...

        The 65531: END (0) looks - though I don't know at all what it means - just too close to 2**16 ...

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: 5.10.0 regex slowdown
by almut (Canon) on Feb 21, 2008 at 01:30 UTC

    Strange.  32-bit Perl running on Linux (SuSE 10.2) x86_64:

    $ ./669148.pl -s 668954.p10 -S=1e6 -N=13104 | wc -l Trying 1000 probes with perl 5.010000 at ./669148.pl line 18. 1203556433 at ./669148.pl line 20. 1203556436 at ./669148.pl line 27. 28418 $ ./669148.pl -s 668954.p10 -S=1e6 -N=13105 | wc -l Trying 1000 probes with perl 5.010000 at ./669148.pl line 18. 1203556454 at ./669148.pl line 20. 1203556457 at ./669148.pl line 27. 28418

    (Printing ${^RE_TRIE_MAXBUF} shows nothing)

    Update: Ahh, the passing of the params didn't work... (I overlooked the 1000 in the output). When fixing this, I can actually reproduce the problem. Sorry for the noise.

Re: 5.10.0 regex slowdown
by parv (Priest) on Feb 21, 2008 at 01:51 UTC
    /local/perl-5.10/bin/perl5.10.0 p.pl -s 668954.p10 -S=1e6 -N=13104 | +wc -l Trying 1000 probes with perl 5.010000 at p.pl line 24. 1203558362 at p.pl line 26. 1203558364 at p.pl line 33. 28418 /local/perl-5.10/bin/perl5.10.0 p.pl -s 668954.p10 -S=1e6 -N=13105 | w +c -l Trying 1000 probes with perl 5.010000 at p.pl line 24. 1203558405 at p.pl line 26. 1203558407 at p.pl line 33. 28418

    ...with Perl 5.10 on CentOS5 (GNU/Linux) 64-bit x86 ...

      /local/perl-5.10/bin/perl5.10.0 p.pl -s 668954.p10 -S=1e6 -N=13104 | wc -l

      The -s should come before the script name, which in my example was 668944.p10. Eg. If you've named the script p.pl, you should use:

      /local/perl-5.10/bin/perl5.10.0 -s p.pl -S=1e6 -N=13104 | wc -l

      The clue is on the first line of the output: Trying 1000 probes with ...


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        '1000' did seem suspicious along with '668954.p10' value, but I did not care to investigate any further (near blind copy & pasto, baby). Sorry for the useless data point.

        After starting the program correctly ...

        Trying 13104 probes with perl 5.010000 at q.pl line 19. 1203567383 at q.pl line 21. 1203567386 at q.pl line 28. 11376 Trying 13105 probes with perl 5.010000 at q.pl line 19. 1203565007 at q.pl line 21. 1203565331 at q.pl line 28. 11377

Re: 5.10.0 regex slowdown
by demerphq (Chancellor) on Feb 21, 2008 at 10:39 UTC

    I suspect that what is happening is that perl is building multiple trie objects and then executing each in turn, because the sizeof the jump offset is insufficient to jump further than 128 k forward. It highly likely that i didnt spend enough time on handling this case and some further work would improve it. I havent actually run the code to inspect the output of re=debug and i suspect its too long to post. Ill try to get it run sometime soon.

    Regarding RE_MAXBUF, what that does is set a threshold over which perl will switch to a slower and less efficient construction algorithm. If its set to 0 it disables the trie optimisation outright, but it has to be set before the pattern is compiled, in a BEGIN for instance.

    ---
    $world=~s/war/peace/g

        It would neat to see the re=debug dump of your pattern, piped into grep TRIE to see whats going on. I suspect that you can do it easiest with perl -c on code with a use re 'debug'.

        perl -c -Mre=debug SCRIPT 2>&1 | grep TRIE

        Should do it.

        Actually if you could grep for JUMP and JMP too it would be good.

        ---
        $world=~s/war/peace/g

Re: 5.10.0 regex slowdown
by ysth (Canon) on Feb 27, 2008 at 09:35 UTC
    I ran in to this too and talked with demerphq about it. It looks to me like when an alternation series gets too long, it uses BRANCHJ instead of BRANCH, because LONGJMP is needed to jump to the end of the series. And I think the trie optimization only kicks in for BRANCHes. This may be simple to fix or it may not. Any hardy souls out there want to dive in?

    A simple testcase: compare

    perl -we'use re "debug"; qr/@{[join"|","a000".."u999"]}/'
    to
    perl -we'use re "debug"; qr/@{[join"|","a000".."v999"]}/'
    (on 5.8.x as well, to see the pre-optimization difference).

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://669148]
Approved by rhesa
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2014-07-31 02:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (244 votes), past polls