Re: 5.10.0 regex slowdown
by Limbic~Region (Chancellor) on Feb 21, 2008 at 01:14 UTC
|
BrowserUk,
Have you read perldelta paying attention to the new variable ${^RE_TRIE_MAXBUF}. Hopefully that's all it is (guessing).
| [reply] |
|
There seems to be precious little further info on that var, though google did turn up this
This value by default is 65536 which corresponds to a 512kB temporary cache. Set this to a higher value to trade memory for speed when matching large alternations.
But so far, setting it higher (2**17) or lower (256) doesn't seem to change the perform or the break point one iota. Maybe demerphq will happen by and give us the skinny on it.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
BrowserUk,
Ok, next thing you might want to look at (still guessing here) is the re pragma. Specifically the trie settings in 'Debug' mode (note 'Debug' ne 'debug'). I have sent a message to demerphq outside of PM channels in case he doesn't log in relatively soon as well.
| [reply] |
|
There seems to be precious little further info on that var
Well, it's in in the perlvar manual page in my install ;-)
But it looks like one has to use English to see its default value:
qwurx [shmem] ~/perlmonks > perl5.10.0 -le 'print ${^RE_TRIE_MAXBUF}'
qwurx [shmem] ~/perlmonks > perl5.10.0 -MEnglish -le 'print ${^RE_TRI
+E_MAXBUF}'
65536
But yes, the number 2**16 rings a bell. It looks like the TRIE-optimization has that
limit (of tokens? objects? branches?). Running with -D512 reveals
13104 probes:
EXECUTING...
Trying 13104 probes with perl 5.010000 at 669148.pl line 18.
1203590260 at 669148.pl line 20.
Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC|
+CCTCA"...
Final program:
1: TRIEC-EXACT[ACGT] (65526)
<TATGTTTCGT>
<CCGCTTTTTA>
<CGAAGATTTC>
...
<AACAGTGAGG>
<GAAACTCGCG>
<GAGAGATGGA>
65526: END (0)
stclass AHOCORASICKC-EXACT[ACGT] minlen 10
Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG
+TGTTT"...
Final program:
1: OPEN1 (3)
3: TRIEC-EXACT[ACGT] (65529)
<TATGTTTCGT>
<CCGCTTTTTA>
<CGAAGATTTC>
...
<AACAGTGAGG>
<GAAACTCGCG>
<GAGAGATGGA>
65529: CLOSE1 (65531)
65531: END (0)
stclass AHOCORASICKC-EXACT[ACGT] minlen 10
Matching REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGT
+GTTT"... against "ACTCGAATTCCGAATAGATAGAAGTCTGCTGATAATATCGCGCCGGT
TCTGATGCGCCTC"...
Matching stclass AHOCORASICKC-EXACT[ACGT] against "ACTCGAATTCCGAATAGAT
+AGAAGTCTGCTGATAATATCGCGCCGGTTCTGATGCGCCTC"... (1000000 chars)
0 <> <ACTCGAATTC> | Charid: 2 CP: 41 State: 1, word=0
+- legal
1 <A> <CTCGAATTCC> | Charid: 4 CP: 43 State: 52, word=0
+- legal
2 <AC> <TCGAATTCCG> | Charid: 1 CP: 54 State: ce, word=0
+- legal
3 <ACT> <CGAATTCCGA> | Charid: 4 CP: 43 State: 1a2, word=0
+- legal
4 <ACTC> <GAATTCCGAA> | Charid: 3 CP: 47 State: 1a3, word=0
+- legal
13105 probes:
EXECUTING...
Trying 13105 probes with perl 5.010000 at 669148.pl line 18.
1203590218 at 669148.pl line 20.
Compiling REx "TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TGTGTTTAAC|
+CCTCA"...
Final program:
1: TRIEC-EXACT[ACGT] (65531)
<TATGTTTCGT>
<CCGCTTTTTA>
<CGAAGATTTC>
...
<GAAACTCGCG>
<GAGAGATGGA>
<CGCCGAGGAT>
65531: END (0)
stclass AHOCORASICKC-EXACT[ACGT] minlen 10
Compiling REx "((?-xism:TATGTTTCGT|CCGCTTTTTA|CGAAGATTTC|GAACGACGGC|TG
+TGTTT"...
Final program:
1: OPEN1 (3)
3: BRANCHJ (11)
5: EXACT <TATGTTTCGT> (9)
9: LONGJMP (104850)
11: BRANCHJ (19)
13: EXACT <CCGCTTTTTA> (17)
17: LONGJMP (104850)
19: BRANCHJ (27)
21: EXACT <CGAAGATTTC> (25)
25: LONGJMP (104850)
...
The 65531: END (0) looks - though I don't know at all what it means - just too close to 2**16 ...
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
| [reply] [d/l] [select] |
|
|
|
|
|
Re: 5.10.0 regex slowdown
by rhesa (Vicar) on Feb 21, 2008 at 00:54 UTC
|
rhesa@apollo64:~/devel/test/pm$ perl -s 669148 -S=1e6 -N=13104 | wc -l
Trying 13104 probes with perl 5.010000 at 669148 line 18.
1203554358 at 669148 line 20.
1203554362 at 669148 line 27.
11376
rhesa@apollo64:~/devel/test/pm$ perl -s 669148 -S=1e6 -N=13105 | wc -l
Trying 13105 probes with perl 5.010000 at 669148 line 18.
1203554370 at 669148 line 20.
^C after about 500 seconds
And on a 32 bits build under CentOS 4.x:
[rhesa@mp-cent-07 tmp]$ /opt/perl5.10/bin/perl -s 669148 -S=1e6 -N=131
+04 | wc -l
Trying 13104 probes with perl 5.010000 at 669148 line 18.
1203554602 at 669148 line 20.
1203554606 at 669148 line 27.
11376
[rhesa@mp-cent-07 tmp]$ /opt/perl5.10/bin/perl -s 669148 -S=1e6 -N=131
+05 | wc -l
Trying 13105 probes with perl 5.010000 at 669148 line 18.
1203554645 at 669148 line 20.
^C after about 400 seconds
| [reply] [d/l] [select] |
Re: 5.10.0 regex slowdown
by demerphq (Chancellor) on Feb 21, 2008 at 10:39 UTC
|
I suspect that what is happening is that perl is building multiple trie objects and then executing each in turn, because the sizeof the jump offset is insufficient to jump further than 128 k forward. It highly likely that i didnt spend enough time on handling this case and some further work would improve it.
I havent actually run the code to inspect the output of re=debug and i suspect its too long to post. Ill try to get it run sometime soon.
Regarding RE_MAXBUF, what that does is set a threshold over which perl will switch to a slower and less efficient construction algorithm. If its set to 0 it disables the trie optimisation outright, but it has to be set before the pattern is compiled, in a BEGIN for instance.
---
$world=~s/war/peace/g
| [reply] |
|
| [reply] |
|
perl -c -Mre=debug SCRIPT 2>&1 | grep TRIE
Should do it.
Actually if you could grep for JUMP and JMP too it would be good.
---
$world=~s/war/peace/g
| [reply] [d/l] |
|
|
Re: 5.10.0 regex slowdown
by almut (Canon) on Feb 21, 2008 at 01:30 UTC
|
$ ./669148.pl -s 668954.p10 -S=1e6 -N=13104 | wc -l
Trying 1000 probes with perl 5.010000 at ./669148.pl line 18.
1203556433 at ./669148.pl line 20.
1203556436 at ./669148.pl line 27.
28418
$ ./669148.pl -s 668954.p10 -S=1e6 -N=13105 | wc -l
Trying 1000 probes with perl 5.010000 at ./669148.pl line 18.
1203556454 at ./669148.pl line 20.
1203556457 at ./669148.pl line 27.
28418
(Printing ${^RE_TRIE_MAXBUF} shows nothing)
Update: Ahh, the passing of the params didn't work... (I
overlooked the 1000 in the output). When fixing this, I can actually reproduce the problem. Sorry for the noise.
| [reply] [d/l] [select] |
Re: 5.10.0 regex slowdown
by parv (Vicar) on Feb 21, 2008 at 01:51 UTC
|
/local/perl-5.10/bin/perl5.10.0 p.pl -s 668954.p10 -S=1e6 -N=13104 |
+wc -l
Trying 1000 probes with perl 5.010000 at p.pl line 24.
1203558362 at p.pl line 26.
1203558364 at p.pl line 33.
28418
/local/perl-5.10/bin/perl5.10.0 p.pl -s 668954.p10 -S=1e6 -N=13105 | w
+c -l
Trying 1000 probes with perl 5.010000 at p.pl line 24.
1203558405 at p.pl line 26.
1203558407 at p.pl line 33.
28418
...with Perl 5.10 on CentOS5 (GNU/Linux) 64-bit x86 ...
| [reply] [d/l] [select] |
|
/local/perl-5.10/bin/perl5.10.0 -s p.pl -S=1e6 -N=13104 | wc -l
The clue is on the first line of the output: Trying 1000 probes with ...
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
'1000' did seem suspicious along with '668954.p10' value, but I did not care to investigate any further (near blind copy & pasto, baby). Sorry for the useless data point.
After starting the program correctly ...
Trying 13104 probes with perl 5.010000 at q.pl line 19.
1203567383 at q.pl line 21.
1203567386 at q.pl line 28.
11376
Trying 13105 probes with perl 5.010000 at q.pl line 19.
1203565007 at q.pl line 21.
1203565331 at q.pl line 28.
11377
| [reply] [d/l] |
Re: 5.10.0 regex slowdown
by ysth (Canon) on Feb 27, 2008 at 09:35 UTC
|
perl -we'use re "debug"; qr/@{[join"|","a000".."u999"]}/'
to
perl -we'use re "debug"; qr/@{[join"|","a000".."v999"]}/'
(on 5.8.x as well, to see the pre-optimization difference).
| [reply] [d/l] [select] |