Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Interesting Perl/Java regexp benchmarking

by dreadpiratepeter (Priest)
on Aug 23, 2004 at 15:03 UTC ( [id://385094]=perlmeditation: print w/replies, xml ) Need Help??

A friend sent me a url with a comparison of Perl and Java regexp parsing where the Java parsing outperformed the Perl parsing 2 to 1 for a particular expression.
Unfortunately I lack the cycles to verify his results so I thought I'd post it here in the hopes that a fellow monk might be inspired to verify or de-bunk the claim.

The Comparison


-pete
"Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."
  • Comment on Interesting Perl/Java regexp benchmarking

Replies are listed 'Best First'.
Re: Interesting Perl/Java regexp benchmarking
by tilly (Archbishop) on Aug 23, 2004 at 16:08 UTC
    Without testing I'd have to say that I would not be surprised if Perl really was slower. Perl also tends to lose on complex regular regular expressions to CL-PPCRE for Perl 5.6 on up. And probably others.

    However take the string ("foo " x 100)."fo" and try to match it to the regular expression /^(\s*foo\s*)*$/ and Perl 5.6 and up immediately figures out that it doesn't match, while the others won't make significant progress in the projected lifetime of the Sun.

    The two issues are connected: to keep the disaster expression from being a disaster, Perl keeps track of extra information during a match to avoid redoing the same work again and again. On most regular expressions this tracking takes a lot of work and buys you little to nothing. However every so often you aren't left wondering why your program unexpectedly froze.

Re: Interesting Perl/Java regexp benchmarking
by Anonymous Monk on Aug 23, 2004 at 21:15 UTC
    Here is my attempt at commenting the perl regex Bray uses. Note that Unicode sub-properties are heavily used.
    (<[^/]([^>]*[^/>])?>) # xml start tag, e.g. <tag> |(</[^>]*>) # xml end tag, e.g. </tag> |(<[^>]*/>) # xml empty tag, e.g. <tag /> | # Get text at least two characters long that begins with a letter, # number, or CJK ideograph, that may also contain some # punctuation, and that ends with a letter, number, or CJK # ideograph. ( ( \p{Lu} # Uppercase Letter |\p{Ll} # Lowercase Letter |\p{Lt} # Titlecase Letter |\p{Nd} # Decimal Digit Number, i.e. 0-9 |\p{Nl} # Letter Number, e.g. Roman numerals |\p{No} # Other Number |[\x{4e00}-\x{9fa5}] # CJK Unified Ideographs |\x{3007} # Ideographic number zero |[\x{3021}-\x{3029}] # Hangzhou numerals 1-9 ) ( ( \p{Lu} |\p{Ll} |\p{Lt} |\p{Nd} |\p{Nl} |\p{No} |[-._:'] # some punctuation |[\x{4e00}-\x{9fa5}] |\x{3007} |[\x{3021}-\x{3029}] )* ( \p{Lu} |\p{Ll} |\p{Lt} |\p{Nd} |\p{Nl} |\p{No} |[\x{4e00}-\x{9fa5}] |\x{3007} |[\x{3021}-\x{3029}] ) )? )

      Thanks for taking the time to reformat this. It highlights what I already thought when I glanced at the regexp in the article. I think the writer is confusing capturing and grouping.

      I suspect that all the sub-captures for picking up the tricky Unicode stuff are not necessary. In fact, I don't think the alternations for picking up element begin and end tags need to captured either.

      They should all be grouped (read: (?:...)) and a single capturing paren around the whole mess. You could then walk down the stream with a while:

      my $token; while( $token = $stream =~ /( <[^/]([^>]*[^/>])?> | </[^>]*> | <[^>]*/> | (?: ... ) # unicode goop )/gx ) { print $token; }

      Making all those subcaptures available in $1, $2, $3... takes a significant amount of time which could account for Perl's poor showing. (Disclaimer: I have no idea whether Java makes the same distinction between capturing and grouping. If it does so, it's expending as much effort as Perl and my reasoning would be incorrect).

      - another intruder with the mooring of the heat of the Perl

Re: Interesting Perl/Java regexp benchmarking
by glwtta (Hermit) on Aug 23, 2004 at 16:08 UTC
    8 minutes for a 100MB file? Sounds a little excessive.

    Unfortunately there's not much one can say about it without having the file he was parsing on hand.

Re: Interesting Perl/Java regexp benchmarking
by perlfan (Parson) on Aug 24, 2004 at 13:08 UTC
    Unfortunately, they don’t produce quite the same results, with occasional variation around international characters. But close enough for the purpose of this experiment.

    This could introduce HUGE discrepancies in regards to what is actually going on because regexs are compiled into FSAs, and you can't accurately compare FSAs that have different end states.
Re: Interesting Perl/Java regexp benchmarking
by Anonymous Monk on Sep 09, 2004 at 12:38 UTC
    So, we have one test, of which we don't know what the test program is, nor the data it acted upon. We do know that the results are different. Furthermore, the Java version is supposed to be over 5 minutes faster - however, if the results are displayed, those 5 minutes more than vanish.

    I can't really say that such a "test" impresses me.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://385094]
Approved by ybiC
Front-paged by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (2)
As of 2025-07-17 03:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.