Regex match is very slow against deref'd reference to substr/lvalue, is it normal?

by Anonymous Monk
on Aug 11, 2024 at 22:33 UTC ( [id://11161007]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is a kind of continuation of the questions I've been asking after bumping into unexpected regex performance issues; the last one was 11155604, I think. This one is also observed with the very latest strawberry-perl-5.40.0.1-64bit-PDL, so perhaps it's something new.

I'm trying to improve one of the CPAN modules that deals with PDF. The string below simulates a classic cross-reference table, with the number of entries and the amount of preceding file data roughly the same as in one of the PDF files I'm using for tests.

Method (1) is similar to the original. I tried (4) first, with the vague idea of not creating useless copies of data. However, that is when I noticed that, while the other changes (not relevant here) were steady speed/memory gains, everything unexpectedly got very slow. So I concocted the SSCCE below to ask whether this is a bug in Perl or not. Also, strangely, the results of (4) vary somewhat from run to run, sometimes as "fast" as 1.33 s.

(Now I think I'll perhaps use (3) going forward, after checking whether the global anchor is maintained/used elsewhere by the module. The question about a bug in Perl remains, as an accidental by-product of otherwise idle investigations.)

use strict;
use warnings;
use feature 'say';
use Time::HiRes 'time';

say $^V;

my $s = '*' x 5_000_000;
$s .= "0123456789 01234 n \n" x 40_000;

my $re = qr/ (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /x;
my ( $xref, $t );

# (1) peel off entry by entry
$xref = substr $s, 5_000_000;    # from shorter string
$t = time;
for ( 0 .. 39_999 ) {
    my $entry = substr $xref, $_ * 20, 20;
    die unless $entry =~ / \A $re /x;
    # do something useful with captures
}
say time - $t;

# (2) global match (shorter string)
$xref = substr $s, 5_000_000;
$t = time;
for ( 0 .. 39_999 ) {
    die unless $xref =~ / \G $re /gx;
}
say time - $t;

# (3) global match (original string),
pos( $s ) = 5_000_000;           # start from pos
$t = time;
for ( 0 .. 39_999 ) {
    die unless $s =~ / \G $re /gx;
}
say time - $t;

# (4) use reference to substr
$xref = \substr $s, 5_000_000;
$t = time;
for ( 0 .. 39_999 ) {
    die unless $$xref =~ / \G $re /gx;
}
say time - $t;

__END__
v5.40.0
0.0973920822143555
0.04703688621521
0.0475959777832031
3.08383107185364
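A minimal sketch of how approach (3) might look once the captures are actually used, assuming the same 20-byte entry layout as the test string above. The tiny input, the hash keys `offset`, `generation` and `type`, and the save/restore of `pos` are illustrative assumptions, not code from the module in question.

```perl
use strict;
use warnings;

# Hypothetical miniature of the test data: 50 filler bytes, then 4 entries.
my $table_start = 50;
my $s = ( '*' x $table_start ) . ( "0123456789 01234 n \n" x 4 );

my $re = qr/ (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /x;

# Remember any pos() already set on $s, then start matching at the table.
my $saved_pos = pos($s);
pos($s) = $table_start;

my @entries;
while ( $s =~ / \G $re /gcx ) {
    # Key names are assumptions made for this sketch.
    push @entries, { offset => $1 + 0, generation => $2 + 0, type => $3 };
}

my $table_end = pos($s);    # /c keeps pos() where the last match ended
pos($s) = $saved_pos;       # restore the previous anchor (undef resets it)

printf "%d entries, table ends at byte %d\n", scalar @entries, $table_end;
```

The `/c` modifier is used so that the failing match which ends the loop does not reset `pos`, which leaves the end of the table available afterwards. No substring is copied or rebuilt at any point.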

Replies are listed 'Best First'.
Re: Regex match is very slow against deref'd reference to substr/lvalue, is it normal?
by dave_the_m (Monsignor) on Aug 12, 2024 at 07:17 UTC
    It's nothing to do with regexes and everything to do with lvalue substr. Taking a reference to substr doesn't create a new string; it just creates a special magical object with a pointer to the original string and the values of the substr arguments. When $xref is dereferenced, it triggers the magic, which makes an actual string based on those values. Since you have $$xref in a loop, that causes a temporary copy of the substring to be made and thrown away 40,000 times.

    Dave.
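The behaviour described in the reply above can be observed on a small string. The following is only a sketch: the miniature input and the variable names are assumptions made for illustration, and the "dereference once" workaround is one possible way to avoid the repeated copy, not something the reply prescribes.

```perl
use strict;
use warnings;

# Small stand-in for the PDF data: 10 filler bytes, then 3 xref-like entries.
my $s = ( '*' x 10 ) . ( "0123456789 01234 n \n" x 3 );

# A reference to substr is a magical LVALUE, not a reference to a new string.
my $xref = \substr $s, 10;
my $kind = ref $xref;

# The magic still points at $s: an in-place change to $s shows through.
substr( $s, 10, 1 ) = '9';
my $first = substr $$xref, 0, 10;    # this dereference rebuilds the substring

# Workaround sketch: dereference once, then match against the plain copy.
my $copy  = $$xref;
my $count = 0;
$count++ while $copy =~ / \G (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /gx;

print "$kind $first $count\n";
```

Copying once costs a single substring copy instead of one per iteration, which is essentially what method (2) in the question already does.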

      Thanks. No regexes involved, then. My confusion was partly because the "magical object" (MO), in addition to the attributes you mention, also keeps the global anchor position between global match invocations, even though the temporary buffer is thrown away each time, as you explained. So the MO is smart enough to notice when the referent has changed and reset this anchor, but not smart enough to somehow employ COW (if it observes the referent for changes anyway) to avoid a physical copy on every dereference? I thought a Perl scalar keeps an offset and the actual length of the string, even though the physical buffer may extend on either side, i.e. the same as the result of substr. OK, perhaps the reality of the MO is more complex than my simple model above. Thanks for your answer, again.
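The first observation in the follow-up above, that the match position survives between dereferences, can be reproduced on a small scale. This is a sketch with a hypothetical three-entry input; it only shows the position carrying over, as loop (4) in the question relies on, and makes no claim about when the position is reset.

```perl
use strict;
use warnings;

# Hypothetical miniature input: 8 filler bytes, then 3 distinct entries.
my $s = ( '*' x 8 )
      . join( '', map { sprintf "%010d 00000 n \n", $_ } 1 .. 3 );

my $re   = qr/ (\d{10}) \x{20} (\d{5}) \x{20} (\w) \s\s /x;
my $xref = \substr $s, 8;

my @offsets;
for my $pass ( 1 .. 3 ) {
    # Each $$xref rebuilds the substring, yet \G resumes where the
    # previous pass stopped, as in loop (4) of the question.
    die "no match on pass $pass" unless $$xref =~ / \G $re /gx;
    push @offsets, $1 + 0;
}

print join( ',', @offsets ), "\n";
```

Because the three entries differ, the collected values show whether each pass matched a new entry rather than the first one again.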

Re: Regex match is very slow against deref'd reference to substr/lvalue, is it normal?
by Discipulus (Canon) on Aug 12, 2024 at 10:11 UTC
    Hello,

    You already got the answer; I just want to add something that looks odd to me: 5.28.0 is much slower on (4) than the other Strawberry Perl builds I have:

    version | (1)                | (2)                | (3)                | (4)              | status and perl
    v5.12.3 | 0.0323190689086914 | 0.0214970111846924 | 0.0220029354095459 | 1.32348299026489 | [OK] .\perl5.12-32bit\perl\bin\perl.exe
    v5.20.3 | 0.0418579578399658 | 0.0302300453186035 | 0.0331590175628662 | 1.98582315444946 | [OK] .\perl5.20.64bit\perl\bin\perl.exe
    v5.22.3 | 0.039478063583374  | 0.0311381816864014 | 0.0301609039306641 | 3.66861701011658 | [OK] .\perl5.22.64bit\perl\bin\perl.exe
    v5.24.2 | 0.0452229976654053 | 0.0350570678710938 | 0.035099983215332  | 2.05799889564514 | [OK] .\perl5.24.64bit\perl\bin\perl.exe
    v5.26.0 | 0.0399658679962158 | 0.032555103302002  | 0.0302999019622803 | 2.04154109954834 | [OK] .\perl5.26.64bit\perl\bin\perl.exe
    v5.26.2 | 0.0418398380279541 | 0.0300610065460205 | 0.0328929424285889 | 3.8220419883728  | [OK] .\perl-5.26.64bit-PDL\perl\bin\perl.exe
    v5.28.0 | 0.0486619472503662 | 0.0337529182434082 | 0.0334742069244385 | 10.2188839912415 | [OK] .\perl5.28.32bit\perl\bin\perl.exe
    v5.28.1 | 0.0384888648986816 | 0.0307059288024902 | 0.0345809459686279 | 2.07106709480286 | [OK] .\perl5.28-64bit\perl\bin\perl.exe
    v5.32.0 | 0.0425641536712646 | 0.0332610607147217 | 0.0336699485778809 | 4.36432099342346 | [OK] .\perl5.32.64bit\perl\bin\perl.exe

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

Node Type: perlquestion [id://11161007]
Approved by GrandFather
Front-paged by Corion