Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
This comes from Truncating Last Sentence. This is a discussion on why regexes might not be so hot in a particular arena, and why my module helps out.

I'd like it to be known that, on a string with many X's in it, doing

$str =~ s/X[^X]*$/X/;
is not very fast. It tries to match at each X it finds. You'd probably be better off doing
$str =~ s/(.*X).*/$1/;
Although in that case, it'd be sweet to use Regexp::Keep and say:
$str =~ s/.*X\K.*//;
or do it the two-regex way (since \K isn't core Perl, it's not as fast):
$str =~ /.*X/g and $str =~ s/\G.*//;
Or you could reverse the string:
($str = reverse $str) =~ s/^[^X]+//; $str = reverse $str;
Of course, for this task, a regex is probably the wrong tool. You can use string functions:
substr($str, rindex($str, "X")+1) = "";
Here's a benchmark of these methods (leaving out \K, because it's not worth showing):
use Benchmark 'cmpthese'; my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10; cmpthese(-5, { # is this the last X? last => sub { my $x = $str; $x =~ s/X[^X]*$/X/ }, # capture up to the last X and replace it with itself capt_repl => sub { my $x = $str; $x =~ s/(.*X).*/$1/ }, # match up to the last X, then remove everything after it rx_rx => sub { my $x = $str; $x =~ /.*X/g and $x =~ s/\G.*// }, # reverse, remove up to first X, reverse sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^X]+//; $x = reverse $x; }, # find the last X, remove everything after it substr => sub { my $x = $str; substr($x, rindex($x, "X")+1) = "" }, }); __END__ Rate last sexeger capt_repl rx_rx substr last 6398/s -- -84% -84% -90% -97% sexeger 39625/s 519% -- -4% -37% -83% capt_repl 41274/s 545% 4% -- -35% -82% rx_rx 63380/s 891% 60% 54% -- -73% substr 230796/s 3507% 482% 459% 264% --
Notice how much better-suited substr() is for this task.

You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.

cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, }); __END__ Rate last capt_repl rx_rx sexeger last 5591/s -- -82% -86% -86% capt_repl 30599/s 447% -- -21% -24% rx_rx 38948/s 597% 27% -- -3% sexeger 40049/s 616% 31% 3% --
This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle /.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle /.*[AB]/ as quickly).

So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.

/* C code */ printf("%s", strpbrk("breakfast", "aeiou")); /* prints: eakfast */
In Perl, it'd be like doing:
if ("breakfast" =~ /[aeiou]/) { $x = substr("breakfast", $-[0]); }
That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...

That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:

# crindex is char-class-based rindex() use String::Index 'crindex'; cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, crindex => sub { my $x = $str; substr($x, crindex($x, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")+1) = ""; }, }); __END__ Rate last capt_repl rx_rx sexeger crindex last 5581/s -- -82% -85% -86% -90% capt_repl 30606/s 448% -- -20% -23% -44% rx_rx 38194/s 584% 25% -- -4% -30% sexeger 39594/s 609% 29% 4% -- -28% crindex 54829/s 882% 79% 44% 38% --
We are restored. It's not as fast as the original substr() approach because it has to do more work, but it's faster than any other solution.
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

In reply to Regexes are slow (or, why I advocate String::Index) by japhy

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and a kettle whistles...

    How do I use this? | Other CB clients
    Other Users?
    Others avoiding work at the Monastery: (5)
    As of 2018-07-21 17:32 GMT
    Find Nodes?
      Voting Booth?
      It has been suggested to rename Perl 6 in order to boost its marketing potential. Which name would you prefer?

      Results (449 votes). Check out past polls.