
Confused by RegEx count

by Melly (Chaplain)
on Feb 20, 2024 at 09:18 UTC

Melly has asked for the wisdom of the Perl Monks concerning the following question:


I came across these bits of code on StackOverflow the other day, both of which count the occurrence of a character in a string, and for the life of me, I don't understand what's going on, or why all those steps are necessary:

If the character is constant, the following is best:
my $count = $str =~ tr/y//;
If the character is variable, I'd use the following:
my $count = length( $str =~ s/[^\Q$char\E]//rg );

Can anyone help me understand what's going on here?

map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
Tom Melly, pm (at) cursingmaggot (stop) co (stop) uk

Replies are listed 'Best First'.
Re: Confused by RegEx count
by hippo (Bishop) on Feb 20, 2024 at 09:50 UTC

    Let's examine the first one to explain what it is doing and how you could find this out. As usual, start at the right of the expression with tr/y// where tr is the transliteration operator. Note that this operator does not work with regular expressions but just deals with lists of characters. Every character in the first list (y) found in the operand ($str) is replaced by the corresponding character in the second list or, when the second list is empty as it is here, the matching character is left as it is. tr then returns the number of characters in the operand so treated. If there were no explicit operand it would use $_ by default.

    $ perl -E '$str="Fly guys try yoyos"; my $c = $str =~ tr/y//; say $c;'
    5

    How do you discover all this? Well, perldoc will tell you all about tr - you just need to know that the docs are the place to look. Try perldoc -f tr for all the good stuff.

    With that in mind, can you now work out what is going on in the second example?


Re: Confused by RegEx count
by eyepopslikeamosquito (Archbishop) on Feb 20, 2024 at 14:00 UTC

    If the character is constant, the following is best:

    my $count = $str =~ tr/y//;

    Can anyone help me understand what's going on here?

    Certainly. In the general case, Perl's tr operator transliterates all occurrences of the characters found, returning the number of characters replaced or deleted (for example, $str =~ tr/y/z/ changes all y characters in $str to z, returning the number of changes made). In the special case of an empty replacement list, as in your $str =~ tr/y// above, this operator does not change the string, it simply returns the number of matching characters found, in this case the number of y characters in the string.
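    To make the two cases above concrete, here is a minimal sketch (the sample string and variable names are only illustrative). It runs tr once with an empty replacement list and once with a replacement, on a copy so the original stays available for comparison:

```perl
use strict;
use warnings;

my $str = 'Fly guys try yoyos';

# Empty replacement list: $str is not modified, tr only counts the matches.
my $counted = ( $str =~ tr/y// );

# Non-empty replacement list: each y becomes z, and tr returns how many changed.
my $copy    = $str;
my $changed = ( $copy =~ tr/y/z/ );

print "counted=$counted str='$str'\n";
print "changed=$changed copy='$copy'\n";
```

    Both calls return the same number; only the second one alters its operand.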

    Note that the tr operator further supports various options, such as c to complement the search list, and that y is a synonym for tr (added to entice diehard sed users to Perl) ... so instead of $str =~ tr/y// you could equivalently write $str =~ y/y// to count the number of y characters in $str ... finally giving us enough background to understand Abigail's famous length horror:

    y///c

    which gives the same result as the prosaic length function, but is one character shorter. :-)
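    A small sketch of the y synonym and the c option, assuming the same illustrative sample string. With an empty search list, c complements "nothing" into "every character", which is why the count ends up equal to the string length:

```perl
use strict;
use warnings;

my $str = 'Fly guys try yoyos';

my $with_tr = ( $str =~ tr/y// );     # count of y characters
my $with_y  = ( $str =~ y/y// );      # same operator, sed-style spelling
my $not_y   = ( $str =~ tr/y//c );    # c complements the list: everything except y

# Empty search list plus c matches every character, so this is the length.
my $len_via_y = ( $str =~ y///c );

print "$with_tr $with_y $not_y $len_via_y ", length($str), "\n";
```

    None of these calls modifies $str, since the replacement list is empty each time.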

    Updated: removed accidental duplicate sentence and tweaked formatting.

Re: Confused by RegEx count
by Athanasius (Archbishop) on Feb 21, 2024 at 04:55 UTC

    Hello Melly,

    I see you’ve been given detailed explanations of the first snippet, which uses the transliteration operator tr///, but not of the second, which uses the substitution operator s///:

    my $count = length( $str =~ s/[^\Q$char\E]//rg );

    There is actually quite a bit of detail to unpack here. The effect of the code is to create a string from which all non-$char characters have been removed, and then count the remaining characters in that string (i.e., just the $char characters). Removal of the non-$char characters is accomplished by replacing them with the empty string in the replacement part of the substitution.

    • The /g modifier causes the substitution to be performed globally, i.e., repeatedly throughout the string. See
    • The /r modifier causes the substitution to be performed on a new string, which is returned, leaving the original string unchanged. See the same reference.
    • The square brackets create a character class, on which see
    • Within the character class, the initial ^ (caret) character negates the class, meaning it now matches any characters except those specified. See
    • The \Q escape sequence removes the special meaning from any metacharacters following it. For example, if $char has the value "]" (right square bracket) and is not escaped, the compiler will see it as the end-delimiter of the character class. The \Q ensures that the compiler sees it as merely the character "]". See
    • The \E escape sequence restores their normal meaning to any metacharacters that follow. See the same reference.
    • length is the built-in Perl function which returns the number of characters in a string.
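    The pieces listed above can be tried out one step at a time. This is only a sketch: the sample string is made up, and it assumes Perl 5.14 or later, since that is where the /r modifier became available. quotemeta is the function form of \Q...\E.

```perl
use strict;
use warnings;

my $str  = 'a.b]c.d]e.';
my $char = ']';    # unescaped, this would end the character class early

# /r returns the modified copy; /g removes every character that is not $char.
my $kept  = $str =~ s/[^\Q$char\E]//rg;
my $count = length $kept;

# The same idea for a dot, escaping with quotemeta() instead of \Q...\E.
my $dot  = quotemeta '.';
my $dots = length( $str =~ s/[^$dot]//rg );

print "kept='$kept' count=$count dots=$dots original='$str'\n";
```

    Because of /r, $str still holds its original value afterwards.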

    Hope that helps,

    Athanasius <°(((><contra mundum סתם עוד האקר של פרל,

Re: Confused by RegEx count
by choroba (Cardinal) on Feb 20, 2024 at 23:38 UTC
    Other monks have already explained what's going on. Let me point to efficiency of the solutions:

    Note that the transliteration is much faster than the other option. Even when the character is variable and we have to use string eval (whip! whip!), it's much faster.

    Instead of using substitution with length, you can use global substitution only, as it returns the number of replacements in scalar context. But it's still slower than transliteration:

    #! /usr/bin/perl
    use warnings;
    use strict;

    use Benchmark qw{ cmpthese };

    my $orig = 'Just another Perl hacker,' x 100;
    my $str = $orig;
    my $char = 'r';
    my $q = quotemeta $char;

    sub transliteration {
        my $count = eval "\$str =~ tr/$q//"
    }

    sub length_subst {
        my $count = length( $str =~ s/[^$q]//rg )
    }

    sub subst {
        my $count = $str =~ s/$q/$char/g
    }

    transliteration() eq length_subst() or die 'Different t-ls';
    transliteration() eq subst() or die 'Different t-s';
    $orig eq $str or die 'Changed';

    cmpthese(-3, {
        transliteration => \&transliteration,
        length_subst => \&length_subst,
        subst => \&subst,
    });

    __END__
                        Rate    length_subst           subst transliteration
    length_subst      2833/s              --            -91%            -97%
    subst            30244/s            968%              --            -70%
    transliteration 102423/s           3515%            239%              --

    Update: Introduced quotemeta to transliteration, too. It didn't change the results significantly.
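    For readers who want to see the "global substitution only" variant in isolation, here is a short sketch with an illustrative string. Each match is replaced by itself, so the string is unchanged and only the return value matters. One detail worth knowing: when nothing matches, s///g returns false, which behaves as an empty string in string context and as 0 in numeric context.

```perl
use strict;
use warnings;

my $str  = 'Just another Perl hacker,';
my $char = 'r';
my $q    = quotemeta $char;

my $before = $str;

# Replace every match with the same character; the return value is the count.
my $count = ( $str =~ s/$q/$char/g );

# No match at all: the result is false rather than a literal 0.
my $none = ( $str =~ s/Z/Z/g );

print "count=$count unchanged=", ( $str eq $before ? 'yes' : 'no' ), "\n";
```

    If the count is going to be printed or compared as a string, adding 0 to it is one way to get a plain number.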

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Or you could just match instead of substitute, for a few percent faster.
      sub match { my $count =()= $str =~ /$q/g }
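      The =()= in that sub is worth a short, hedged illustration (the string is only an example). The match runs in list context because it is assigned to the empty list, and a list assignment evaluated in scalar context yields the number of elements on its right-hand side, which is the number of matches:

```perl
use strict;
use warnings;

my $str = 'Just another Perl hacker,';
my $q   = quotemeta 'r';

# List-context /g returns all matches; assigning to () and then to a
# scalar yields how many there were.
my $count = () = $str =~ /$q/g;

# With no match the list is empty, so the result is a plain 0.
my $none = () = $str =~ /Z/g;

print "count=$count none=$none\n";
```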
        Interestingly, on my machine:
                            Rate length_subst  match  subst transliteration
        length_subst      2864/s           --   -89%   -90%            -97%
        match            25687/s         797%     --   -13%            -74%
        subst            29356/s         925%    14%     --            -70%
        transliteration  98682/s        3346%   284%   236%              --

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Confused by RegEx count
by eyepopslikeamosquito (Archbishop) on Feb 21, 2024 at 23:40 UTC

    Melly, I found a couple of articles discussing your interesting challenge from two well-respected Perl book authors:

    From brian_d_foy's article:

    Perl has a few ways to do it (many languages do too, but no one cares) because there are actually several different problems you could be solving. A good answer takes into account the level of problem that you are trying to solve.

    So if you provided more background and context on the specific problem you're trying to solve, we could provide better advice.

    While the benchmarks of this little problem are interesting and instructive, code clarity and maintainability are usually more important than performance, as analysed in much more detail in the node "on Code Optimization".

Re: Confused by RegEx count
by BillKSmith (Monsignor) on Feb 22, 2024 at 19:52 UTC
    Note that tr/SEARCHLIST/REPLACEMENTLIST/ is documented with "Quote Like Operators". That document shows that it can be used with variables with a little help from eval.
    Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
    eval "tr/$oldlist/$newlist/"; die $@ if $@;
    eval "tr/$oldlist/$newlist/, 1"; die $@ if $@;

    In your case that would be

    my $count = eval "\$str =~ tr/\Q$char\E//";
    die $@ if $@;
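    As a sketch of how the eval approach could be packaged, here is a small helper. The name count_char is hypothetical, not something from the thread or from Perl itself. It escapes the character with quotemeta before building the tr expression, and checks $@ in a separate statement so that a legitimate count of 0 is not mistaken for a failure:

```perl
use strict;
use warnings;

# count_char is a hypothetical helper name used only for this illustration.
sub count_char {
    my ( $str, $char ) = @_;

    # Escape the character so that /, - or \ cannot break the tr expression.
    my $q = quotemeta $char;

    # tr builds its table at compile time, so the variable has to be
    # spliced into source code and compiled with a string eval.
    my $count = eval "\$str =~ tr/$q//";
    die $@ if $@;

    return $count;
}

print count_char( 'Fly guys try yoyos', 'y' ), "\n";
print count_char( 'a/b/c',              '/' ), "\n";
```

    If $char holds more than one character, tr counts occurrences of any of them, which may or may not be what is wanted.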
Re: Confused by RegEx count
by Melly (Chaplain) on Mar 01, 2024 at 10:18 UTC

    Many thanks to all - really helpful stuff!

    map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
    Tom Melly, pm (at) cursingmaggot (stop) co (stop) uk
Re: Confused by RegEx count
by Danny (Pilgrim) on Feb 20, 2024 at 10:46 UTC
    Can anyone help me understand what's going on here?
    I'm not sure what your question is.

    EDIT: This was getting some bad press, I assume because there was an assumption that I was being condescending or something. The replies to this assumed the OP didn't understand how transliteration and substitution matching work, but this didn't even occur to me. I was genuinely unsure about what the question was and was asking for clarification.
