Re: Problems counting regex matches

by Eily (Monsignor)
on Jan 15, 2014 at 17:51 UTC ( [id://1070718]=note )


in reply to Problems counting regex matches

First, a piece of advice: with the /x modifier you can add spaces and comments to make long regexes easier to read. Second, instead of /regex/ you can write m<regex> (see perlop). Or you can save a regex in a variable using qr:

use Data::Dumper;

# /i goes on the qr itself: flags on the enclosing match
# do not apply to an interpolated qr object.
$regex = qr<
    \b
    (?:                                # Non capturing group
        ## Case 1: currency|foreign exchange comes second
        (revenues?|sales|growth)       # group 1
        \W+
        (?:\w+\W+){0,4}?               # Non capturing group
        (currency|foreign\Wexchange)   # group 2
      |
        ## Case 2: currency|foreign exchange comes first
        (currency|foreign\Wexchange)   # group 3
        \W+
        (?:\w+\W+){0,4}?               # Non capturing group
        (revenues?|sales|growth)       # group 4
    )
    \b
>xi;

$text = <<END_OF_TEXT;
foreign exchange revenue
currency revenue
END_OF_TEXT

print Dumper [ $text =~ /$regex/g ]

Now, there are 4 capturing groups, and you get four times as many elements as there are matches. That is simply because of how a /g match behaves in list context (according to Regexp Quote-Like Operators in perlop):

In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
So your 8-element list is actually the 4 groups of the first match, followed by the 4 groups of the second match. In both matches the strings are in groups 3 and 4 (groups 1 and 2 are undef), because (currency|foreign exchange) comes first.
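
To make the shape of that list visible, here is a minimal sketch. It uses the same pattern with the comments dropped and /i placed on the qr; the names @flat, @rest and @rows are only illustrative. It cuts the flat result into one row of four groups per match:

```perl
use strict;
use warnings;

# Same pattern as in the post, with /i placed on the qr itself.
my $regex = qr<
    \b
    (?:
        (revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (currency|foreign\Wexchange)
      |
        (currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (revenues?|sales|growth)
    )
    \b
>xi;

my $text = "foreign exchange revenue\ncurrency revenue\n";

# List context + /g: every match contributes all four groups.
my @flat = $text =~ /$regex/g;
print scalar(@flat), " elements\n";

# Cut the flat list into one row of four groups per match.
my @rows;
my @rest = @flat;
push @rows, [ splice @rest, 0, 4 ] while @rest;

for my $i (0 .. $#rows) {
    my @shown = map { defined $_ ? "'$_'" : 'undef' } @{ $rows[$i] };
    print "match ", $i + 1, ": @shown\n";
}
```

Each row should show undef in the first two slots and the matched strings in the last two.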

If you just want to count the matches, without getting the words that matched, turn all capturing parentheses (text) into non-capturing ones (?:text). Try my example first as is, and then again after changing the parentheses. If you want to know which word matched, the most beginner-friendly way I can think of is to loop over the successive matches of the regex and read either group 1 or group 4.
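
Both suggestions can be sketched as follows. This is only an illustration under the same assumptions as the example text above; $count_re, $word_re, $count and @words are made-up names. The first regex has only non-capturing groups, so a list-context /g match returns one element per match. The second keeps the four groups and is walked with a while loop in scalar context:

```perl
use strict;
use warnings;

my $text = "foreign exchange revenue\ncurrency revenue\n";

# Counting only: no capturing groups, so /g returns one element per match.
my $count_re = qr<
    \b
    (?:
        (?:revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (?:currency|foreign\Wexchange)
      |
        (?:currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (?:revenues?|sales|growth)
    )
    \b
>xi;

# The "= () =" idiom forces list context and yields the element count.
my $count = () = $text =~ /$count_re/g;
print "matches: $count\n";

# Which word matched: keep the four groups and loop over the matches.
my $word_re = qr<
    \b
    (?:
        (revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (currency|foreign\Wexchange)
      |
        (currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (revenues?|sales|growth)
    )
    \b
>xi;

my @words;
while ( $text =~ /$word_re/g ) {
    # Group 1 is set in case 1, group 4 in case 2; never both.
    push @words, defined $1 ? $1 : $4;
}
print "words: @words\n";
```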

Be aware that this won't work in all cases, though: if two matching words are in the neighbourhood of the same (currency|foreign exchange), only one will be counted. For example, in "Currency revenue sales growth" you'll get just "revenue", because the next match attempt starts after "revenue", where "currency" is no longer visible.
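
That limitation can be checked with a short sketch (again an illustration, with made-up variable names). Note that /i has to be on the qr for the capitalised "Currency" to match:

```perl
use strict;
use warnings;

my $regex = qr<
    \b
    (?:
        (revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (currency|foreign\Wexchange)
      |
        (currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (revenues?|sales|growth)
    )
    \b
>xi;

my $text = "Currency revenue sales growth";

my @words;
while ( $text =~ /$regex/g ) {
    # The next attempt starts where this match ended, i.e. after the
    # word just found, so "Currency" cannot be reused.
    push @words, defined $1 ? $1 : $4;
}

print scalar(@words), " match(es): @words\n";
```

Only "revenue" is reported; "sales" and "growth" are within range of "Currency" but are never paired with it.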

Re^2: Problems counting regex matches
by AnomalousMonk (Archbishop) on Jan 15, 2014 at 18:50 UTC

    Just to clarify: in list context, every capture group in an alternation returns something, whether it matched or not. If it did not match, undef is returned. One way to filter out these undefs is with a grep.

    >perl -wMstrict -MData::Dump -le
    "my $s = 'AAA BBB CCC DDD AAA BBB';
     ;;
     my @captures = $s =~ m{ (AAA) | (BBB) | (XXX) | (YYY) }xmsg;
     dd \@captures
     ;;
     my @matches = grep defined, $s =~ m{ (AAA) | (BBB) | (XXX) | (YYY) }xmsg;
     dd \@matches;
    "
    [
      "AAA",
      undef,
      undef,
      undef,
      undef,
      "BBB",
      undef,
      undef,
      "AAA",
      undef,
      undef,
      undef,
      undef,
      "BBB",
      undef,
      undef,
    ]
    ["AAA", "BBB", "AAA", "BBB"]

    Update: Another way to capture only matches is with the "branch reset" extended pattern (see "(?|pattern)" in Extended Patterns in perlre), available with Perl version 5.10+.

    >perl -wMstrict -MData::Dump -le
    "use 5.010;
     ;;
     my $s = 'AAA BBB CCC DDD AAA BBB';
     ;;
     my @matches = $s =~ m{ (?| (AAA) | (BBB) | (XXX) | (YYY) ) }xmsg;
     dd \@matches
    "
    ["AAA", "BBB", "AAA", "BBB"]
Re^2: Problems counting regex matches
by AnomalousMonk (Archbishop) on Jan 16, 2014 at 00:05 UTC
    ... this won't work for all cases though, if you have two matching words in the neighbourhood of the same (currency|foreign exchange), just one will be counted. For example in "Currency revenue sales growth", you'll just get "revenue" ...

    The following works for overlapping matches. It also needs 5.10+ because in addition to (?|pattern), it uses (*FAIL) from the Special Backtracking Control Verbs introduced in that version. The variation that only counts occurrences may be a little faster.

    Output:
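
The code and output that belonged to this reply are not reproduced here. As a stand-in, here is a minimal sketch of the technique it describes. It is not AnomalousMonk's original code, and @words and the sample text are assumptions made for illustration. A branch reset (?|...) puts the revenue/sales/growth word in $1 for both cases, a (?{ ... }) code block records it, and (*FAIL) then forces the engine to backtrack into every other way of matching, so overlapping pairings are visited too:

```perl
use strict;
use warnings;
use 5.010;

my $text = "Currency revenue sales growth";

my @words;

# The match as a whole never succeeds: each time the pattern reaches the
# code block, the word in $1 is recorded and (*FAIL) forces backtracking
# into the next alternative, repetition count, or start position.
$text =~ m{
    \b
    (?|
        (revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (?:currency|foreign\Wexchange)
      |
        (?:currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (revenues?|sales|growth)
    )
    \b
    (?{ push @words, $1 })
    (*FAIL)
}xmsi;

say "words: @words";
say "count: ", scalar @words;
```

With this sample text all three of "revenue", "sales" and "growth" are recorded. A word is recorded once per pairing, so a word within range of two different currency terms would appear twice. To count only, the code block could increment a counter instead of pushing onto an array.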
