
A first piece of advice: the /x modifier lets you add whitespace and comments to a long regex to make it easier to read. Second, instead of /regex/ you can write m<regex> (see perlop). You can also save a regex in a variable using qr:

use strict;
use warnings;
use Data::Dumper;

# /i goes on the qr itself: flags on the outer match are not
# applied inside an interpolated qr object.
my $regex = qr<
\b
(?:                                 # Non-capturing group

  ## Case 1: currency|foreign exchange comes second
  (revenues?|sales|growth)          # group 1
  \W+
  (?:\w+\W+){0,4}?                  # Non-capturing group
  (currency|foreign\Wexchange)      # group 2

  |

  ## Case 2: currency|foreign exchange comes first
  (currency|foreign\Wexchange)      # group 3
  \W+
  (?:\w+\W+){0,4}?                  # Non-capturing group
  (revenues?|sales|growth)          # group 4

)
\b
>xi;

my $text = <<END_OF_TEXT;
foreign exchange revenue
currency revenue
END_OF_TEXT

# List context with /g: returns every group of every match
print Dumper [ $text =~ /$regex/g ];

Now there are 4 groups, and you get four elements for every match. That is simply how a /g regex behaves (according to Regexp Quote-Like Operators in perlop):

In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

So your 8-element list is actually the 4 groups for the first match, followed by the 4 groups for the second match. In both matches the strings are in groups 3 and 4 (groups 1 and 2 are undef), because (currency|foreign exchange) comes first.
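To make that layout visible, here is a small sketch that cuts the flat list into one row of four groups per match. The variable names (@flat, @rows) are only illustrative, and /i is placed on the qr itself so that it actually applies to the pattern.

```perl
use strict;
use warnings;

my $regex = qr<
    \b
    (?:
        (revenues?|sales|growth)        # group 1
        \W+
        (?:\w+\W+){0,4}?
        (currency|foreign\Wexchange)    # group 2
      |
        (currency|foreign\Wexchange)    # group 3
        \W+
        (?:\w+\W+){0,4}?
        (revenues?|sales|growth)        # group 4
    )
    \b
>xi;

my $text = "foreign exchange revenue\ncurrency revenue\n";

# List context + /g: all groups of all matches, flattened into one list
my @flat = $text =~ /$regex/g;

# Take the groups back out four at a time: one row per match
my @rows;
while (my @groups = splice @flat, 0, 4) {
    push @rows, join ' | ', map { defined $_ ? $_ : 'undef' } @groups;
}

print "$_\n" for @rows;
# undef | undef | foreign exchange | revenue
# undef | undef | currency | revenue
```

Two rows means two matches, so dividing the list length by the number of groups also gives the count.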

If you just want to count the matches, without getting the word that matched, turn all the capturing parentheses (text) into non-capturing ones (?:text). Try my example as is first, and then again after modifying the parentheses. If you want to know which word matched, the most beginner-friendly way I can think of is to loop over the successive matches of the regex and read either group 1 or group 4, whichever is defined.
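The loop could look something like the sketch below. It assumes the same four-group pattern and sample text as the example; @keywords is just a name picked for illustration. In scalar context, /g resumes after the previous match on each iteration, so $1 to $4 always refer to the current match.

```perl
use strict;
use warnings;

my $regex = qr<
    \b
    (?:
        (revenues?|sales|growth)        # group 1
        \W+
        (?:\w+\W+){0,4}?
        (currency|foreign\Wexchange)    # group 2
      |
        (currency|foreign\Wexchange)    # group 3
        \W+
        (?:\w+\W+){0,4}?
        (revenues?|sales|growth)        # group 4
    )
    \b
>xi;

my $text = "foreign exchange revenue\ncurrency revenue\n";

my @keywords;
while ($text =~ /$regex/g) {
    # Only one alternative can match, so only one of $1 / $4 is defined
    push @keywords, defined $1 ? $1 : $4;
}

printf "%d matches: %s\n", scalar @keywords, join(', ', @keywords);
# 2 matches: revenue, revenue
```

The number of iterations is the number of matches, so the same loop gives you both the count and the words.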

Be aware that this won't work in all cases, though: if two matching words are in the neighbourhood of the same (currency|foreign exchange), only one will be counted. For example, in "Currency revenue sales growth" you'll only get "revenue", because the next match attempt starts after "revenue", and "Currency" has already been consumed by then.
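That limitation is easy to reproduce. The sketch below builds the same pattern from smaller qr pieces ($kw, $fx and $gap are names invented here, not part of the original example) and runs it on the problem sentence. Each piece carries its own /i, because flags on an enclosing regex are not applied inside an interpolated qr object.

```perl
use strict;
use warnings;

my $kw  = qr/revenues?|sales|growth/i;       # the words we are looking for
my $fx  = qr/currency|foreign\Wexchange/i;   # the currency terms
my $gap = qr/\W+(?:\w+\W+){0,4}?/;           # up to 4 words in between

# Same group numbering as the example: 1 and 4 hold the keyword
my $regex = qr/\b(?:($kw)$gap($fx)|($fx)$gap($kw))\b/;

my $text = "Currency revenue sales growth";

my @found;
while ($text =~ /$regex/g) {
    push @found, defined $1 ? $1 : $4;
}

print scalar(@found), " match: @found\n";
# 1 match: revenue
```

"sales" and "growth" are within four words of "Currency" too, but the single match has already used up "Currency revenue", so nothing is left for them to pair with.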

In reply to Re: Problems counting regex matches by Eily
in thread Problems counting regex matches by elcilorien
