Regex question

by cajun (Chaplain)
on Aug 08, 2005 at 06:34 UTC ( #481801=perlquestion )
cajun has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on some custom rules for SpamAssassin and I've come across a question I really don't quite understand and have not yet been able to find the answer to.
    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>){
        print "Match on line $. with []\n" if (/[htm|asp]/);
        print "Match on line $. with ()\n" if (/(htm|asp)/);
    }
    __DATA__
    optout.htm
    optout.asp
    optout.php

    OUTPUT:
    Match on line 1 with []
    Match on line 1 with ()
    Match on line 2 with []
    Match on line 2 with ()
    Match on line 3 with []
What I'm not clear on is the difference between the actions of the [ ] and ( ). From the actions of my test script, I believe the (htm|asp) is an 'or' condition (htm or asp). I'm not clear on why the [htm|asp] is matching on the third line of data.

Links to more info appreciated.


Update: Thanks davido and Enlil. Perfect explanations. Thanks for the links for more info too.

Replies are listed 'Best First'.
Re: Regex question
by davido (Archbishop) on Aug 08, 2005 at 06:40 UTC

    (htm|asp) means match completely either 'htm' or 'asp'. [htm|asp] means match any of the following characters: 'h', 't', 'm', '|', 'a', 's', 'p'. There is no '|' alternation operator in a [...] character class. This is discussed in better detail in perlrequick (from the core Perl documentation).

    In your case, the third line matches via the character class because "optout" contains characters from the class ('p' and 't'); the engine stops at the first one it finds, the 'p'.
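
To make the difference visible, here is a minimal sketch that captures what each pattern actually matched for the three file names from the question. The variable names (`%class_hit`, `%group_hit`, `$pipe_in_class`, `$pipe_in_group`) are only illustrative and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# File names taken from the question's __DATA__ section.
my @names = ('optout.htm', 'optout.asp', 'optout.php');

my (%class_hit, %group_hit);
for my $name (@names) {
    # Character class: captures the first single character found in the set.
    my ($char) = $name =~ /([htm|asp])/;
    # Alternation: captures the whole string 'htm' or 'asp', if present.
    my ($word) = $name =~ /(htm|asp)/;

    $class_hit{$name} = $char;
    $group_hit{$name} = $word;

    printf "%-11s class: %-9s alternation: %s\n",
        $name,
        defined $char ? "'$char'" : 'no match',
        defined $word ? "'$word'" : 'no match';
}

# Inside [...] the '|' is just another literal character.
my $pipe_in_class = '|' =~ /[htm|asp]/ ? 1 : 0;
my $pipe_in_group = '|' =~ /(htm|asp)/ ? 1 : 0;
print "'|' matches the class: $pipe_in_class, the alternation: $pipe_in_group\n";
```

Because the regex engine reports the leftmost match, the character class succeeds on the 'p' in "optout" for every line, including optout.php, while the alternation finds nothing there.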


Re: Regex question
by Enlil (Parson) on Aug 08, 2005 at 06:44 UTC
    The difference is that [] denotes a character class, which means that any one of the characters within the [] will constitute a match at this position. So in your code, with [htm|asp], any line that contains any of the following characters will match: h t m | a s p.

    On the other hand, with parentheses you are asking the regex engine to match anything containing htm OR asp, and hence optout.php doesn't match.

    You might want to have a look at perlretut and perlre for more info.


Re: Regex question
by sk (Curate) on Aug 08, 2005 at 06:46 UTC
    [] - matches any character specified inside

    () - used to store the value of the match, and very useful in substitutions, where you can refer to the matched text using $1, $2, etc. The capturing is not very useful the way you have it in your code, i.e. when you are only checking whether a pattern exists.

    In your example the last line matched on [] because it contains an 'h' and a 'p', both of which are in the class.

    It did not match for () because that pattern was looking for 'htm' or 'asp' as whole strings.
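
As a rough illustration of the capturing behaviour described above, the sketch below uses `$1` in a substitution and then a non-capturing group `(?:...)` for a plain yes/no test. The `.bak` suffix and the variable names are assumptions made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = 'optout.htm';

# Capturing group: the matched extension is available as $1 in the replacement.
# The '.bak' suffix is purely illustrative.
(my $renamed = $name) =~ s/\.(htm|asp)$/.$1.bak/;
print "$name -> $renamed\n";

# Non-capturing group: (?:...) groups the alternation without setting $1,
# which is enough when only a true/false answer is needed.
my $is_page = $name =~ /(?:htm|asp)/ ? 1 : 0;
print "contains htm or asp: $is_page\n";
```

For a simple existence check the parentheses can be dropped altogether, since /htm|asp/ on its own is already an alternation between the two strings.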

Re: Regex question
by GrandFather (Sage) on Aug 08, 2005 at 06:46 UTC

    See perlretut for a start

    [] matches any single character from the set of characters within the []. () groups and captures matches. | allows matching of the expression to the left or right of the | (or). [htm|asp] matches any one of the characters a, h, m, p, s, t or |, because | is an ordinary character inside []. (htm|asp) matches either htm or asp.

    Perl is Huffman encoded by design.
Re: Regex question
by davidrw (Prior) on Aug 08, 2005 at 12:41 UTC
    Looks like the character class [] vs. grouping () question has been answered well already. That aside, I just wanted to recommend that you consider tightening the pattern further, to perhaps something like /\.(htm|asp)$/, to avoid presumably false hits on filenames like 'teaspoon.txt', 'grasp' and 'nightmare.txt'. Maybe use the /i modifier as well.
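
A small sketch of that suggestion, assuming the goal is to accept only names that end in .htm or .asp regardless of case. The helper name `has_page_extension` is hypothetical, and the group is written as non-capturing since nothing here uses `$1`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the name ends in .htm or .asp
# (case-insensitive), 0 otherwise.
sub has_page_extension {
    my ($name) = @_;
    return $name =~ /\.(?:htm|asp)$/i ? 1 : 0;
}

# The three names from the question plus the false-hit examples above.
my @candidates = (
    'optout.htm', 'OPTOUT.ASP', 'optout.php',
    'teaspoon.txt', 'grasp', 'nightmare.txt',
);

for my $name (@candidates) {
    printf "%-14s %s\n", $name,
        has_page_extension($name) ? 'match' : 'no match';
}
```

The leading \. and trailing $ tie the alternation to the end of the name, so 'teaspoon.txt', 'grasp' and 'nightmare.txt' no longer match even though they contain "asp" or "htm" somewhere inside.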

Node Type: perlquestion [id://481801]
Approved by Corion
Front-paged by bofh_of_oz