Understanding this particular Regex.

by tty1x (Novice)
on May 05, 2013 at 06:58 UTC
tty1x has asked for the wisdom of the Perl Monks concerning the following question:

<HR +SIZE *= *[0-9]+ *> is meant to match the <HR> HTML tag.

What confuses me is why the above pattern matches <hr size=4 > when the regex for the number is [0-9]+.

Doesn't [0-9]+ mean any digit from 0-9 for the 1st digit, and any digit from 0-9 for the 2nd digit? Why does the + match a non-number, as in the case when I enter a single digit such as 4?

Thanks :)

Re: Understanding this particular Regex.
by NetWallah (Canon) on May 05, 2013 at 07:13 UTC
    The "+" matches "the previous item" one or more times.

    In this case, "the previous item" is a digit [0-9].

    So, at least ONE digit is required, but more digits, if present, are acceptable and will match.

    In your example of a solitary "4", the "+" does NOT match anything beyond that digit.

    The subsequent " *" matches zero or more trailing spaces. See "Quantifiers" in perldoc perlre.
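
    A small Perl sketch of the point above, for anyone who wants to try it. Two assumptions that are not in the original posts: the /i flag is added so that the upper-case HR and SIZE in the pattern can match a lower-case tag, and the digits are captured with parentheses so they can be printed. The sub name size_digits is made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the digits consumed by [0-9]+,
    # or undef if the tag does not match the pattern at all.
    # /i is an assumption so that "hr size" matches "HR +SIZE".
    sub size_digits {
        my ($tag) = @_;
        return $tag =~ /<HR +SIZE *= *([0-9]+) *>/i ? $1 : undef;
    }

    for my $tag ('<hr size=4 >', '<hr size=42>', '<hr size= >') {
        my $digits = size_digits($tag);
        printf "%-14s %s\n", $tag,
            defined $digits ? "matches, digits: $digits" : 'no match';
    }
    ```

    The last tag has no digit after the equals sign, so [0-9]+ has nothing to match and the whole pattern fails. With a single "4", the + is satisfied by that one digit and the trailing " *" takes the space before ">".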

                 "I'm fairly sure if they took porn off the Internet, there'd only be one website left, and it'd be called 'Bring Back the Porn!'"
            -- Dr. Cox, Scrubs

      Ahh, thanks a lot for the explanation. Guess my understanding of the + quantifier isn't that great.
Re: Understanding this particular Regex.
by ww (Archbishop) on May 05, 2013 at 10:47 UTC

    "<HR +SIZE *= *[0-9]+ *>" is readily understood as a regex only if one also assumes the use of alternate delimiters; for example

    if ( $somevar =~ m<HR +SIZE *= *[0-9]+ *> ) { ... }    # do something

    One can argue that that's implicit in your post. I suspect NetWallah would so argue and certainly offered an accurate response based on that interpretation.
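
    To make the delimiter point concrete, here is a minimal sketch (the variable names are illustrative, not from the thread). When < and > serve as the delimiters of m<...>, they are not part of the pattern; switching to another pair such as m{...} makes them literal characters again.

    ```perl
    use strict;
    use warnings;

    my $tag  = '<HR SIZE=4>';
    my $bare = 'HR SIZE=4';

    # Here < and > are the match delimiters, so the pattern is only
    # "HR +SIZE *= *[0-9]+ *" and matches text with no angle brackets.
    my $delims_only = $bare =~ m<HR +SIZE *= *[0-9]+ *> ? 1 : 0;

    # With braces as delimiters the angle brackets belong to the pattern.
    my $literal     = $tag  =~ m{<HR +SIZE *= *[0-9]+ *>} ? 1 : 0;
    my $no_brackets = $bare =~ m{<HR +SIZE *= *[0-9]+ *>} ? 1 : 0;

    print "m<...> against bare text: $delims_only\n";
    print "m{...} against the tag:   $literal\n";
    print "m{...} against bare text: $no_brackets\n";
    ```

    Only the third test fails, because the bare text has no literal < or > for the brace-delimited pattern to find.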

    OTOH, one can certainly also argue that the OP is almost mischievously ambiguous (or simply wrong). For example the spaces allocated -- if they exist -- and the lack of quotes around the size value -- would make the target-tag for your so-called regex non-conformant with HTML 4.01 or any more recent spec... a fact that's not conclusive but which might lead those familiar with the W3C specs to scratch their heads and wonder what you're talking about, before moving on to other questions.

    My point? Careful attention to precision and specificity can be (and usually is) very helpful to those who try to help those who are SOPW.

    If you didn't program your executable by toggling in binary, it wasn't really programming!

      Nonsense. Go along to the W3C HTML validator, select "Direct Input" and paste in the following markup:

      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
      <title>Foo</title>
      <hr size = 1 >

      ... and you'll find it's fully conformant HTML 4.01. Spaces around the equals sign are rarely used, but valid. And attribute values conforming to the following regexp do not need to be quoted:


      And HTML5 is even more permissive.

      package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
        Not "nonsense."

        Last time I chased this kinda' question thru the specs themselves, the validator came up short of fully satisfying the w3c 4.01 transitional spec and even farther short of the strict spec.

        The validator, for example, blesses your code ("validates") without error (albeit, with warnings) despite the lack of <head>...</head>, <body>...</body> and </html> tags... and that's using the transitional spec which allows no such things.

        If you try it with strict, upload mode, and add:

        <table width = 17%>

        you'll see even the validator lets fly:

        If this error occurred in a script section of your document, you should probably read this FAQ entry.

        Error Line 9, Column 18: an attribute value must be a literal unless it contains only name characters

        <table width = 17%>

        You have used a character that is not considered a "name character" in an attribute value. Which characters are considered "name characters" varies between the different document types, but a good rule of thumb is that unless the value contains only lower or upper case letters in the range a-z you must put quotation marks around the value. In fact, unless you have extreme file size requirements it is a very very good idea to always put quote marks around your attribute values. It is never wrong to do so, and very often it is absolutely necessary.

        Your regex and the accompanying statement are correct, as far as they go, but are most closely applicable to webmonkeys (yeah, been there; done that.) writing for NS or IE4 style browsers. Today, however, you'll find widths (for example and where used) expressed as ems, ens (no problem as long as you don't introduce spaces) or as percentages... as in the example above. The "%" sign is an example of a warstopper.
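
        The disagreement over which values need quotes can be checked mechanically. A rough sketch, assuming the HTML 4.01 wording (section 3.2.2) that an unquoted attribute value may contain only letters, digits, hyphens, periods, underscores and colons; the sub name may_be_unquoted is hypothetical, and this is no substitute for a real validator.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: true if $value uses only the characters that
        # HTML 4.01 permits in an unquoted attribute value (assumed rule:
        # letters, digits, hyphens, periods, underscores, colons).
        sub may_be_unquoted {
            my ($value) = @_;
            return $value =~ /\A[A-Za-z0-9._:-]+\z/ ? 1 : 0;
        }

        for my $value ('1', '4', '10em', '17%', '1 0') {
            printf "%-5s %s\n", $value,
                may_be_unquoted($value) ? 'may be unquoted' : 'needs quotes';
        }
        ```

        Under that assumption a bare size such as 1 or 4 passes, as does 10em, while 17% and anything containing a space need quotes, which is consistent with both the <hr size = 1 > example and the validator error quoted above.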

        A little knowledge is a dangerous thing; categorical statements based on incomplete knowledge are apt to be even more so.
