Regular Expressions

by Anonymous Monk
on Jun 19, 2005 at 22:33 UTC (#468196)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to insert things directly below a body tag using Perl and a regex. The body tag could be either
<body>
or
<body leftmargin=0 bgcolor="red">
What would be a proper regex to match either of the above? I have tried
$newContent = "Test String"; $string =~ /(<body\s.*>)/$1$newContent/i;
Any suggestions?

Replies are listed 'Best First'.
Re: Regular Expressions
by tlm (Prior) on Jun 19, 2005 at 22:39 UTC

    The problem with what you have is that regexes are greedy by default, so that .* is going to eat up every character up to the last > on the line. To prevent this, you can use the ? modifier after the * quantifier:

    s/(<body\s.*?>)/$1$newContent/is
    Another problem with your original regex is that it would miss a tag that spanned more than one line. Fixing this problem is the purpose of the /s modifier above.
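
To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample `$html` string is an assumption made up for illustration, with the whole document on a single line; it is not the OP's actual input.

```perl
use strict;
use warnings;

# Hypothetical sample input: everything sits on one line.
my $html       = '<body bgcolor="red"><p>Hello</p>';
my $newContent = 'Test String';

# Greedy: .* runs to the LAST > on the line, so the text lands after </p>.
(my $greedy = $html) =~ s/(<body\s.*>)/$1$newContent/i;

# Non-greedy: .*? stops at the FIRST >, which is the end of the body tag.
(my $lazy = $html) =~ s/(<body\s.*?>)/$1$newContent/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

With this input the greedy version appends the new content at the very end of the line, while the non-greedy version places it directly after the opening body tag.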

    But parsing HTML with regexes is unwise. Try something like HTML::Parser.

    the lowliest monk

      Just for the record, tlm added two 's' items... the first, which was lacking in OP's code, means "substitution." It's not optional.
      Thank you!
Re: Regular Expressions
by davidrw (Prior) on Jun 19, 2005 at 22:49 UTC
    Your $string =~ has an incomplete s/// ... As for the regex, tlm's is good, making the .* non-greedy (otherwise you'll suck everything through to the </html>), and adding the /s modifier to treat the entire multiline string as a single string (it makes . match anything, including newlines). perldoc perlre will explain those in further detail. So something like:
    $string =~ s/(<body\s.*?>)/$1$newContent/is;
    But yeah, if this isn't just a one-time quick & dirty thing, or if you need this kind of thing for other tags, definitely go with tlm's suggestion of HTML::Parser (the example of finding the <title> tag might be a good starting point, depending on what your needs are) or similar.
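
As a small illustration of what the /s modifier changes, the sketch below runs the same substitution with and without it. The multi-line `$html` string is a made-up example, not the OP's data.

```perl
use strict;
use warnings;

# Hypothetical input: the body tag is split across two lines.
my $html       = "<html>\n<body leftmargin=0\n  bgcolor=\"red\">\n<p>Hi</p>\n</html>\n";
my $newContent = 'Test String';

# Without /s, . cannot cross the newline inside the tag, so nothing matches
# and the string is left unchanged.
(my $without_s = $html) =~ s/(<body\s.*?>)/$1$newContent/i;

# With /s, . also matches newlines, so the whole tag is found.
(my $with_s = $html) =~ s/(<body\s.*?>)/$1$newContent/is;

print "without /s changed the string: ", ($without_s eq $html ? "no" : "yes"), "\n";
print "with /s:\n$with_s";
```

Only the /s version inserts the new content here, because the closing > of the tag sits on a later line than <body.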
Re: Regular Expressions
by Adrade (Pilgrim) on Jun 20, 2005 at 04:32 UTC
    If I understand you correctly, you want to match a body tag whether or not it comes with modifiers.

    You want to strip the \s from your regex, because with it there the pattern won't match '<body>', but will match '<body >' and '<body one=fish two=fish>'. As some folks mentioned, you need the s before your first delimiter (/) to indicate that you're substituting one thing for another, and the s after your last delimiter so that . also matches newlines and the match can span lines. I think some folks responding forgot to remove the \s from their regexes, or I got the question wrong. The i, of course, indicates case insensitivity. What we end up with is:
      s/(<body.*?>)/$1$OtherStuff/si;
    I would prefer, for the sake of style, to use the following instead - indicating that the characters before the > should be anything except >.
      s/(<body[^>]*>)/$1$OtherStuff/si;
    I also like "pushing" stuff around, so the version below also captures the rest of the string and puts it back after the new content. It spells out that only the first instance of body is matched, though as far as I know it won't make a difference.
      s/(<body[^>]*>)(.*)$/$1$OtherStuff$2/si;
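
A quick way to check the point about \s is to run both patterns against the three tag shapes mentioned above. This is only a sketch; the surrounding `<p>text</p>` markup and the value of `$OtherStuff` are invented for the example.

```perl
use strict;
use warnings;

my $OtherStuff = '<!-- inserted -->';   # hypothetical content to insert

my @tags = ('<body>', '<body >', '<body one=fish two=fish>');
my (@matched_with_space, @outputs);

for my $tag (@tags) {
    my $doc = "$tag<p>text</p>";

    # The pattern that still requires whitespace after "<body".
    push @matched_with_space, ($doc =~ /<body\s.*?>/si ? 1 : 0);

    # The negated-class pattern: no \s, and [^>]* cannot run past the tag's >.
    (my $out = $doc) =~ s/(<body[^>]*>)/$1$OtherStuff/si;
    push @outputs, $out;

    printf "%-26s \\s pattern matches: %-3s result: %s\n",
        $tag, ($matched_with_space[-1] ? 'yes' : 'no'), $out;
}
```

The \s pattern skips the bare <body>, while the negated-class pattern handles all three tags and inserts the content right after each one.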
    Hope it helps!
      -Adam

    --
    Impossible! The Remonster can only be killed by stabbing him in the heart with the ancient bone saber of Zumakalis!

      You are using a greedy regex there, and s/(<body[^>]*>)/$1$blah/gi; will probably grab more than you want.

      s/(<body[^>]*?>)/$1$blah/si;

      This would be sufficient, and with the s modifier, it will account for <body> spanning multiple lines. I didn't use g, as I doubt you want to match <body> on a global scale.

      Peace

        I don't understand why this would grab more than what is wanted. It seems to me that the [^>]* will grab everything that isn't a '>', so it'll grab stuff until we finally get to the first instance of '>'. Even though it is greedy, I don't think it would grab past the '>' of the '<body ... >' tag. Care to expand?

        $ perl -e 'my $str = "<body something=\"yep\"><a href=\"..\">"; $str =~ s/<body[^>]*>/<body>/; print "$str\n"'
        <body><a href="..">
        $

            -Bryan
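
The same check can be written as a short script that compares the greedy and non-greedy forms of the negated class directly. It is a sketch using the sample string from the one-liner above.

```perl
use strict;
use warnings;

my $str = '<body something="yep"><a href="..">';

# [^>] can never consume a >, so both quantifiers have to stop
# at the first > after "<body".
my ($greedy) = $str =~ /(<body[^>]*>)/;
my ($lazy)   = $str =~ /(<body[^>]*?>)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "same match\n" if $greedy eq $lazy;
```

Both forms capture only the body tag. One limitation worth keeping in mind: a literal > inside an attribute value would end the match early, which is one more reason the HTML::Parser advice earlier in the thread is sound for anything beyond a quick fix.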

Re: Regular Expressions
by apt_get (Acolyte) on Jun 19, 2005 at 22:45 UTC
    Here's one way
    $string = "<body leftmargin=0 bgcolor=\"red\">";
    $newContent = "Test String";
    $string =~ s/(<body(.*)>)/$1$newContent/g;
    HTH
      See what I just posted below. That won't work if the string is something like this (and presumably it is, otherwise it would just be $string .= $newContent;):
      my $string =<<EOS;
      <body leftmargin=0
        bgcolor="red"
      >
      blah and stuff
      </body>
      EOS
        Yes, I understand the limitations. Will be more careful in the future. Apologies to the OP, and thanks to the monks...
        OT: excellent regex point, but OP's HTML-fu could stand some help: leftmargin is/was IE-specific (some other browsers honor it; some don't) and the argument '0' should be double quoted.

        And re the specific example, the space between "red" and > is also naughty... (I can see how the string could, in the real world, include '\n's, but I'm skeptical that spreading it out over multiple lines enhances readability.)

        UPDATE: ++ davidrw, below, re use of JS or TT! Again proving the adage that one should put brain in gear before opening mouth. Thanks!
Re: Regular Expressions
by ank (Scribe) on Jun 20, 2005 at 08:22 UTC

Node Type: perlquestion [id://468196]
Approved by davidrw