Re: Regular Expressions
by tlm (Prior) on Jun 19, 2005 at 22:39 UTC
|
The problem with what you have is that regexes are greedy by default, so that .* is going to eat up every character up to the last > on the line. To prevent this, you can use the ? modifier after the * quantifier:
s/(<body\s.*?>)/$1$newContent/is
Another problem with your original regex is that it would miss a tag that spanned more than one line. Fixing this problem is the purpose of the /s modifier above.
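To see the greedy/non-greedy difference for yourself, here is a minimal sketch. The sample string and the variable names are made up for illustration; they are not the OP's actual data.

```perl
use strict;
use warnings;

# Hypothetical sample input: a body tag followed by more tags on the same line.
my $html       = '<body bgcolor="red"><p>hello</p>';
my $newContent = '[NEW]';

# Greedy: .* runs to the LAST > in the string, so the new content ends up at the very end.
(my $greedy = $html) =~ s/(<body\s.*>)/$1$newContent/i;

# Non-greedy: .*? stops at the FIRST >, which is the end of the body tag.
(my $lazy = $html) =~ s/(<body\s.*?>)/$1$newContent/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version should leave the new content after the closing </p>, while the non-greedy one should put it right after the body tag.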
But parsing HTML with regexes is unwise. Try something like HTML::Parser.
| [reply] [d/l] |
|
Just for the record, tlm added two 's' items... the first, which was lacking in OP's code, means "substitution." It's not optional.
| [reply] |
Re: Regular Expressions
by davidrw (Prior) on Jun 19, 2005 at 22:49 UTC
|
Your $string =~ has an incomplete s/// ... As for the regex, tlm's is good: it makes the .* non-greedy (otherwise you'll suck everything through to the </html>) and adds the /s modifier to treat the entire multiline string as a single string (it makes . match anything, including newlines). perldoc perlre will explain those in further detail. So something like:
$string =~ s/(<body\s.*?>)/$1$newContent/is;
But yeah, if this isn't just a one-time quick & dirty thing, or if you need this kind of thing for other tags, definitely go with tlm's suggestion of HTML::Parser (the example of finding the <title> tag might be a good starting point, depending on what your needs are) or similar.
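For what it's worth, here is a rough sketch of what the HTML::Parser route might look like. It assumes HTML::Parser is installed from CPAN; the sample document, $new_content, and the handler logic are my own illustration, not anything from the OP's code.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical input with a body tag that spans two lines.
my $html = <<'EOS';
<html>
<body leftmargin=0
bgcolor="red" >
blah and stuff
</body>
</html>
EOS

my $new_content = 'Test String';    # illustrative replacement text
my $out         = '';
my $inserted    = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    # Start tags: copy the original tag text, then add the new content
    # once, right after the first <body ...> tag.
    start_h => [
        sub {
            my ($tagname, $text) = @_;
            $out .= $text;
            if ($tagname eq 'body' && !$inserted) {
                $out .= $new_content;
                $inserted = 1;
            }
        },
        'tagname, text',
    ],
    # Everything else (text, end tags, comments, ...) is copied through unchanged.
    default_h => [ sub { $out .= defined $_[0] ? $_[0] : '' }, 'text' ],
);

$p->parse($html);
$p->eof;    # flush anything the parser is still buffering

print $out;
```

The parser decides where the tag ends, so line breaks and quoted > characters inside attributes are no longer the regex's problem.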
Re: Regular Expressions
by Adrade (Pilgrim) on Jun 20, 2005 at 04:32 UTC
|
If I understand you correctly, you want to match a body tag whether or not it comes with attributes.
You want to strip the \s from within your regex, because with it there, this won't match '<body>', but will match '<body >' and '<body one=fish two=fish>'. As some folks mentioned, you need the s before your first delimiter (/) to indicate that you're substituting one thing for another, and the s after your last delimiter to indicate that your match should be viewed as a single line (and shouldn't stop at a newline). I think some folks responding forgot to remove the \s from within their regexes, or I got the question wrong. The i, of course, indicates case insensitivity. What we end up with is:
s/(<body.*?>)/$1$OtherStuff/si;
I would prefer, for the sake of style, to use the following instead - indicating that the characters before the > should be anything except >.
s/(<body[^>]*>)/$1$OtherStuff/si;
I also like "pushing" stuff around, to make it explicit that the first instance of body should be matched, though as far as I know it won't make a difference.
s/(<body[^>]*>)(.*)$/$1$OtherStuff$2/si;
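A small sketch comparing the \s version with the negated character class on a few sample tags. The tag strings and hash names are just for illustration.

```perl
use strict;
use warnings;

# Sample tags: bare, with a trailing space, and with attributes split across lines.
my %tags = (
    bare      => '<body>',
    space     => '<body >',
    multiline => "<body one=fish\ntwo=fish>",
);

my (%with_space, %negated);
for my $name (sort keys %tags) {
    # Requires whitespace right after "body", so the bare tag is missed.
    $with_space{$name} = $tags{$name} =~ /<body\s.*?>/si ? 1 : 0;

    # [^>]* allows zero characters, so the bare tag matches too.
    $negated{$name} = $tags{$name} =~ /<body[^>]*>/i ? 1 : 0;

    printf "%-9s  \\s version: %d  [^>]* version: %d\n",
        $name, $with_space{$name}, $negated{$name};
}
```

Note that the negated class is tested without /s here: [^>] matches a newline like any other character that isn't >, so the /s modifier only matters for patterns that use the dot.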
Hope it helps! -Adam
-- Impossible! The Remonster can only be killed by stabbing him in the heart with the ancient bone saber of Zumakalis!
| [reply] [d/l] [select] |
|
You are using a greedy regex there, and s/(<body[^>]*>)/$1$blah/gi; will probably grab more than you want.
s/(<body[^>]*?>)/$1$blah/si;
This would be sufficient, and with the s modifier, it will account for <body> spanning multiple lines. I didn't use g, as I doubt you want to match <body> on a global scale.
Peace
| [reply] [d/l] [select] |
|
I don't understand why this would grab more than what is wanted. It seems to me that the [^>]* will grab everything that isn't a '>', so it'll grab stuff until we finally get to the first instance of '>'. Even though it is greedy, I don't think it would grab past the '>' of the '<body ... >' tag. Care to expand?
$ perl -e 'my $str = "<body something=\"yep\"><a href=\"..\">"; $str =~ s/<body[^>]*>/<body>/; print "$str\n"'
<body><a href="..">
$
-Bryan | [reply] [d/l] [select] |
Re: Regular Expressions
by apt_get (Acolyte) on Jun 19, 2005 at 22:45 UTC
|
$string="<body leftmargin=0 bgcolor=\"red\">";
$newContent = "Test String";
$string =~ s/(<body(.*)>)/$1$newContent/g;
HTH | [reply] [d/l] |
|
See what I just posted below. That won't work if the string is something like this (and presumably it is, otherwise it would just be $string .= $newContent;):
my $string =<<EOS;
<body leftmargin=0
bgcolor="red" >
blah and stuff
</body>
EOS
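A minimal sketch of the failure on that here-doc, followed by the /s fix. The variable names $without_s and $with_s are mine, used only for the comparison.

```perl
use strict;
use warnings;

my $string = <<'EOS';
<body leftmargin=0
bgcolor="red" >
blah and stuff
</body>
EOS
my $newContent = "Test String";

# Without /s, . cannot cross the newline inside the tag, so nothing matches
# and the string comes back unchanged.
(my $without_s = $string) =~ s/(<body(.*)>)/$1$newContent/g;

# With /s and a non-greedy quantifier, the match runs across the newline
# and ends at the tag's own closing >.
(my $with_s = $string) =~ s/(<body\s.*?>)/$1$newContent/is;

print $without_s eq $string ? "no /s: unchanged\n" : "no /s: changed\n";
print $with_s;
```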
| [reply] [d/l] [select] |
|
Yes, I understand the limitations. Will be more careful in the future. Apologies to the OP, and thanks to the monks...
| [reply] |
|
OT: excellent regex point, but OP's HTML fu could stand some help: leftmargin is/was IE-specific (some other browsers honor it; some don't) and the argument '0' should be double-quoted. And re the specific example, the space between "red" and > is also naughty... (I can see how the string could, in the real world, include '\n's, but I'm skeptical that spreading it out over multiple lines enhances readability.)
UPDATE: ++ davidrw, below, re use of JS or TT! Again proving the adage that one should put brain in gear before opening mouth. Thanks!
| [reply] |
|
Re: Regular Expressions
by ank (Scribe) on Jun 20, 2005 at 08:22 UTC
|
| [reply] |