PerlMonks  

Re: Regex For HTML Image Tags?

by alfie (Pilgrim)
on Mar 27, 2001 at 11:51 UTC


in reply to Regex For HTML Image Tags?

You need to tweak your regular expression a little bit. Let me start with this fast diddle:
$html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/[image: "$1"]/sgi;
Let me explain what I did: I changed your .*? to [^>]+? because you only want to match characters that are not the closing delimiter of the <img> tag. I also think + is more sensible than * here: there is no need to match an empty tag, and there must be at least one whitespace character in between :-)

Secondly, why did you escape the brackets in the alt part and at the end? That doesn't really make sense, since you want their special meaning at that point. There is also no need to escape them in the replacement string, where they have no special meaning.

You also need to put brackets followed by a ? around the alt part because, as you already noticed, it otherwise wouldn't match tags without an alt attribute. I used (?: so the group won't get stored.

This will produce the following (input first, then output):

<img foo><img alt="bar">
[image: ""][image: "bar"]
If you want just a plain [image] in the replacement when no alt attribute is present, I guess that wouldn't be possible with a single substitution, but you can still run the following substitution afterwards:
$html =~ s/\[image: ""\]/[image]/g;
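For anyone who wants to try the two steps together, here is a minimal sketch. The sub name replace_img_tags is made up for illustration; the two substitutions are the ones above. Note that an explicitly empty alt="" also ends up as a plain [image] with this approach.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the original post) that
# applies the two substitutions one after the other.
sub replace_img_tags {
    my ($html) = @_;
    no warnings 'uninitialized';   # $1 is undef when the alt part did not match

    # Step 1: every <img ...> tag becomes [image: "alt text"].
    $html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/[image: "$1"]/sgi;

    # Step 2: collapse the empty form into a plain [image].
    $html =~ s/\[image: ""\]/[image]/g;

    return $html;
}

print replace_img_tags('<img foo><img alt="bar">'), "\n";
# prints: [image][image: "bar"]
```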
HTH & HAND!
--
Alfie

Replies are listed 'Best First'.
Re: Re: Regex For HTML Image Tags?
by zodiac (Beadle) on Mar 27, 2001 at 12:45 UTC
    it is possible in one regular expression though:
    $html=~s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[image".((defined $1)?": \"$1\"":"")."]"/sgei;
    short explanation:
    the match starts with "<img" followed by anything that is not the end of the tag (non-greedy, or it would also swallow the ALT part), then the ALT part, which is optional: "(?: )?". This should be self-explanatory (with basic perl knowledge)

    we then substitute with an expression (the /e modifier)

    long explanation:
    I am too lazy to write this.
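In place of the long explanation, a small runnable sketch of the same single-pass idea may help. It assumes nothing beyond core Perl; the sub name img_to_text and the use of s{}{} delimiters are my own choices for readability, not part of the post above. With /e the replacement is evaluated as Perl code, so it can test whether $1 was set.

```perl
use strict;
use warnings;

# Hypothetical helper: one substitution, with the replacement computed
# by an expression (the /e modifier).
sub img_to_text {
    my ($html) = @_;
    $html =~ s{<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>}
              { defined $1 ? '[image: "' . $1 . '"]' : '[image]' }sgei;
    return $html;
}

print img_to_text('<img foo><img alt="bar">'), "\n";
# prints: [image][image: "bar"]
```

Unlike the two-step version, an explicitly empty alt="" stays as [image: ""] here, because $1 is then defined but empty.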

      Let me see.

      You would match that text inside of attribute values for other tags.

      You fail to consider that the closing > can appear in the values of other attributes for the IMG tag. There are quite a few which could have it.

      The alt attribute may be quoted with "", '', or nothing at all. You only deal with one of these cases.

      There is optional whitespace between ALT and =, and between = and the value. Not accounted for.

      In my experience the odds of your being bitten are highest for the different delimiter, then for munging up text that appeared in quoted delimiters. The others are possible but unlikely.

      If you know your data, then an RE is OK. I have certainly done that. But if you don't, then an RE hack will break sooner or later...
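To make those objections concrete, here is a rough sketch that feeds a few awkward inputs to the single-pass substitution from earlier in the thread. The sub name img_to_text and the sample inputs are made up for illustration.

```perl
use strict;
use warnings;

# The single-pass substitution from earlier in the thread, wrapped in a
# hypothetical helper so it can be run against several inputs.
sub img_to_text {
    my ($html) = @_;
    $html =~ s{<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>}
              { defined $1 ? '[image: "' . $1 . '"]' : '[image]' }sgei;
    return $html;
}

my @cases = (
    q{<a title='<img alt="x">'>link</a>},   # text inside another tag's attribute
    q{<img src="a>b.gif" alt="bar">},       # > inside an attribute value
    q{<img alt='bar'>},                     # single-quoted value
    q{<img alt=bar>},                       # unquoted value
    q{<img alt = "bar">},                   # whitespace around =
);

printf "%-36s => %s\n", $_, img_to_text($_) for @cases;
```

The first case rewrites text that is not a tag at all, the second cuts the tag off at the wrong > and leaves debris behind, and the last three silently lose the alt text.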

      I had been munging on a regex as well (just an exercise), and I think the extended regex clarifies a bit:
      local $/;       # slurp mode, so <DATA> returns every line, not just the first
      $html = <DATA>;
      $html =~ s/<IMG \s+                          # match the IMG tag
                 SRC \s* = \s* "[^"]+" \s*         # match the source
                 (ALT \s* = \s* "([^"]+)" \s*)?    # match an optional alt
                 >                                 # end of tag
                /'[image' . ($2 ? ": $2" : '') . ']'   # build the replacement text
                /sgixe;
      print $html;
      __DATA__
      <IMG SRC="foo"><BR>
      bar bar bar<BR>
      <IMG SRC="foo" alt="bar">
      This works, but keep in mind that the IMG tag is still valid if, for example, the SRC and the ALT appear in reverse order.

      That's why HTML::TokeParser (as Desdinova pointed out already) or maybe even (if the HTML is yours) Template Toolkit are better approaches.
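For comparison, a minimal sketch of the parser-based route. It assumes HTML::TokeParser (part of the HTML-Parser distribution on CPAN, not core Perl) is installed; the sub name img_to_text is made up for illustration. Attribute order, quoting style and a > inside a quoted value are all left to the parser.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # assumption: installed from CPAN (HTML-Parser distribution)

# Hypothetical helper: copy the document through token by token and
# swap each <img> start tag for a text placeholder.
sub img_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my $out = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] eq 'img') {
            # Start tag: [ 'S', $tag, \%attr, \@attrseq, $text ]
            my $alt = $token->[2]{alt};
            $out .= (defined $alt && length $alt)
                  ? qq{[image: "$alt"]}
                  : '[image]';
        }
        elsif ($type eq 'T') {
            $out .= $token->[1];    # text token: [ 'T', $text, $is_data ]
        }
        else {
            $out .= $token->[-1];   # other tokens keep their source text last
        }
    }
    return $out;
}

print img_to_text(q{<IMG SRC="foo"><BR> bar <img alt='bar' src="a>b.gif">}), "\n";
```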

      Cheers,

      Jeroen
      "We are not alone"(FZ)

      I knew about that (somewhere, deep hidden in my memories) - but couldn't find it quickly in the manual pages. Strangely it's the first modifier described in the perlop section *hmm*
      Thanks for pointing it out, I simply haven't found it :)
      --
      Alfie
