PerlMonks  

Re^2: RegEx for incorrectly closed HTML attribute?

by LAI (Hermit)
on Nov 29, 2002 at 17:55 UTC ( [id://216571]=note )


in reply to Re: RegEx for incorrectly closed HTML attribute?
in thread RegEx for incorrectly closed HTML attribute?

You might want to try for something more like:

m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"))*\s*>.*?</a\s*>!i

Update: at Abigail-II's suggestion, here's a modified version of the above, which accepts tags like <a href= foo_link>link text</a>. Of course, comments and criticism are always welcome.

m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"|[a-z0-9\-\._:]+))*\s*>.*?</a\s*>!i
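For anyone who finds that hard to read, here is a sketch of the same updated pattern spelled out with the /x flag so each piece can carry a comment. The names $valid_anchor and is_valid_anchor are made up for illustration, and the capturing groups have been made non-capturing, which should not change what matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical name; the updated pattern from above, written with /x.
my $valid_anchor = qr{
    <a                          # opening tag name
    (?:                         # zero or more attributes:
        \s+ \w+                 #   attribute name
        \s* = \s*               #   equals sign, optional whitespace
        (?: '[^']*'             #   single-quoted value
          | "[^"]*"             #   double-quoted value
          | [a-z0-9\-._:]+      #   unquoted value
        )
    )*
    \s* >                       # end of the opening tag
    .*?                         # link text, non-greedy
    </a \s* >                   # closing tag
}xi;

# Hypothetical helper: 1 if the string contains a valid anchor, else 0.
sub is_valid_anchor {
    my ($html) = @_;
    return $html =~ $valid_anchor ? 1 : 0;
}

for my $html ('<a href= foo_link>link text</a>', '<a href="foo>blah</a>') {
    my $label = is_valid_anchor($html) ? 'match' : 'no match';
    print "$label: $html\n";
}
```

The first string should come out as a match and the second as no match, since its attribute value never gets a closing quote.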

This should be what you're looking for, because (if I got it right) it successfully detects any valid anchor tag. Once you've got that, you can substitute SGML entities for the special characters wherever you haven't found a valid tag, like s/"/&quot;/g and so forth. That way, any invalid code gets printed verbatim. Instead of:
A link with no closing tag where there really should be one...
You'll see
A <a href="#">link with no closing tag where there really should be one...
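The "escape whatever isn't a valid tag" step could look roughly like the sketch below. It is only one way to do it, under a few assumptions: escape_html and escape_outside_anchors are hypothetical helper names, the pattern is the updated one with non-capturing groups, and the input is handled as a single string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The updated pattern, with non-capturing groups.
my $valid_anchor = qr!<a(?:\s+\w+\s*=\s*(?:'[^']*'|"[^"]*"|[a-z0-9\-._:]+))*\s*>.*?</a\s*>!i;

# Hypothetical helper: escape the characters that matter to HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or it re-escapes the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: keep valid anchors as they are, escape the rest.
sub escape_outside_anchors {
    my ($text) = @_;
    # split with a capturing group also returns the separators, so the
    # odd-numbered pieces are the anchors that matched.
    my @pieces = split /($valid_anchor)/, $text;
    my $out = '';
    for my $i (0 .. $#pieces) {
        $out .= $i % 2 ? $pieces[$i] : escape_html($pieces[$i]);
    }
    return $out;
}

print escape_outside_anchors('A <a href="#">link with no closing tag'), "\n";
print escape_outside_anchors('A <a href="#">good link</a> & more'), "\n";
```

Note that whatever sits between a matched opening tag and its </a> is passed through untouched, and . does not cross newlines here, so an anchor split over two lines would be escaped rather than kept.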

One limitation might pop up if the users start nesting anchors inside one another... This is why, if it were my own app and server, my initial response would be "get smarter users" :o)

Anyway, as always, there are bound to be faults with what I wrote above. Here's what I used to test it:

#!/usr/bin/perl
for (<>) {
    m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"))*\s*>.*?</a\s*>!i
        ? print "match: "
        : print "no match: ";
    print;
}
And my dataset:
<a href="foo">blah</a>
<a href="foo>blah</a>
<a href='foo" >blah</a>
<a href="foo">blah</b>
<a href="foo's">bar</a>
<a name="blah" href="foo" >bar</a>
And my results:
match: <a href="foo">blah</a>
no match: <a href="foo>blah</a>
no match: <a href='foo" >blah</a>
no match: <a href="foo">blah</b>
match: <a href="foo's">bar</a>
match: <a name="blah" href="foo" >bar</a>

LAI
:eof

Replies are listed 'Best First'.
Re: RegEx for incorrectly closed HTML attribute?
by Abigail-II (Bishop) on Nov 29, 2002 at 18:53 UTC
    Two examples that will fail the regex:
    <A HREF = link>FOO</A>
    <A HREF = "link"><!-- </a>-->FOO</A>

    Abigail

      I know. The first is an example of illegal HTML (at least, illegal as of XHTML 1.0) and the second is an example of nesting, as I mentioned. In the application Cody Pendant is (writing|maintaining) I would personally accept those as acceptable exceptions: neither will screw up more than the poster's message. As I understood it, the biggest problem with leaving open-ended links or otherwise screwing up the HTML was that the rest of the page would be screwed up as well. These two will get rendered as
      <A HREF = link>FOO</A>
      and
      <!-- -->FOO</A>
      respectively (assuming Cody Pendant swaps characters for entities).
      LAI
      :eof
        The point is to detect wrong or illegal HTML, so assuming the given text validates is silly. If it would validate, the whole exercise would be futile. Also, the first example is valid HTML, and has always been valid HTML. In the second example, no nesting is going on. There's just one A element.

        Abigail
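To see concretely what the two patterns in this thread do with those two examples, a small sketch like the following prints what each one actually matches. The helper name matched_text is made up here, and the groups are non-capturing so that $1 holds the whole match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The two patterns from the node above (capture groups made non-capturing).
my $original = qr!<a(?:\s+\w+\s*=\s*(?:'[^']*'|"[^"]*"))*\s*>.*?</a\s*>!i;
my $updated  = qr!<a(?:\s+\w+\s*=\s*(?:'[^']*'|"[^"]*"|[a-z0-9\-._:]+))*\s*>.*?</a\s*>!i;

# Hypothetical helper: return the matched text, or undef if nothing matched.
sub matched_text {
    my ($re, $html) = @_;
    return $html =~ /($re)/ ? $1 : undef;
}

for my $html ('<A HREF = link>FOO</A>', '<A HREF = "link"><!-- </a>-->FOO</A>') {
    for my $pair ([original => $original], [updated => $updated]) {
        my ($name, $re) = @$pair;
        my $hit = matched_text($re, $html);
        printf "%-8s %s => %s\n", $name, $html,
            defined $hit ? "matched '$hit'" : 'no match';
    }
}
```

The unquoted attribute is only accepted by the updated pattern. For the commented-out </a>, both patterns stop at the </a> inside the comment, so the match ends before the real closing tag.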
