PerlMonks  

RegEx for incorrectly closed HTML attribute?

by Cody Pendant (Prior)
on Nov 29, 2002 at 04:07 UTC ( [id://216418]=perlquestion )

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a BBS where there are sometimes serious errors with people typing their own hyperlinks.

What happens is people type <a href=" and then paste in a URL, and then fail to close the quotes, or indeed close the quotes but with a single quote, as in <a href="http://whatever.com/'>

This causes all kinds of horrible problems for the remainder of the page.

Does anyone have a regex solution to this?
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Replies are listed 'Best First'.
Re: RegEx for incorrectly closed HTML attribute?
by PodMaster (Abbot) on Nov 29, 2002 at 07:01 UTC
    I see people insisting on regular expressions all the time. Maybe you want to give YAPE::HTML a try; it's pure Perl.


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: RegEx for incorrectly closed HTML attribute?
by boo_radley (Parson) on Nov 29, 2002 at 05:14 UTC
Re: RegEx for incorrectly closed HTML attribute?
by dingus (Friar) on Nov 29, 2002 at 08:50 UTC
    The problem is likely to be worse than just a missing quote. Once you fix that you will still have the people who forget the closing </a>, or who break any of about 50 other simple syntax rules. I'd suggest you simply have a preview page, as we have here on PerlMonks, where users can see what their post really looks like.

    If you want to validate just what you mention then a checker regex is

    my ($openquote, $uri, $closer) = m!<a\s+href\s*=\s*(['"])([^>'"]+)(.)!i;
    Then
    • it's valid if $openquote eq $closer.
    • the trailing quote was omitted if $closer eq '>'.
    • otherwise the trailing quote is mismatched.
    It's up to you to figure out the replacements and whether to do it as a single regex; probably better not to try, as it will be ugly. It's probably best to reject the post and make the user fix it, that way they won't make the mistake again.
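The three outcomes above can be sketched as a small function. This is only an illustration: classify_href is a made-up helper name and the sample strings are invented; the regex is the checker from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# classify_href is a hypothetical helper name, not part of the original post.
# It applies the checker regex to one string and reports the quoting state
# of the first quoted <a href=...> it finds.
sub classify_href {
    my ($html) = @_;
    my ($openquote, $uri, $closer) =
        $html =~ m!<a\s+href\s*=\s*(['"])([^>'"]+)(.)!i;
    return 'no quoted href found'   unless defined $openquote;
    return 'valid'                  if $closer eq $openquote;
    return 'trailing quote omitted' if $closer eq '>';
    return 'trailing quote mismatched';
}

# Invented sample inputs, one per outcome.
for my $sample (
    q{<a href="http://whatever.com/">ok</a>},
    q{<a href="http://whatever.com/>missing</a>},
    q{<a href="http://whatever.com/'>mismatched</a>},
) {
    printf "%-27s %s\n", classify_href($sample) . ':', $sample;
}
```

A BBS could reject the post whenever the result is anything other than 'valid' and show the message to the user on the preview page.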

    Dingus


    Enter any 47-digit prime number to continue.
      Unfortunately, that regex will fail if the href attribute contains a quote (one other than the delimiting one), if it contains a >, if the attribute value of the href doesn't have quotes, or if there are other attributes between the element name and the href attribute.

      Abigail
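To make those four failure cases concrete, here is a rough sketch that runs the checker regex over one invented input per case. The helper name checker_verdict and the sample tags are assumptions for illustration; every sample is a legitimate anchor, yet none is reported as valid.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The checker regex under discussion, unchanged.
my $checker = qr!<a\s+href\s*=\s*(['"])([^>'"]+)(.)!i;

# checker_verdict is a hypothetical name; it reports what the checker concludes.
sub checker_verdict {
    my ($html) = @_;
    my ($openquote, $uri, $closer) = $html =~ $checker;
    return 'not recognised' unless defined $openquote;
    return 'valid'          if $closer eq $openquote;
    return 'quote omitted'  if $closer eq '>';
    return 'mismatched';
}

# Each input below is a well-formed anchor that the checker mishandles.
my @cases = (
    [ 'other quote inside value', q{<a href="foo's.html">x</a>} ],
    [ '> inside value',           q{<a href="a>b.html">x</a>} ],
    [ 'unquoted value',           q{<a href=foo.html>x</a>} ],
    [ 'attribute before href',    q{<a name="top" href="foo.html">x</a>} ],
);

for my $case (@cases) {
    my ($label, $html) = @$case;
    printf "%-26s %-15s %s\n", $label, checker_verdict($html), $html;
}
```

The first two cases are flagged as errors because the value is cut short at the embedded character; the last two are never seen by the regex at all.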

        You might want to try for something more like:

        m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"))*\s*>.*?</a\s*>!i

        Update: at Abigail-II's suggestion, here's a modified version of the above, which accepts tags like <a href= foo_link>link text</a>. Of course, comments and criticism are always welcome.

        m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"|[a-z0-9\-\._:]+))*\s*>.*?</a\s*>!i

        This should be what you're looking for, because (if I got it right) it successfully detects any valid anchor tag. Once you've got that, you can substitute stuff for SGML entities wherever you haven't found a valid tag, like s/"/&quot;/g and so forth. That way, any invalid code gets printed verbatim. Instead of:
        A link with no closing tag where there really should be one...
        You'll see
        A <a href="#">link with no closing tag where there really should be one...
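One possible way to wire up the "escape whatever isn't a valid tag" idea is sketched below. The names sanitize_post and escape_text are made up for this example, and the pattern differs from the one above in two labelled ways: the groups are non-capturing so that split returns exactly one captured field per anchor, and /s is added so link text may span lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same shape as the updated pattern above, but with non-capturing groups
# and /s added (both are assumptions made for this sketch).
my $valid_anchor =
    qr!<a(?:\s+\w+\s*=\s*(?:'[^']*'|"[^"]*"|[a-z0-9\-._:]+))*\s*>.*?</a\s*>!is;

# escape_text (hypothetical name) turns markup characters into entities.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# sanitize_post (hypothetical name) keeps valid anchors verbatim and
# escapes everything else.
sub sanitize_post {
    my ($post) = @_;
    # With one capturing group, split alternates: text, anchor, text, ...
    my @parts = split /($valid_anchor)/, $post;
    my $out = '';
    for my $i (0 .. $#parts) {
        $out .= $i % 2 ? $parts[$i] : escape_text($parts[$i]);
    }
    return $out;
}

print sanitize_post(q{A <a href="#">link with no closing tag}), "\n";
print sanitize_post(q{see <a href="foo">blah</a> & "more"}), "\n";
```

Note that this keeps any attribute on a well-formed anchor and leaves the link text itself unescaped, so it is a starting point rather than a complete filter.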

        One limitation might pop up if the users start nesting anchors inside one another... This is why my initial response if it were my own app and server would be "get smarter users" :o)
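The nesting limitation is easy to see with a quick experiment. This sketch assumes a helper called split_first_anchor (a made-up name) and an invented nested sample; the pattern is the updated one with non-capturing groups.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $valid_anchor =
    qr!<a(?:\s+\w+\s*=\s*(?:'[^']*'|"[^"]*"|[a-z0-9\-._:]+))*\s*>.*?</a\s*>!i;

# split_first_anchor (hypothetical name) returns the first matched anchor
# and whatever follows it in the string.
sub split_first_anchor {
    my ($html) = @_;
    return unless $html =~ /($valid_anchor)/;
    return ($1, substr($html, $+[1]));
}

my $nested = q{<a href="outer">one <a href="inner">two</a> three</a>};
my ($first, $rest) = split_first_anchor($nested);
print "matched:   $first\n";
print "left over: $rest\n";
```

Because .*? stops at the first </a>, the outer anchor is paired with the inner closing tag and the real outer </a> is left dangling.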

        Anyway, as always, there are bound to be faults with what I wrote above. Here's what I used to test it:

        #!/usr/bin/perl
        for (<>) {
            m!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"))*\s*>.*?</a\s*>!i
                ? print "match: "
                : print "no match: ";
            print;
        }
        And my dataset:
        <a href="foo">blah</a>
        <a href="foo>blah</a>
        <a href='foo" >blah</a>
        <a href="foo">blah</b>
        <a href="foo's">bar</a>
        <a name="blah" href="foo" >bar</a>
        And my results:
        match: <a href="foo">blah</a>
        no match: <a href="foo>blah</a>
        no match: <a href='foo" >blah</a>
        no match: <a href="foo">blah</b>
        match: <a href="foo's">bar</a>
        match: <a name="blah" href="foo" >bar</a>
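The test script above exercises the first pattern only. A similar harness for the updated pattern might look like the sketch below; is_valid_anchor is a made-up name, and the last two dataset lines are additions of mine (the unquoted example from the update, plus an unquoted URL containing slashes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The updated pattern, which also accepts unquoted attribute values.
my $anchor_re =
    qr!<a(\s+[\w]+\s*\=\s*('[^']*'|"[^"]*"|[a-z0-9\-\._:]+))*\s*>.*?</a\s*>!i;

# is_valid_anchor is a hypothetical helper name.
sub is_valid_anchor {
    my ($line) = @_;
    return $line =~ $anchor_re ? 1 : 0;
}

my @dataset = (
    q{<a href="foo">blah</a>},
    q{<a href="foo>blah</a>},
    q{<a href='foo" >blah</a>},
    q{<a href="foo">blah</b>},
    q{<a href="foo's">bar</a>},
    q{<a name="blah" href="foo" >bar</a>},
    q{<a href= foo_link>link text</a>},         # added: unquoted value
    q{<a href=http://whatever.com/>site</a>},   # added: unquoted URL with slashes
);

for my $line (@dataset) {
    my $label = is_valid_anchor($line) ? 'match: ' : 'no match: ';
    print "$label$line\n";
}
```

The six original lines give the same results as before and the foo_link line now matches. The unquoted URL does not, because / is not in the unquoted-value character class.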

        LAI
        :eof
