PerlMonks  

Parsing HTML tags with regex

by jithoosin (Scribe)
on Nov 11, 2005 at 08:17 UTC (#507660=perlquestion)
jithoosin has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a problem. I want to get every HTML tag in a file, that is, everything between < and >, using a regex (without using HTML::TokenParser). So I used m/<([^>]+)>/.

But the problem occurs in cases like this: <select name="url>adee" value="wq<ew">. Here the ">" inside "url>adee" stops the regex. Is there a good solution using a regex?
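
A minimal sketch of the failure described above, using only the pattern and the tag from the question. The variable names are illustrative, not from the original post.

```perl
use strict;
use warnings;

# The tag from the question, with '>' and '<' inside quoted attribute values.
my $html = '<select name="url>adee" value="wq<ew">';

# [^>]+ cannot cross any '>', so the match ends at the first one,
# even though that '>' sits inside a quoted value.
my ($inside) = $html =~ m/<([^>]+)>/;

print "captured: $inside\n";   # prints: captured: select name="url
```

The capture stops halfway through the name attribute, which is why the rest of the thread is about making the pattern aware of quotes.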


After reading the first two replies, I will explain my situation. I am in a bet with my friends that it is possible to do this with a regex. Please help me. There is a solution for everything. I want to win the bet.

2005-11-11 Retitled by g0n, as per Monastery guidelines
Original title: 'simple regExpr'

Re: Parsing HTML tags with regex
by murugu (Curate) on Nov 11, 2005 at 08:20 UTC
    Try HTML::Parser.

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

Re: Parsing HTML tags with regex
by pg (Canon) on Nov 11, 2005 at 08:21 UTC
    "without using HTML::TokenParser"

    Why? This is simply not the right decision. In this case it is more important to do it right, with the right tool: an HTML parser (for example, the one murugu mentioned), rather than struggling to find the "right regexp".

Re: Parsing HTML tags with regex
by BUU (Prior) on Nov 11, 2005 at 08:32 UTC
    It's not really possible with a real regex. HTML is an arbitrarily nested grammar, which doesn't work very well with a "regular" expression. However, given that perl's regexen are of the scary, non-regular kind, you could probably manage to do it. Like so..
    /(.*)(?{HTML::TokeParser->new( $1 )})/
Re: Parsing HTML tags with regex
by gopalr (Priest) on Nov 11, 2005 at 08:50 UTC

    Hi jithoosin,

    Here is a regex that matches a tag including its attribute values.

    m#<([^">]+(?:"[^"]+")*[^>]+)>#

    Thanks,
    Gopal.R

      Hi gopal,
      THANK YOU VERY MUCH. I won the bet. But now I am in a bit of trouble: I do not know how to explain how it works to my friends. So could you PLEASE explain how the regular expression works? Once again, THANK YOU VERY MUCH, GOPAL.
        m{
            <               # start with <
            (               # group start
              [^">]+        # text, but not a double quote or >
              (?:"[^"]+")*  # if a double quote is found, match up to the closing quote; optional
              [^>]+         # text, but not >
            )               # group end
            >               # end with >
        }x
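
A minimal sketch of how the pattern could be used to pull every tag out of a string, assuming the goal is a list of tag bodies. The sample input and the names `$tag_re`, `$html` and `@tags` are made up for illustration; only the pattern itself comes from the thread.

```perl
use strict;
use warnings;

# gopalr's pattern, written with /x so it can carry comments.
my $tag_re = qr{
    <                 # opening bracket
    (                 # capture the tag body
      [^">]+          # leading text: no double quote, no closing bracket
      (?:"[^"]+")*    # zero or more adjacent double-quoted strings
      [^>]+           # remaining text up to the closing bracket
    )
    >                 # closing bracket
}x;

# Hypothetical input: one ordinary tag, then the tag from the question.
my $html = '<p class="x"><select name="url>adee" value="wq<ew"> text';

# In list context, /g returns the captured body of every match.
my @tags = $html =~ /$tag_re/g;
print "$_\n" for @tags;
```

With this input the list holds two entries, the second being the whole select tag body including the quoted '>' and '<'. One thing the sketch makes visible: both `[^">]+` and `[^>]+` need at least one character, so a tag with a one-character body such as `<b>` is not matched by this pattern.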
      But that would match on:
      a < b implies b > a
      which does not contain an HTML tag. Oh, and it won't match all HTML tags correctly either. Consider for instance:
      <tag attr1="one" attr2="two"> <tag attr='"'> <tag attr1='"'>
      The first one fails to match because your regex requires that if there are double quoted values inside a tag, they must follow each other. And the second fails because your regex doesn't consider single quoted values.
      Perl --((8:>*
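
A minimal sketch that runs the pattern against two of the inputs above: the prose line and a pair of tags whose single-quoted values contain a double quote. The second pattern, `$quote_aware`, is a hypothetical alternative that does not appear in the thread; it is included only to show one common way of treating both quote styles as units.

```perl
use strict;
use warnings;

# The pattern from gopalr's reply.
my $gopalr = qr{<([^">]+(?:"[^"]+")*[^>]+)>};

# Hypothetical alternative (not from the thread): consume single- and
# double-quoted strings whole, and anything else one character at a time.
my $quote_aware = qr{<((?:"[^"]*"|'[^']*'|[^'">])+)>};

my $prose = 'a < b implies b > a';
my $tags  = q{<tag attr='"'> <tag attr1='"'>};

my ($false_hit) = $prose =~ $gopalr;           # plain text, no tag in it
my @merged      = $tags  =~ /$gopalr/g;        # tags as gopalr's pattern sees them
my @separate    = $tags  =~ /$quote_aware/g;   # tags as the alternative sees them

print "false hit: [$false_hit]\n";
print "gopalr sees ", scalar(@merged), " tag(s)\n";
print "quote-aware sees ", scalar(@separate), " tag(s)\n";
```

The original pattern reports a "tag" in the prose line, and it pairs the two double quotes that sit inside the single-quoted values, so the two tags come back as one match. The alternative keeps the two tags apart, but it still matches the prose line and knows nothing about comments, so it is not a substitute for a real parser either.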
      Thanks gopal, the above regex was useful.
Re: Parsing HTML tags with regex
by Skeeve (Vicar) on Nov 11, 2005 at 09:47 UTC

    Being picky again and, correct me anyone knowing better, but <select name="url>adee" value="wq<ew"> is not legal HTML. It has to be encoded as <select name="url&gt;adee" value="wq&lt;ew">


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Hi skeeve,
      the actual thing was <select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex>0) parent.location.href=this.options[this.selectedIndex].value;">.
      I just replaced it with a simpler example.
        That's not legal html either.

        Oh, sure, people put crap like that on their html pages, but it's not legal html - throw it at any html validator.

        The legal version of that is:

        <select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex&gt;0) parent.location.href=this.options[this.selectedIndex].value;">
        --
        @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Being picky again and, correct me anyone knowing better, but <select name="url>adee" value="wq<ew"> is not legal HTML.
      I know better. You are wrong. It is legal HTML. Don't let the fact some browsers can't parse it fool you.
      Perl --((8:>*
Re: Parsing HTML tags with regex
by tphyahoo (Vicar) on Nov 11, 2005 at 10:46 UTC
    Not so fast pal. Did you really win the bet? Can your regex process html comments with brackets in them, such as

    <!-- Html comment with a bracket... > --!>

    No? Use one of the HTML::? modules and go crawl back to your friend and admit you were wrong.
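
A minimal sketch of what happens with the comment above, assuming the pattern in question is the one gopalr posted (here with the capture group widened to include the brackets, so the whole match is visible). The variable names are illustrative.

```perl
use strict;
use warnings;

# The comment from the post above, copied as written.
my $comment = '<!-- Html comment with a bracket... > --!>';

# gopalr's pattern, capturing the brackets as well as the body.
my ($match) = $comment =~ m#(<[^">]+(?:"[^"]+")*[^>]+>)#;

print "matched: $match\n";
```

The match ends at the first '>' inside the comment text, so the trailing ` --!>` is left over as if it were ordinary page content.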

      Thanks for the notification
