Parsing HTML tags with regex

by jithoosin (Scribe)
on Nov 11, 2005 at 08:17 UTC ( [id://507660]=perlquestion )

jithoosin has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a problem. I want to get every HTML tag in a file, that is, everything between < and >, using a regex (without using HTML::TokeParser). So I used m/<([^>]+)>/.

But the problem occurs in cases like <select name="url>adee" value="wq<ew">. Here the ">" inside "url>adee" stops the regex early. Is there a good solution using a regex?


After reading the first two replies, I will explain my situation. I am in a BET with my friends that it is possible to do this with a regex. Please help me. There is a solution for everything. I want to win the bet.

2005-11-11 Retitled by g0n, as per Monastery guidelines
Original title: 'simple regExpr'
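
To make the failure in the question concrete, here is a minimal sketch that runs the pattern from the question against the sample tag. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample tag from the question: both attribute values contain a bracket.
my $html = '<select name="url>adee" value="wq<ew">';

# The pattern from the question: everything from "<" up to the first ">".
my @found = $html =~ m/<([^>]+)>/g;

# The first capture stops at the ">" inside "url>adee";
# a second, bogus match then starts at the "<" inside "wq<ew".
print "[$_]\n" for @found;
# prints:
# [select name="url]
# [ew"]
```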

Replies are listed 'Best First'.
Re: Parsing HTML tags with regex
by tphyahoo (Vicar) on Nov 11, 2005 at 10:46 UTC
    Not so fast, pal. Did you really win the bet? Can your regex process HTML comments with brackets in them, such as

    <!-- Html comment with a bracket... > --!>

    No? Use one of the HTML::? modules and go crawl back to your friend and admit you were wrong.
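
    A quick way to see the point is to feed that comment to the pattern from the question. This is only a sketch, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $comment = '<!-- Html comment with a bracket... > --!>';

# Pattern from the question; it knows nothing about comment syntax.
my ($captured) = $comment =~ m/<([^>]+)>/;

# The match ends at the ">" inside the comment text, so the rest
# of the comment is left behind as if it were ordinary page text.
print "[$captured]\n";   # prints [!-- Html comment with a bracket... ]
```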

      Thanks for the notification
Re: Parsing HTML tags with regex
by gopalr (Priest) on Nov 11, 2005 at 08:50 UTC

    Hi jithoosin,

    Here is a regex that matches a tag together with its attribute values.

    m#<([^">]+(?:"[^"]+")*[^>]+)>#

    Thanks,
    Gopal.R

      But that would match on:
      a < b implies b > a
      which does not contain an HTML tag. Oh, and it won't match all HTML tags correctly either. Consider for instance:
      <tag attr1="one" attr2="two"> <tag attr='"'> <tag attr1='"'>
      The first one fails to match because your regex requires that if there are double quoted values inside a tag, they must follow each other. And the second fails because your regex doesn't consider single quoted values.
      Perl --((8:>*
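
      The objections above can be checked with a short script. This is a sketch: the helper name `captures` and the last input are assumptions added for illustration. Run as written, the pattern does match the first example tag, because the trailing `[^>]+` absorbs the second quoted value; the weakness only shows once a later quoted value contains a `>`.

```perl
use strict;
use warnings;

# gopalr's pattern, copied from the reply above.
my $re = qr#<([^">]+(?:"[^"]+")*[^>]+)>#;

# Hypothetical helper: return the captured text of every match.
sub captures { my ($text) = @_; return $text =~ /$re/g }

# Plain text with a stray "<" and ">" is reported as a tag.
my @plain = captures('a < b implies b > a');

# The first example tag is matched as a whole.
my @whole = captures('<tag attr1="one" attr2="two">');

# Two tags with a single-quoted '"' are glued into one match.
my @glued = captures(q{<tag attr='"'> <tag attr1='"'>});

# Hypothetical extra input: a ">" inside the second double-quoted
# value cuts the match short.
my @cut = captures('<a x="1" y="2>3">');

print "[$_]\n" for @plain, @whole, @glued, @cut;
# prints:
# [ b implies b ]
# [tag attr1="one" attr2="two"]
# [tag attr='"'> <tag attr1='"']
# [a x="1" y="2]
```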
      Thanks gopal, the above regex was useful.
      Hi gopal,
      THANK YOU VERY MUCH. I won the bet. But now I am in a bit of trouble: I do not know how to explain how it works to my friends. So could you PLEASE explain how the regular expression works? Once again, THANK YOU VERY MUCH GOPAL.
        m{
            <               # start with <
            (               # group start
              [^">]+        # text that contains neither " nor >
              (?:"[^"]+")*  # if a " is found, match up to the closing quote; optional
              [^>]+         # any further text except >
            )               # group end
            >               # end with >
        }x
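
For comparison, a frequently used refinement (not taken from this thread) treats the tag body as a run of pieces, where each piece is either one ordinary character or a complete single- or double-quoted string. The sketch below assumes well-formed quoting and still ignores comments, CDATA, and script content, and it will still report "a < b implies b > a" as a tag, so it settles the bet only for simple input. The name `$tag_re` is an assumption.

```perl
use strict;
use warnings;

# A tag body is one or more pieces: an ordinary character,
# a "double-quoted" string, or a 'single-quoted' string.
my $tag_re = qr{
    <
    (                   # capture the tag body
        (?: [^>"']      #   one character that is not >, " or '
          | "[^"]*"     #   or a complete double-quoted value
          | '[^']*'     #   or a complete single-quoted value
        )+
    )
    >
}x;

my $html = q{<select name="url>adee" value="wq<ew"> <tag attr='"'>};
my @tags = $html =~ /$tag_re/g;

print "[$_]\n" for @tags;
# prints:
# [select name="url>adee" value="wq<ew"]
# [tag attr='"']
```

Because the three alternatives can never start on the same character, the engine has only one way to consume each piece, which keeps backtracking cheap.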
Re: Parsing HTML tags with regex
by Skeeve (Parson) on Nov 11, 2005 at 09:47 UTC

    Being picky again and, correct me anyone knowing better, but <select name="url>adee" value="wq<ew"> is not legal HTML. It has to be encoded as <select name="url&gt;adee" value="wq&lt;ew">


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Being picky again and, correct me anyone knowing better, but <select name="url>adee" value="wq<ew"> is not legal HTML.
      I know better. You are wrong. It is legal HTML. Don't let the fact some browsers can't parse it fool you.
      Perl --((8:>*
      Hi skeeve,
      the actual thing was <select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex>0) parent.location.href=this.options[this.selectedIndex].value;">.
      I just replaced it with a simpler example for the question.
        That's not legal html either.

        Oh, sure, people put crap like that on their html pages, but it's not legal html - throw it at any html validator.

        The legal version of that is:

        <select name="url" style="width:125px" size="1" onchange="if (this.selectedIndex&gt;0) parent.location.href=this.options[this.selectedIndex].value;">
        --
        @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
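
The encoding recommended in this subthread can be sketched as a small helper. The name `escape_attr` is hypothetical, and the sketch covers only the four characters that matter inside a double-quoted attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: encode the characters that are special
# inside a double-quoted HTML attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # ampersand first, so later entities are not re-encoded
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

my $tag = sprintf '<select name="%s" value="%s">',
    escape_attr('url>adee'), escape_attr('wq<ew');

# With the brackets encoded, even the simple pattern from the
# question captures the whole tag.
my ($captured) = $tag =~ m/<([^>]+)>/;

print "$tag\n";        # prints <select name="url&gt;adee" value="wq&lt;ew">
print "$captured\n";   # prints select name="url&gt;adee" value="wq&lt;ew"
```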
Re: Parsing HTML tags with regex
by pg (Canon) on Nov 11, 2005 at 08:21 UTC
    "without using HTML::TokeParser"

    Why? This is simply not the right decision. In this case, it is more important to do it right, with the right tool: an HTML parser (for example the one murugu mentioned), rather than struggling with the "right regexp".

Re: Parsing HTML tags with regex
by murugu (Curate) on Nov 11, 2005 at 08:20 UTC
    Try HTML::Parser.

    Regards,
    Murugesan Kandasamy
    use perl for(;;);
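
A minimal sketch of the parser-based approach suggested in these replies, assuming the HTML-Parser distribution from CPAN is installed. It uses HTML::TokeParser, which ships with that distribution; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution

my $html = '<select name="url>adee" value="wq<ew">';

# Passing a reference makes the parser read from the string
# instead of treating the argument as a file name.
my $parser = HTML::TokeParser->new(\$html)
    or die "could not create parser";

my @start_tags;
while (my $token = $parser->get_token) {
    # Start-tag tokens have the form ["S", $tag, \%attr, \@attrseq, $text].
    next unless $token->[0] eq 'S';
    push @start_tags, { name => $token->[1], attr => $token->[2] };
}

# Report each start tag with its attribute names and values.
for my $t (@start_tags) {
    print "tag: $t->{name}\n";
    print "  $_ = $t->{attr}{$_}\n" for sort keys %{ $t->{attr} };
}
```

The parser tracks quoting itself, so the brackets inside the attribute values do not end the tag.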

Re: Parsing HTML tags with regex
by BUU (Prior) on Nov 11, 2005 at 08:32 UTC
    It's not really possible with a real regex. HTML is an arbitrarily nested grammar, which doesn't work very well with a "regular" expression. However, given that Perl's regexen are of the scary, non-regular kind, you could probably manage to do it. Like so:
    /(.*)(?{HTML::TokeParser->new( $1 )})/
