Re: nested tag matching

by Vautrin (Hermit)
on Feb 05, 2004 at 14:45 UTC


in reply to nested tag matching

First of all, if you're a novice you should have the following code at the top of all your code:

use strict; use warnings;

This will help catch mistakes before they happen. Second, why aren't you using a module from CPAN to parse the HTML, e.g. HTML::TreeBuilder? You should never mess around with regular expressions on HTML. The original SGML specifications from which HTML is derived are pretty loose, which means that for every rule there are half a dozen exceptions (or more!) that will render in most browsers even though they are a pain to parse. Using a parser module will not only make your code more robust, it will also make it much more intuitive to read, e.g.:

use strict;
use warnings;
use HTML::TreeBuilder;

# the HTML itself is expected as the first command-line argument
my $HTML_to_parse = shift(@ARGV);

my $tree = HTML::TreeBuilder->new;
$tree->parse($HTML_to_parse);
$tree->eof;

my @paragraph_tags = $tree->look_down('_tag', 'p');
foreach my $p (@paragraph_tags) {
    # note that this variable will "hide" the other
    # copy of @paragraph_tags and be garbage collected
    # as soon as it goes out of scope (the end of the
    # foreach loop body)
    my @paragraph_tags = $p->look_down('_tag', 'p');

    # look_down also returns $p itself, so a count of 1
    # means this paragraph has no nested <p> inside it
    if (scalar(@paragraph_tags) == 1) {
        my $tag      = shift(@paragraph_tags);
        my @contents = $tag->content_list;
        my $content  = "";
        foreach my $con (@contents) {
            # check that we have text and not an object
            $content .= $con unless (ref $con);
        }
        print $content;
    }
}

Just to give you an idea of why using regular expressions to parse HTML is a bad idea, look at this:

<p class="foo">This is <p class="bar">HTML code using CSS Style sheets.</p></p>

Your original regular expressions make no allowance for the class="" attribute, so your code would break on any page that uses attributes on those tags. HTML::TreeBuilder would take it in stride, and if you ever needed the attributes you could get at them with: my %attr = $node->all_external_attr;. So again, don't reinvent the wheel if you don't have to.
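To make the attribute problem concrete, here is a minimal sketch using only core Perl. The pattern in $naive is a hypothetical stand-in for a regex that hard-codes a bare <p>; it is not the regex from the original question, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical naive pattern: it only recognises a bare <p> with no attributes.
my $naive = qr{<p>(.*?)</p>}s;

# Two made-up inputs that differ only in the class attribute.
my $plain      = '<p>Some text.</p>';
my $with_attrs = '<p class="foo">Some text.</p>';

my $plain_hit = ($plain      =~ $naive) ? 1 : 0;
my $attr_hit  = ($with_attrs =~ $naive) ? 1 : 0;

printf "plain:           %s\n", $plain_hit ? "matched" : "no match";
printf "with attributes: %s\n", $attr_hit  ? "matched" : "no match";
```

The first string matches and the second does not, even though a browser treats both as a paragraph. You can patch the pattern to allow attributes, but then quoting, a stray > inside an attribute value, and so on each need their own patch, which is the argument for a real parser.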

Re: Re: nested tag matching
by Anonymous Monk on Feb 06, 2004 at 11:58 UTC
    Hi Vautrin,

    Thanks for your kind reply.

    Actually, I'm working at a company as a trainee programmer, doing conversion and validation of text files to SGML/XML/HTML. What I asked about was only an example; the question applies not just to HTML but to markup languages in general.

    Secondly, I'm working with plain text files that have short tags applied in them, and I convert them into a markup language (XML or SGML) according to a DTD specification.

    input:

    [[tx]][[nm]]Murugesan[[/nm]]is trainee perl programmer.[[/tx]]
    [[tx1]][[nm]]Murugesan[[/nm]]is trainee perl programmer.[[/tx1]]

    output:

    <customer><name type="rrd">Murugesan</name> is trainee perl programmer</customer>
    <vendor><name type="integra">Murugesan</name> is trainee perl programmer</vendor>

    My next question: when converting input text like the above, where short tags are nested inside other short tags, I currently call subroutines from within the replacement part of the regular expression to produce the output format, as in the code below.

    s/\[\[tx\]\](.*?)\[\[\/tx\]\]/'<customer>'.&txt($1).'<\/customer>'/egsi;

    sub txt {
        my $a = $_[0];
        $a =~ s/\[\[nm\]\](.*?)\[\[\/nm\]\]/<name type="rrd">$1<\/name>/gsi;
        return $a;
    }

    In the code above I have only handled a single level of nesting, for the name tag. Sometimes there are more levels of nesting. Is this method of calling subroutines inside the regular expression, which leads to nested subroutines, OK, or is there a more efficient method available?

      First of all, unless you're using a very old version of perl, you can use custom quotes (delimiters) for your regular expressions, e.g. s{...}{...} instead of s/.../.../. Your substitution would be much clearer written that way.

      It's not really that much more readable, because you have a lot of special characters in your regular expression, but not having to escape every / does make things clearer.
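As a rough sketch of what the custom-delimiter version could look like, here is the substitution from the question rewritten with s{...}{...}. The patterns and the txt sub come from the post above; the names $text, $input and $output are mine, and the lexical is renamed from $a only because $a is also the special variable used by sort.

```perl
use strict;
use warnings;

# Inner conversion: [[nm]]...[[/nm]] becomes a <name> element.
sub txt {
    my ($text) = @_;
    $text =~ s{\[\[nm\]\](.*?)\[\[/nm\]\]}{<name type="rrd">$1</name>}gsi;
    return $text;
}

my $input = '[[tx]][[nm]]Murugesan[[/nm]]is trainee perl programmer.[[/tx]]';

# Outer conversion: with {} as delimiters there is no need to escape
# the / in [[/tx]] or in </customer>.
(my $output = $input) =~
    s{\[\[tx\]\](.*?)\[\[/tx\]\]}{'<customer>' . txt($1) . '</customer>'}egsi;

print "$output\n";
```

This prints <customer><name type="rrd">Murugesan</name>is trainee perl programmer.</customer>. The brackets still need their backslashes, so the gain is modest, but every \/ is gone.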

      You may want to take a lesson from HTML::TreeBuilder or XML::TreeBuilder and create a tree if you're doing a lot of work on your custom tags. Check out the code for XML::TreeBuilder. It reuses the HTML::TreeBuilder element model (its XML::Element nodes are a subclass of HTML::Element), overrides a few of the element behaviours, and ends up with a complete system to process and handle XML.
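To give an idea of the tree approach without pulling in any modules, here is a minimal core-Perl sketch. Everything in it is an assumption made for illustration: the names parse_short_tags and render, the tables %element_for and %name_type_in, and the guess that the type attribute on name depends on the enclosing tag (which is what the sample input and output above suggest). It is not how HTML::TreeBuilder or XML::TreeBuilder work internally.

```perl
use strict;
use warnings;

# Hypothetical mapping tables, guessed from the sample input/output.
my %element_for  = ( tx => 'customer', tx1 => 'vendor', nm => 'name' );
my %name_type_in = ( tx => 'rrd',      tx1 => 'integra' );

# Parse [[tag]]...[[/tag]] markup into a tree of hashes.
# Each node is { tag => ..., children => [ text strings and child nodes ] }.
sub parse_short_tags {
    my ($text) = @_;
    my $root  = { tag => 'ROOT', children => [] };
    my @stack = ($root);

    # split with a capturing group keeps the tag tokens in the result list
    for my $token (split m{(\[\[/?\w+\]\])}, $text) {
        next if $token eq '';
        if ($token =~ m{^\[\[(\w+)\]\]$}) {          # opening tag
            my $node = { tag => lc $1, children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        }
        elsif ($token =~ m{^\[\[/(\w+)\]\]$}) {      # closing tag
            my $name = lc $1;
            my $open = pop @stack;
            if ($open == $root || $open->{tag} ne $name) {
                die "Unbalanced closing tag: $name\n";
            }
        }
        else {                                       # plain text
            push @{ $stack[-1]{children} }, $token;
        }
    }
    die "Unclosed tag: $stack[-1]{tag}\n" if @stack > 1;
    return $root;
}

# Walk the tree depth-first; text passes through, nodes become elements.
sub render {
    my ($node, $parent_tag) = @_;
    return $node unless ref $node;                   # text child
    my $tag   = $node->{tag};
    my $inner = join '', map { render($_, $tag) } @{ $node->{children} };
    return $inner if $tag eq 'ROOT';

    my $name = exists $element_for{$tag} ? $element_for{$tag} : $tag;
    my $attr = '';
    if ($tag eq 'nm' && exists $name_type_in{$parent_tag}) {
        $attr = ' type="' . $name_type_in{$parent_tag} . '"';
    }
    return "<$name$attr>$inner</$name>";
}

my $input = '[[tx]][[nm]]Murugesan[[/nm]]is trainee perl programmer.[[/tx]] '
          . '[[tx1]][[nm]]Murugesan[[/nm]]is trainee perl programmer.[[/tx1]]';
my $output = render(parse_short_tags($input));
print "$output\n";
```

The point of the design is that nesting depth stops mattering: the stack in parse_short_tags handles any number of levels, and adding a new short tag is one more entry in a table rather than one more subroutine called from inside a regex. Mismatched or unclosed tags die with a message instead of silently producing wrong output, which helps if you are validating against a DTD afterwards.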
