PerlMonks  

Converting very nested XHTML to XML

by richill (Monk)
on Jul 07, 2006 at 18:52 UTC (#559853)
richill has asked for the wisdom of the Perl Monks concerning the following question:

I have an XHTML document containing 4000 nested tables, up to 7 layers deep. All it shows is a nested directory structure. What is a good module to simplify this? I want to convert it to an XML document of the form

<root>
<node>
<name></name>
<node>
<name></name>
</node>
</node>
</root>

I think that using HTML::TokeParser to step through the document, testing each token and writing XML to a separate file depending on the result, will work. Or is there a better way?
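The token-stepping idea can be sketched roughly as follows. This is only an illustration: the sample markup, the sub name `tables_to_xml`, and the rule "every table becomes a node, every non-blank text run becomes a name" are assumptions, since the real document's layout is not shown in the question.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical rule: <table> opens a <node>, </table> closes it, and any
# non-blank text run becomes a <name>. Real markup may need stricter tests.
sub tables_to_xml {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse input";
    my $depth = 0;
    my $out   = "<root>\n";
    while (my $t = $p->get_token) {
        my ($type, $tag) = @$t;
        if ($type eq 'S' && $tag eq 'table') {
            $depth++;
            $out .= ('  ' x $depth) . "<node>\n";
        }
        elsif ($type eq 'E' && $tag eq 'table') {
            $out .= ('  ' x $depth) . "</node>\n";
            $depth--;
        }
        elsif ($type eq 'T') {
            # Text tokens are raw: decode entities, trim, then re-escape.
            my $text = decode_entities($t->[1]);
            $text =~ s/^\s+|\s+$//g;
            next unless length $text;
            $out .= ('  ' x ($depth + 1))
                  . '<name>' . encode_entities($text, '<>&') . "</name>\n";
        }
    }
    return $out . "</root>\n";
}

# Made-up sample: a directory "top" containing a directory "child".
my $sample = <<'END';
<table><tr><td>top</td></tr>
  <tr><td><table><tr><td>child</td></tr></table></td></tr>
</table>
END

print tables_to_xml($sample);
```

Because the tokens are handled one at a time, the 4000 tables never need to be held in memory as a tree; only the current depth is tracked.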

I've looked at using XSLT to do the transform, but either I'm missing something or it doesn't cope well with nested structures.

Re: Converting very nested XHTML to XML
by Tanktalus (Canon) on Jul 07, 2006 at 21:14 UTC

    Personally, I have a general method for choosing XML modules to use. Thus, one way is to go through the XHTML, and build up a new XML::Twig object holding the new info, then to dump that.

    However, if all you're doing is transforming from XML (which XHTML is) to something else, especially something else that is XML-like (in this case, actual XML), then that sounds like exactly what XSLT is for. I suppose that if you have to do some magic parsing of some nodes to split out information from the text rather than build it up, it may get more painful, and that's where Perl with XML::Twig comes in.
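A minimal sketch of the XML::Twig route might look like the following. It assumes the input is well-formed XML whose root element is the outermost table, and that each cell holds either a name or a nested table; the sub names `convert_table` and `xhtml_to_tree` and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Copy one <table> into a new <node> under $parent, recursing into any
# table nested directly inside a cell.
sub convert_table {
    my ($table, $parent) = @_;
    my $node = $parent->insert_new_elt(last_child => 'node');
    for my $tr ($table->children('tr')) {
        for my $td ($tr->children('td')) {
            my @nested = $td->children('table');
            if (@nested) {
                convert_table($_, $node) for @nested;
            }
            elsif (length(my $text = $td->trimmed_text)) {
                $node->insert_new_elt(last_child => 'name', $text);
            }
        }
    }
    return $node;
}

sub xhtml_to_tree {
    my ($xhtml) = @_;
    my $in = XML::Twig->new(pretty_print => 'indented');
    $in->parse($xhtml);                 # dies if the input is not well-formed
    my $root = XML::Twig::Elt->new('root');
    convert_table($in->root, $root);    # assumes the root element is a <table>
    return $root;
}

# Made-up sample: a directory "top" containing a directory "child".
my $tree = xhtml_to_tree(<<'END');
<table>
  <tr><td>top</td></tr>
  <tr><td><table><tr><td>child</td></tr></table></td></tr>
</table>
END

print $tree->sprint, "\n";
```

Note that a real XHTML file may use entities such as `&nbsp;` that an XML parser rejects unless the DTD is available, so some clean-up of the input may be needed first.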

[OT] Re: Converting very nested XHTML to XML
by idsfa (Vicar) on Jul 07, 2006 at 21:37 UTC

    Not knowing what your original XHTML looks like, I can only provide a general XSLT. In the following, I assume each row in a table contains either a table or a td which holds the name (your tags may vary). This allows xsl:apply-templates to recursively call the table template.

    <xsl:template match="table">
      <node>
        <xsl:for-each select="tr">
          <xsl:apply-templates />
        </xsl:for-each>
      </node>
    </xsl:template>

    <xsl:template match="td">
      <name><xsl:value-of select="." /></name>
    </xsl:template>
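One way to try these two templates from Perl is with XML::LibXSLT, sketched below. The `<root>` wrapper template, the `xsl:strip-space` and `xsl:output` settings, the sub name `transform_tables`, and the sample input are additions for this sketch and not part of the post above. The sample follows the post's assumption that a row holds either a td or a table, and it carries no namespace; a real XHTML file that declares the XHTML namespace would need prefixed match patterns.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# The two templates from the post, wrapped in a complete stylesheet.
# The "/" template, strip-space and output lines are assumptions added here.
my $xsl = <<'END_XSL';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <root><xsl:apply-templates/></root>
  </xsl:template>
  <xsl:template match="table">
    <node>
      <xsl:for-each select="tr">
        <xsl:apply-templates/>
      </xsl:for-each>
    </node>
  </xsl:template>
  <xsl:template match="td">
    <name><xsl:value-of select="."/></name>
  </xsl:template>
</xsl:stylesheet>
END_XSL

sub transform_tables {
    my ($xml) = @_;
    my $stylesheet = XML::LibXSLT->new->parse_stylesheet(
        XML::LibXML->load_xml(string => $xsl));
    my $result = $stylesheet->transform(
        XML::LibXML->load_xml(string => $xml));
    return $stylesheet->output_as_chars($result);
}

# Made-up input: a row holds either a name cell or a nested table.
print transform_tables(
    '<table><tr><td>top</td></tr>'
  . '<tr><table><tr><td>child</td></tr></table></tr></table>');
```

If the nested tables actually sit inside a td, the td template as written would emit the concatenated text of the whole subtree as one name, so the match patterns would have to be adjusted to the real markup.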

    Here's a more complete example.


    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
      The actual HTML is here. The trouble I'm having is that the relationship between a <node> and the following <node> is indicated by the number of images preceding the TD containing the node name. I guess I need to learn XSLT better, because I don't know how to write a rule for that at the moment.
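When depth is encoded as a count rather than by real nesting, a stack is often easier than an XSLT rule. The sketch below assumes each row has already been reduced to a `[depth, name]` pair, for example by counting the img start tags seen in a row before its name cell; the sub names and the sample directory names are hypothetical.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Turn a flat list of [depth, name] rows into nested <node> elements.
sub rows_to_xml {
    my (@rows) = @_;
    my $out = "<root>\n";
    my @open;    # depths of the <node> elements that are currently open
    for my $row (@rows) {
        my ($depth, $name) = @$row;
        # Close every open node that is not an ancestor of this row.
        while (@open && $open[-1] >= $depth) {
            $out .= ('  ' x scalar(@open)) . "</node>\n";
            pop @open;
        }
        push @open, $depth;
        $out .= ('  ' x scalar(@open)) . "<node>\n";
        $out .= ('  ' x (scalar(@open) + 1))
              . '<name>' . xml_escape($name) . "</name>\n";
    }
    # Close whatever is still open at the end of the input.
    while (@open) {
        $out .= ('  ' x scalar(@open)) . "</node>\n";
        pop @open;
    }
    return $out . "</root>\n";
}

# Made-up rows: the first number stands for the image count.
print rows_to_xml([0, 'usr'], [1, 'bin'], [1, 'lib'], [0, 'etc']);
```

A row at the same or a shallower depth than the last open node closes that node first, so siblings and returns to a higher level fall out of the same loop.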

Re: Converting very nested XHTML to XML
by shmem (Canon) on Jul 07, 2006 at 22:29 UTC
    It really depends on what your source file looks like, and on what quirks it needs. Sometimes it's best to choose the wrong parser for the right reason...

    --shmem

    _($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                                  /\_/(q    /
    ----------------------------  \__(m.====.(_("always off the crowd"))."
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
