PerlMonks  

XML::Simple problem, or How to convert HTML to Perl and then back again.

by Wonko the sane (Deacon)
on Jul 11, 2003 at 18:05 UTC (#273504=perlquestion)
Wonko the sane has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

What I am trying to do is take a string of valid HTML, convert it into a Perl data structure so I can easily modify the individual parts, then output the modified structure back as HTML again.

At first I thought that this would be fairly easy using XML::Simple, but I seem to be running into a problem when converting back to HTML. It seems that some tags are being interpreted as tag attributes, rather than as the nested tags that they started out as.

I have read through the XML::Simple documentation but have not been able to glean a way of getting what I want out of it.

I may be going about this the wrong way entirely. Maybe this cannot effectively be done? I would be very interested in any suggestions on how I might go about doing this. I am not set on using XML::Simple, it just seemed the closest fit to the solution I could find.

Here is a code snippet that I have been working with, that almost gives me what I want.

#!/usr/local/bin/perl5.6.0 -w
use strict;
use Data::Dumper;
use XML::Simple;

my $html = q{
<html>
  <head>
    <title> test </title>
  </head>
  <body bgcolor='red' >
    <errors>
      <TMPL_IF NAME='INVALID_WIDGET_SIZE' > waka waka </TMPL_IF>
      <TMPL_IF NAME='INVALID_WIDGET_COLOR' > waka waka waka </TMPL_IF>
    </errors>
    <table>
      <tr colspan='2'>
        <td> text </td>
        <td> text2 </td>
      </tr>
    </table>
  </body>
</html>
};

my $xs  = XML::Simple->new();
my $ref = $xs->XMLin( $html, ForceContent => 1, KeepRoot => 1, ContentKey => 'content' );

for ( @{$ref->{html}->{body}->{errors}->{TMPL_IF}} ) {
    if ( $_->{NAME} eq 'INVALID_WIDGET_COLOR' ) {
        $_->{content} = 'changed it';
    }
}

my $html_data = $xs->XMLout( $ref );
print Dumper( $html_data );
__END__

Outputs:

:!./html-test.pl
$VAR1 = '<opt>
  <html name="head">
    <title> test </title>
  </html>
  <html bgcolor="red" name="body">
    <errors>
      <TMPL_IF NAME="INVALID_WIDGET_SIZE"> waka waka </TMPL_IF>
      <TMPL_IF NAME="INVALID_WIDGET_COLOR">changed it</TMPL_IF>
    </errors>
    <table colspan="2" name="tr">
      <td> text </td>
      <td> text2 </td>
    </table>
  </html>
</opt>
';

As you can see, it seems to misinterpret the 'head' and 'tr' tags as attributes of the enclosing tags. It also does not nest everything correctly.

Any suggestions or ideas on how I can correct this, or do it another way would be much appreciated.

Best Regards,
Wonko

Re: XML::Simple problem, or How to convert HTML to Perl and then back again.
by pzbagel (Chaplain) on Jul 11, 2003 at 18:25 UTC

    There are many modules in the HTML hierarchy on CPAN. HTML::TokeParser and HTML::TreeBuilder come to mind. Each one handles the HTML document in a different way, depending on how you want to access it. TokeParser, as the name implies, tokenizes the HTML into tags and text and lets you make changes and print it out one token at a time. TreeBuilder converts your document into a tree that represents the nested elements.

    HTH

    Addendum: I was able to scrounge up a script I wrote that searches a given HTML document for table/td/tr tags and removes the width attribute using HTML::TokeParser::Simple. It's not exactly what you are looking for, but it should give you a head start:

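    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple;

    # The file to process is given as the first command-line argument.
    my $p = HTML::TokeParser::Simple->new(shift);

    while ( my $token = $p->get_token ) {
        # Anchor the pattern so only table, td and tr match
        # (an unanchored /tr/ would also match e.g. "strong").
        $token->delete_attr('width')
            if $token->is_start_tag(qr/^t(?:able|d|r)$/);
        print $token->as_is;
    }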
      Thank you for your help, though I don't really have a problem parsing the HTML.
      It's getting it back TO HTML that is causing me the problems. :-)

      Thanks though.
      Wonko

        Not to sound facetious, but it doesn't get any easier than print $token->as_is; in TokeParser and print $tree->as_HTML; in TreeBuilder to put it back into HTML form. Start with the right parser, an HTML-specific one, and you will get better results. Remember that HTML is not as rigid in its formatting as XML, which makes it flexible but a real pain to parse at times. Using a specialized parser for HTML has many benefits.

        Peace

Re: XML::Simple problem, or How to convert HTML to Perl and then back again.
by bobn (Chaplain) on Jul 11, 2003 at 19:06 UTC

    XML::Simple is meant for simple XML, mainly for use in XML-based configuration files, where the format is known ahead of time. It has some extra tricks, involving the way attributes and elements can be interchanged for attributes with certain magic names, 'id' being one of them.

    It's not made for general parsing projects, and since much HTML that will work as HTML in a browser isn't well-formed XML, XML::Simple is probably not the tool for this application.

    --Bob Niederman, http://bob-n.com
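
For readers who still want to stay with XML::Simple, the following is a minimal sketch of a round trip that avoids the "magic name" folding described above. It assumes the input is well-formed XML (a trimmed version of the markup from the question) and passes ForceArray => 1, KeyAttr => [] and KeepRoot => 1, which are the options the XML::Simple documentation suggests for round-tripping. It is an illustration, not necessarily what any poster in this thread ran.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Well-formed input, trimmed from the original question.
my $html = q{<html>
  <head><title>test</title></head>
  <body bgcolor="red">
    <errors>
      <TMPL_IF NAME="INVALID_WIDGET_SIZE">waka waka</TMPL_IF>
      <TMPL_IF NAME="INVALID_WIDGET_COLOR">waka waka waka</TMPL_IF>
    </errors>
  </body>
</html>};

my $xs = XML::Simple->new();

# ForceArray => 1 turns every child element into an array ref, so XMLout
# can tell child elements (arrays) apart from attributes (plain scalars).
# KeyAttr => [] switches off folding on the 'name', 'key' and 'id' attributes.
my $ref = $xs->XMLin(
    $html,
    ForceArray   => 1,
    KeyAttr      => [],
    KeepRoot     => 1,
    ForceContent => 1,
    ContentKey   => 'content',
);

# With ForceArray every level needs an explicit [0] index.
for my $if ( @{ $ref->{html}[0]{body}[0]{errors}[0]{TMPL_IF} } ) {
    $if->{content} = 'changed it' if $if->{NAME} eq 'INVALID_WIDGET_COLOR';
}

# KeepRoot => 1 on output reuses 'html' as the root instead of <opt>.
my $out = $xs->XMLout( $ref, KeyAttr => [], KeepRoot => 1, ContentKey => 'content' );
print $out;
```

Two caveats remain even with these options: sibling elements are written out in sorted key order, so body appears before head, and mixed content (text interleaved with child elements) is not preserved. Both are reasons the HTML-specific modules suggested elsewhere in this thread are usually the safer choice.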
Re: XML::Simple problem, or How to convert HTML to Perl and then back again.
by choocroot (Friar) on Jul 11, 2003 at 19:19 UTC
    Warning: HTML is not XML ... unless you are dealing with XHTML ...
    HTML::TreeBuilder might not work because of your non-HTML tags (the <TMPL_IF> tag), so if the document is "well formed" then I would use XML::Twig:
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
        twig_handlers => {
            TMPL_IF => sub {
                my ($t, $TMPL_IF) = @_;
                $TMPL_IF->set_text( 'changed it' );
            }
        }
    );

    $twig->parse(\*DATA);
    $twig->print;

    __DATA__
    <html>
      <head>
        <title> test </title>
      </head>
      <body bgcolor='red' >
        <errors>
          <TMPL_IF NAME='INVALID_WIDGET_SIZE' > waka waka </TMPL_IF>
          <TMPL_IF NAME='INVALID_WIDGET_COLOR' > waka waka waka </TMPL_IF>
        </errors>
        <table>
          <tr colspan='2'>
            <td> text </td>
            <td> text2 </td>
          </tr>
        </table>
      </body>
    </html>
    output:
    <html>
      <head>
        <title> test </title>
      </head>
      <body bgcolor="red">
        <errors>
          <TMPL_IF NAME="INVALID_WIDGET_SIZE">changed it</TMPL_IF>
          <TMPL_IF NAME="INVALID_WIDGET_COLOR">changed it</TMPL_IF>
        </errors>
        <table>
          <tr colspan="2">
            <td> text </td>
            <td> text2 </td>
          </tr>
        </table>
      </body>
    </html>
    Here, XML::Twig is simply used as a filter, but once your XML is parsed in the $twig object, you can easily access the elements and change their content/tag/attributes.
      You can use

      use HTML::TreeBuilder;

      my $tree = HTML::TreeBuilder->new();
      $tree->ignore_unknown(0);  # so it doesn't skip unknown tags
      $tree->xml_mode(1);        # so it will catch <br /> tags if you like XHTML
      $tree->parse_file($file);
      to do it in HTML::TreeBuilder
      Eric Hodges
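
Putting the TreeBuilder suggestions in this thread together, a complete parse, modify and print cycle might look like the sketch below. It assumes a trimmed copy of the markup from the question held in a string (so parse and eof are used rather than parse_file), leaves out xml_mode, and relies on the parser lower-casing tag and attribute names, which is why the lookup uses tmpl_if and name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Trimmed copy of the markup from the question.
my $html = q{<html><head><title>test</title></head><body bgcolor="red">
<errors>
<TMPL_IF NAME="INVALID_WIDGET_SIZE">waka waka</TMPL_IF>
<TMPL_IF NAME="INVALID_WIDGET_COLOR">waka waka waka</TMPL_IF>
</errors>
</body></html>};

my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);    # keep <errors> and <TMPL_IF> in the tree
$tree->parse($html);
$tree->eof;                  # tell the parser the input is complete

# Tag and attribute names are lower-cased by the parser; values are not.
for my $el ( $tree->look_down( _tag => 'tmpl_if', name => 'INVALID_WIDGET_COLOR' ) ) {
    $el->delete_content;
    $el->push_content('changed it');
}

# The empty hash ref asks as_HTML to emit every end tag explicitly.
my $out = $tree->as_HTML( undef, '  ', {} );
$tree->delete;               # free the tree
print $out;
```

The output uses lower-case tag names (tmpl_if rather than TMPL_IF). HTML::Template treats its tags case-insensitively, so that should not matter for this use, but it is worth checking if another tool consumes the result.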
Re: XML::Simple problem, or How to convert HTML to Perl and then back again.
by saintbrie (Scribe) on Jul 12, 2003 at 00:24 UTC

    Is the template much more complex than the example you provide? If not, you could just concatenate the strings, i.e.

    my $top_of_template = qq(
    <html>
      <head>
        <title> test </title>
      </head>
      <body bgcolor='red' >
    );
    my $bottom_of_template = qq(
        <table>
          <tr colspan='2'>
            <td> text </td>
            <td> text2 </td>
          </tr>
        </table>
      </body>
    </html>
    );
    my $middle_of_template = (your code);
    my $output = $top_of_template . $middle_of_template . $bottom_of_template;
    Alternately, you could just use HTML::Template to do this same task:
    <errors>
      <TMPL_IF NAME="INVALID_WIDGET_SIZE"><TMPL_VAR NAME="INVALID_WIDGET_SIZE"></TMPL_IF>
      <TMPL_IF NAME="INVALID_WIDGET_COLOR"><TMPL_VAR NAME="INVALID_WIDGET_COLOR"></TMPL_IF>
    </errors>

    and then just pass the invalid widget color message (or any other value you'd like) instead of a simple boolean, so you don't have to worry about dealing with XML::Simple and all of its nastiness.
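
As a rough sketch of that approach, the template fragment above can be filled in with HTML::Template directly. The template is held in a string here for brevity (a real one would normally live in a file), and the message text 'changed it' is only a placeholder value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

# The <errors> fragment from above, kept in a string for this example.
my $src = <<'END_TMPL';
<errors>
  <TMPL_IF NAME="INVALID_WIDGET_SIZE"><TMPL_VAR NAME="INVALID_WIDGET_SIZE"></TMPL_IF>
  <TMPL_IF NAME="INVALID_WIDGET_COLOR"><TMPL_VAR NAME="INVALID_WIDGET_COLOR"></TMPL_IF>
</errors>
END_TMPL

my $template = HTML::Template->new( scalarref => \$src );

# A true (non-empty) value both enables the TMPL_IF block and supplies
# the text for the TMPL_VAR inside it. INVALID_WIDGET_SIZE is left unset,
# so its block produces no output.
$template->param( INVALID_WIDGET_COLOR => 'changed it' );

my $out = $template->output;
print $out;
```

Because the same parameter drives both the condition and the text, there is no data structure to convert back to markup at all: the template stays as written and only the values change.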

Re: XML::Simple problem, or How to convert HTML to Perl and then back again.
by gmpassos (Priest) on Jul 12, 2003 at 06:50 UTC
    You can use XML::Smart with its own parser, printing data() with the wild option:
    use XML::Smart ;

    # The second argument selects XML::Smart's own parser.
    my $XML = XML::Smart->new(q{
    <html>
      <head>
        <title> test </title>
      </head>
      <body bgcolor='red' >
        <errors>
          <TMPL_IF NAME='INVALID_WIDGET_SIZE' > waka waka </TMPL_IF>
          <TMPL_IF NAME='INVALID_WIDGET_COLOR' > waka waka waka </TMPL_IF>
        </errors>
        <table>
          <tr colspan='2'>
            <td> text </td>
            <td> text2 </td>
          </tr>
        </table>
      </body>
    </html>
    } , 'XML::Smart::Parser') ;

    print $XML->data(wild=>1 , noheader=>1) ;
    Output:
    <html>
      <body bgcolor="red">
        <errors>
          <TMPL_IF NAME="INVALID_WIDGET_SIZE"> waka waka </TMPL_IF>
          <TMPL_IF NAME="INVALID_WIDGET_COLOR"> waka waka waka </TMPL_IF>
        </errors>
        <table>
          <tr colspan="2">
            <td> text </td>
            <td> text2 </td>
          </tr>
        </table>
      </body>
      <head title=" test "/>
    </html>
    But note that XML and HTML are different! One thing you can see is that the order of the tags has changed and the title has become an attribute. You can control whether the data() method treats an entry as an attribute or a node with some extra flags, but it is not good to always do this! I think that the best option is to use some HTML::* module like HTML::TokeParser, as other monks have already said.

    Graciliano M. P.
    "The creativity is the expression of the liberty".
