HTML::TreeBuilder, HTML::Element, as_XML()

by AlexTape (Monk)
on May 23, 2013 at 10:35 UTC (#1034929)
AlexTape has asked for the wisdom of the Perl Monks concerning the following question:

Dear omniscient monks,

I have some HTML/TEI-like data and want to convert it to XML. This works pretty well for some files, but not for all. Here is my code:
    # pragma
    use strict;
    use warnings;

    # modules
    use XML::Simple;
    use XML::Tidy;
    use Data::Dumper;
    use Data::Diver qw( Dive DiveRef DiveError );
    use HTML::TreeBuilder;
    use XML::Tidy::Tiny;

    # little helper
    use constant false => 0;
    use constant true  => 1;

    ...

    # get instance of treebuilder
    my $root = HTML::TreeBuilder->new();
    # configure treebuilder
    $root->ignore_unknown( false );
    # dump data to the treebuilder
    $root->parse( $fileData );

    # get name for target file
    my $target = $file;
    $target =~ s/$fileExtension$/xml/;

    # open output filehandle
    open( $FH, '>', $target );
    # configure output
    binmode $FH, ":utf8";

    # ERROR HERE 208:
    my $data = $root->guts()->as_XML();
    print $FH xml_tidy( $data );
    close $FH;

    ...
caption has an invalid attribute name 'n' at line 208
I substituted every 'n' in the file, but still got the same error, so the 'n' itself does not seem to be the cause of this error. I don't know what is going on here?!
is okay.. it is all about the ->as_XML() :-((

kindly, perlig

$perlig =~ s/pec/cep/g if 'errors expected';

Re: HTML::TreeBuilder, HTML::Element, as_XML()
by Jenda (Abbot) on May 23, 2013 at 15:16 UTC

    IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR*$/ );

    instead of

    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR+$/ );

    The XML specs say that

    Name ::= NameStartChar (NameChar)*
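
The effect of the two quantifiers can be checked in isolation. The following is a minimal sketch, not the HTML::Element code itself: `$START_CHAR` and `$NAME_CHAR` here are simplified ASCII-only stand-ins (an assumption, since the XML character classes cover many more Unicode ranges), and `valid_name_plus` / `valid_name_star` are hypothetical helper names used only for this illustration.

```perl
use strict;
use warnings;

# Simplified ASCII-only stand-ins for the XML NameStartChar and NameChar
# classes (assumption: the real classes also include Unicode ranges).
my $START_CHAR = qr/[A-Za-z_:]/;
my $NAME_CHAR  = qr/[A-Za-z0-9_:.\-]/;

# Hypothetical helper: the check with '+', which demands at least one
# NameChar after the start character, i.e. a name of two or more characters.
sub valid_name_plus {
    my ($attr) = @_;
    return $attr =~ /^$START_CHAR$NAME_CHAR+$/ ? 1 : 0;
}

# Hypothetical helper: the check with '*', following the grammar
# Name ::= NameStartChar (NameChar)*
sub valid_name_star {
    my ($attr) = @_;
    return $attr =~ /^$START_CHAR$NAME_CHAR*$/ ? 1 : 0;
}

# Compare both checks on a few sample attribute names.
for my $name (qw( n id xml:lang 9lives )) {
    printf "%-8s plus=%d star=%d\n", $name,
        valid_name_plus($name), valid_name_star($name);
}
```

Under these assumptions, a one-character name such as 'n' is rejected by the `+` form and accepted by the `*` form, which would be consistent with the error message in the question.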

      IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

      I wouldn't go that far, the OP provides no data.

        The OP doesn't need to provide data, the code doesn't match the specs linked five lines above the code in question.

Re: HTML::TreeBuilder, HTML::Element, as_XML()
by ambrus (Abbot) on May 24, 2013 at 13:07 UTC

    Look in the implementation of XML::Twig for the workarounds it uses when HTML::Tree's as_XML method dies.

