HTML::TreeBuilder, HTML::Element, as_XML()

by AlexTape (Monk)
on May 23, 2013 at 10:35 UTC
AlexTape has asked for the wisdom of the Perl Monks concerning the following question:

Dear omniscient monks,

I have some HTML/TEI-like data and want to convert it to XML. It works pretty well for some files, but not for all. Here is my code:
    # pragma
    use strict;
    use warnings;

    # modules
    use XML::Simple;
    use XML::Tidy;
    use Data::Dumper;
    use Data::Diver qw( Dive DiveRef DiveError );
    use HTML::TreeBuilder;
    use XML::Tidy::Tiny;

    # little helper
    use constant false => 0;
    use constant true => 1;

    ...

    # get instance of treebuilder
    my $root = HTML::TreeBuilder->new();

    # configure treebuilder
    $root->ignore_unknown( false );

    # dump data to the treebuilder
    $root->parse( $fileData );

    # get name for target file
    my $target = $file;
    $target =~ s/$fileExtension$/xml/;

    # open output filehandle
    open( $FH, '>', $target );

    # configure output
    binmode $FH, ":utf8";

    # ERROR HERE 208:
    my $data = $root->guts()->as_XML();
    print $FH xml_tidy( $data );
    close $FH;

    ...
caption has an invalid attribute name 'n' at line 208
I substituted every 'n' in the file but still got the same error, so the 'n' itself does not seem to be the cause. I don't know what is going on here?!
is okay.. it is all about the ->as_XML() :-((

kindly, perlig

$perlig =~ s/pec/cep/g if 'errors expected';

Replies are listed 'Best First'.
Re: HTML::TreeBuilder, HTML::Element, as_XML()
by Jenda (Abbot) on May 23, 2013 at 15:16 UTC

    IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR*$/ );

    instead of

    return (0) unless ( $attr =~ /^$START_CHAR$NAME_CHAR+$/ );

    The XML specs say that

    Name ::= NameStartChar (NameChar)*
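
To see why the quantifier matters for a one-letter attribute such as 'n', here is a minimal sketch. The two character classes are simplified, ASCII-only stand-ins chosen for illustration (an assumption, not the definitions HTML::Element actually uses), and the sub names valid_with_plus and valid_with_star are hypothetical.

```perl
use strict;
use warnings;

# Simplified, ASCII-only stand-ins for NameStartChar and NameChar.
# Assumption for illustration; the real definitions also cover Unicode ranges.
my $START_CHAR = qr/[A-Za-z_:]/;
my $NAME_CHAR  = qr/[A-Za-z0-9_:.\-]/;

# The check with '+': one start character plus AT LEAST one more character.
sub valid_with_plus {
    my ($attr) = @_;
    return $attr =~ /^$START_CHAR$NAME_CHAR+$/ ? 1 : 0;
}

# The check with '*', as in the spec: a start character plus zero or more.
sub valid_with_star {
    my ($attr) = @_;
    return $attr =~ /^$START_CHAR$NAME_CHAR*$/ ? 1 : 0;
}

# Compare both checks on a few candidate attribute names.
for my $name (qw( n id xml:lang 1abc )) {
    printf "%-8s  plus=%d  star=%d\n",
        $name, valid_with_plus($name), valid_with_star($name);
}
```

With these stand-in classes, the single-character name n is rejected by the '+' form and accepted by the '*' form, while two-or-more-character names behave the same under both.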

    Enoch was right!
    Enjoy the last years of Rome.

      IMnsHO, there is a bug in the _valid_name subroutine deep in HTML::Element. There should be

      I wouldn't go that far, the OP provides no data

        The OP doesn't need to provide data, the code doesn't match the specs linked five lines above the code in question.

Re: HTML::TreeBuilder, HTML::Element, as_XML()
by ambrus (Abbot) on May 24, 2013 at 13:07 UTC

    Look in the implementation of XML::Twig for the workarounds it uses when HTML::Tree's as_XML method dies.
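
Independently of whatever XML::Twig does internally (not reproduced here), a generic first step is to trap the exception so that one bad file does not abort the whole conversion run. The sketch below assumes only that as_XML() dies on failure, as described in this thread. Local::FakeElement is a hypothetical stand-in for an HTML::Element object, used so the example runs without CPAN modules, and as_xml_or_undef is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an HTML::Element whose as_XML() may die;
# it exists only so this sketch runs without any CPAN modules.
{
    package Local::FakeElement;

    sub new {
        my ( $class, %args ) = @_;
        return bless {%args}, $class;
    }

    sub as_XML {
        my ($self) = @_;
        die "caption has an invalid attribute name 'n'\n" if $self->{broken};
        return "<p>ok</p>\n";
    }
}

# Returns the XML string, or undef (after a warning) if as_XML() died.
sub as_xml_or_undef {
    my ( $element, $label ) = @_;
    my $xml = eval { $element->as_XML() };
    if ( !defined $xml ) {
        warn "skipping $label: $@";
        return;
    }
    return $xml;
}

# One element that converts cleanly and one that fails.
for my $case ( [ 'good.html', 0 ], [ 'bad.html', 1 ] ) {
    my ( $label, $broken ) = @$case;
    my $element = Local::FakeElement->new( broken => $broken );
    my $xml     = as_xml_or_undef( $element, $label );
    print defined $xml ? "$label: converted\n" : "$label: failed\n";
}
```

In the original script the same pattern would wrap the call to $root->guts()->as_XML(), so the loop over input files can log the failing file and continue.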
