PerlMonks  

Re: XML::SAX::ParserFactory policy and differences between parser implementations

by beech (Parson)
on Mar 01, 2016 at 19:43 UTC ( [id://1156575]=note )


in reply to XML::SAX::ParserFactory policy and differences between parser implementations

I am trying to parse some big xml files while not eating all the user memory, so XML::SAX::Parser seems to be the solution.

The solution is called XML::Twig, see http://xmltwig.org/tutorial/

Update: It seems you're already aware of Twig.

Anyway, the docs aren't clear about what is supposed to happen, but the information is out there :) Use the xml_decl handler:

#!/usr/bin/perl --
use strict;
use warnings;
use XML::SAX;
use Module::Load qw/ load /;

my @files = ( ... );
my $parsers = XML::SAX->parsers();
for my $parser ( @$parsers ){
    load( $parser->{Name} );
    print "\n$parser->{Name}\n";
    for my $file ( @files ){
        $parser->{Name}->new(
            Handler => MySAXHandler->new,
        )->parse_file( $file );
    }
}

package MySAXHandler;
use base qw( XML::SAX::Base );
use Data::Dump qw/ pp /;
sub start_document { _pper('doc', @_ ) }
sub start_dtd { _pper('dtd', @_ ) }
sub xml_decl { _pper('decl', @_ ) }
sub _pper {
    my ($name, $self, $doc) = @_;
    print " $name ", pp( %$doc ), "\n";
}
__END__

XML::SAX::Expat
 doc ("Version", "1.0", "Encoding", "UTF-8", "Standalone", "")
 doc ("Version", "1.0", "Encoding", "ISO-8859-1", "Standalone", "")
 doc ()

XML::LibXML::SAX::Parser
 doc ()
 decl ("Version", "1.0", "Encoding", "UTF-8")
 doc ()
 decl ("Version", "1.0", "Encoding", "ISO-8859-1")
 doc ()
 decl ("Version", "1.0", "Encoding", undef)

XML::LibXML::SAX
 doc ()
 decl ("Version", "1.0")
 doc ()
 decl ("Version", "1.0", "Encoding", "ISO-8859-1")
 doc ()
 decl ("Version", "1.0")

Replies are listed 'Best First'.
Re^2: XML::SAX::ParserFactory policy and differences between parser implementations
by seki (Monk) on Mar 02, 2016 at 01:54 UTC

    Yes, I am aware of XML::Twig, but it is not suitable for my needs (or at least I did not see how I could use it): I need to "patch" an already parsed element to adjust its value while parsing and splitting a big block of elements that I prefer not to keep in memory.

    As your own results show, the different SAX parsers are not consistent in the SAX events they emit: XML::SAX::Expat puts the encoding into the start_document() data instead of the xml_decl() data, and XML::SAX::PurePerl does not call xml_decl() at all.
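One way to cope with that inconsistency is to accept the declaration fields from whichever event carries them. The following is only a minimal sketch: `DeclCollector` and its methods are hypothetical names, not part of XML::SAX, and it is written as a plain class so it runs without any parser installed (a real handler would more likely inherit from XML::SAX::Base, as the test program in this thread does). The events replayed at the end are the ones reported in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper class (not from the thread): remembers XML declaration
# fields whether the driver reports them via start_document() or xml_decl().
package DeclCollector;

sub new { my ($class) = @_; return bless { decl => {} }, $class }

# Copy only defined, non-empty fields so that a later event carrying less
# information does not erase a value that is already known.
sub _merge {
    my ($self, $data) = @_;
    for my $key (qw( Version Encoding Standalone )) {
        my $value = $data->{$key};
        $self->{decl}{$key} = $value if defined $value && length $value;
    }
    return;
}

sub start_document { my ($self, $doc)  = @_; $self->_merge($doc  || {}) }
sub xml_decl       { my ($self, $decl) = @_; $self->_merge($decl || {}) }

# Returns undef when no event reported an encoding.
sub encoding { $_[0]{decl}{Encoding} }

package main;

# Replay the events seen in this thread instead of running a real parser.
my $expat = DeclCollector->new;
$expat->start_document({ Version => '1.0', Encoding => 'UTF-8', Standalone => '' });

my $libxml = DeclCollector->new;
$libxml->start_document({});
$libxml->xml_decl({ Version => '1.0', Encoding => 'UTF-8' });

print "expat:  ", $expat->encoding,  "\n";
print "libxml: ", $libxml->encoding, "\n";
```

This does not help with a parser that reports the encoding in neither event; in that case `encoding` simply returns undef and the caller has to decide on a fallback.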

    Also, I do not get the same results as you with my test program and data. Could you check for which file XML::LibXML::SAX manages to give you an encoding? As you can see, it does not with my UTF-8 sample.

    data.xml

    <?xml version="1.0" encoding="UTF-8" ?>
    <root>
      <foo>
        <bar attr="baz">héhé mes 2 €</bar>
        <baz other="dummy"/>
      </foo>
    </root>

    test_sax.pl

    use strict;
    use warnings;
    use feature 'say';
    #~ use Say; #portability trick for 5.8.8
    use XML::SAX::ParserFactory;
    use XML::SAX::Writer;

    my $input = $ARGV[0] or die "usage: $0 <file.xml> [parser_package]";
    # Optional second argument forces a specific parser implementation
    $XML::SAX::ParserPackage = $ARGV[1] if $ARGV[1];

    my $output; #just for not outputting to STDOUT
    my $writer  = XML::SAX::Writer->new( Output => \$output );
    my $handler = SaxHandler->new( Handler => $writer );
    my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
    say sprintf "parser is %s (%s)", ref $parser, $parser->VERSION;
    $parser->parse_file($input);

    {
        package SaxHandler;
        use base 'XML::SAX::Base';
        use Data::Printer {indent=>2};
        use feature 'say';
        #~ use Say; #portability trick for 5.8.8

        # Dump the declaration data, then pass the event down the chain
        sub xml_decl {
            my ($self, $decl) = @_;
            say "decl ", np $decl;
            $self->SUPER::xml_decl($decl);
        }

        # Dump the document data, then pass the event down the chain
        sub start_document {
            my ($self, $doc) = @_;
            say "document ", np $doc;
            $self->SUPER::start_document($doc);
        }

        sub start_element {
            my ($self, $el) = @_;
            #~ say "start element " . $el->{LocalName};
            $self->SUPER::start_element($el);
        }
    }

    my results:

    macbookseb:perl seb$ perl -v
    This is perl 5, version 22, subversion 1 (v5.22.1) built for darwin-thread-multi-2level[...]

    macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::PurePerl
    parser is XML::SAX::PurePerl (0.99)
    document \ {}

    macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::Expat
    parser is XML::SAX::Expat (0.51)
    document \ { Encoding "UTF-8", Standalone "", Version 1.0 }

    macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX
    parser is XML::LibXML::SAX (2.0124)
    document \ {}
    decl \ { Version 1.0 }

    macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX::Parser
    parser is XML::LibXML::SAX::Parser (2.0124)
    document \ {}
    decl \ { Encoding "UTF-8", Version 1.0 }

      Also I do not get the same results as you with my test program and data.

      What do you get with my program?

      update:

      Could you check for what file XML::LibXML::SAX manages to give you an encoding?

      When it's not UTF-8, i.e. when it's encoding="ISO-8859-1".

        With your code and my data.xml:

        XML::SAX::PurePerl
         doc ()

        XML::SAX::Expat
         doc ("Standalone", "", "Encoding", "UTF-8", "Version", "1.0")

        XML::SAX::ExpatXS
         doc ()
         decl ("Encoding", "UTF-8", "Version", "1.0", "Standalone", undef)

        XML::LibXML::SAX::Parser
         doc ()
         decl ("Version", "1.0", "Encoding", "UTF-8")

        XML::LibXML::SAX
         doc ()
         decl ("Version", "1.0")

        It agrees with my own test: XML::LibXML::SAX does not get the encoding while XML::LibXML::SAX::Parser does.
        It seems that the parser must be carefully and explicitly selected to get consistent results, rather than letting the factory pass a broken parser.
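Explicit selection could be sketched as picking the first installed parser from a preference list. This is a sketch under assumptions: `pick_parser` is a hypothetical helper, not an XML::SAX function, the preference order is only an illustration based on the results in this thread, and the `$available` list is stand-in data with the same shape as the array of `{ Name => ... }` hashes that `XML::SAX->parsers()` returns, so the example runs without XML::SAX installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::SAX): return the first preferred
# parser name that appears among the installed parsers, or undef if none does.
# $available is an array reference of hashes, each with a Name key.
sub pick_parser {
    my ($preferred, $available) = @_;
    my %installed = map { $_->{Name} => 1 } @$available;
    for my $name (@$preferred) {
        return $name if $installed{$name};
    }
    return;
}

# Stand-in data; in real use this would come from XML::SAX->parsers().
my $available = [
    { Name => 'XML::SAX::PurePerl' },
    { Name => 'XML::SAX::Expat' },
    { Name => 'XML::LibXML::SAX::Parser' },
    { Name => 'XML::LibXML::SAX' },
];

# Assumed preference: parsers that reported the encoding via xml_decl() here.
my @preferred = ( 'XML::SAX::ExpatXS', 'XML::LibXML::SAX::Parser' );

my $choice = pick_parser( \@preferred, $available );
print defined $choice ? "using $choice\n" : "no preferred parser installed\n";
```

The chosen name can then be assigned to `$XML::SAX::ParserPackage` before calling `XML::SAX::ParserFactory->parser(...)`, the same way the test program above does with its second command-line argument.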

        Update: indeed, it seems that XML::LibXML::SAX fails to report the encoding of a UTF-8 encoded file, while it succeeds with an ISO-8859-1 one. I have no XML file in another encoding at hand for a further test.
