PerlMonks  

Re^4: How to Truncate Corrupt Document.xml Files?

by socrtwo (Sexton)
on Feb 16, 2012 at 04:16 UTC (#954138)


in reply to Re^3: How to Truncate Corrupt Document.xml Files?
in thread How to Truncate Corrupt Document.xml Files?

I read that a SAX parser is not well suited to rebuilding the XML document, which is what I want to do, unless I use two parsing instances: one SAX parser to analyze the document.xml file, and a second one using XML::Parser to actually add the missing end tags and rebuild document.xml.

However, is there any real benefit to using SAX here? Can't I just define, say, a start handler with XML::Parser that adds non-self-closing tags to an array, and an end handler that removes tags from the same array? Then, at the end of parsing, all that would be left in the array would be the tags the end handler never saw, and these could be appended to the end of the XML file in reverse order (last in, first out).


Re^5: How to Truncate Corrupt Document.xml Files?
by educated_foo (Vicar) on Feb 16, 2012 at 04:36 UTC
    Can't I just define, say, a start handler with XML::Parser that adds non-self-closing tags to an array, and an end handler that removes tags from the same array? Then, at the end of parsing, all that would be left in the array would be the tags the end handler never saw, and these could be appended to the end of the XML file in reverse order (last in, first out).
    That's basically what I was trying to suggest. SAX is one common stream-based parsing API that people coming from non-Perl backgrounds might know. XML::Parser is another stream-based parser which is, IMHO, easier to use.

      I constructed the beginnings of a script that is supposed to keep a running tally of unclosed tags with XML::Parser. The problem is that XML::Parser errors out when the XML is defective, which is exactly when I want the rest of the script to work. So I'm assuming that I have to switch to SAX so the script will run as a stream, adding to and removing from the array until it hits the XML corruption, as I expect you were originally suggesting.

      So here's the XML::Parser script, which doesn't run to completion when validation problems exist. When none exist, it reports nothing in the @tags array, which should be correct.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use XML::Parser;

      my $xml_file = $ARGV[0];
      my @tags;    # open tags seen so far, stored as "parent::element"

      my $parser = XML::Parser->new;
      $parser->setHandlers(
          Start => \&start_tag_handler,
          End   => \&end_tag_handler,
      );
      # parsefile dies on the first error in the XML
      $parser->parsefile($xml_file);

      sub start_tag_handler {
          my ($p, $element) = @_;
          my $parent  = $p->current_element // '';    # undef for the root element
          my $realtag = "${parent}::$element";        # braces stop Perl reading $parent::
          push @tags, $realtag;
      }

      sub end_tag_handler {
          my ($p, $element) = @_;
          my $parent  = $p->current_element // '';
          my $realtag = "${parent}::$element";
          # remove the first matching entry, if there is one
          for my $index (0 .. $#tags) {
              if ($tags[$index] eq $realtag) {
                  splice @tags, $index, 1;
                  last;
              }
          }
      }

      open my $out, '>', 'data.txt' or die "Cannot open data.txt: $!";
      print {$out} "Tags in the array are @tags\n";
      close $out;
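One possible way to keep the bookkeeping alive when the parser gives up is to wrap the parse call in an eval. XML::Parser calls its handlers as it streams through the input, so whatever is still on the stack when it dies should be the set of unclosed tags. The following is only a sketch under that assumption; unclosed_tags and closing_tags are hypothetical helper names, and a plain stack of element names stands in for the "parent::element" keys used in the script above.

```perl
use strict;
use warnings;
use XML::Parser;

# Return the names of the elements still open when parsing stopped,
# outermost first. An empty list means the input was well-formed.
sub unclosed_tags {
    my ($xml) = @_;
    my @open;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($expat, $element) = @_; push @open, $element },
            End   => sub { pop @open },
        },
    );
    # parse() dies at the first error; eval keeps @open intact
    eval { $parser->parse($xml); 1 };
    return @open;
}

# Build the missing end tags, innermost element first (last in, first out).
sub closing_tags {
    return join '', map { "</$_>" } reverse @_;
}

my @still_open = unclosed_tags('<a><b><c/></b><d>text');
print "Unclosed: @still_open\n";
print "Append:   ", closing_tags(@still_open), "\n";
```

Because the parser rejects a mismatched end tag before calling the End handler, a simple pop is enough here; no search through the array is needed.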

      Update:

      On another subject that I expect is crucial for me: why is an array declared outside a subroutine, like @tags above, available outside the subroutine in a script, but not in a module like the one below?

      package truncator;

      require 5.005_62;
      use strict;
      use XML::SAX::Base;

      our @ISA     = ('XML::SAX::Base');
      our $VERSION = '0.01';

      my @tags;

      sub new {
          my ($type) = @_;
          return bless {}, $type;
      }

      my $current_element = '';

      sub start_element {
          my ($self, $element) = @_;
          $current_element = $element->{Name};
          push(@tags, $current_element);
      }

      print @tags;

      1;

      The print @tags line doesn't return anything when outside the subroutine, but it would if it were in a script.
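A likely explanation is timing rather than scoping. Statements at the top level of a module run once, when the file is loaded by use or require, which is before the parser has delivered any events, so @tags is still empty at that point. The array is filled later, and it can be read later through a subroutine. The sketch below shows this in a single file; Truncator::Demo and its tags accessor are hypothetical names, and the handler is called by hand instead of by a SAX parser.

```perl
use strict;
use warnings;

package Truncator::Demo;    # hypothetical stand-in for the module

my @tags;                   # file-scoped lexical shared by the subs below

sub start_element {
    my ($self, $element) = @_;
    push @tags, $element->{Name};
}

# Accessor: lets a caller read the array after events have arrived
sub tags { return @tags }

package main;

# Equivalent to "print @tags" at module load time: nothing collected yet
my @at_load = Truncator::Demo->tags;

# Simulate two start_element events arriving from a parser
Truncator::Demo->start_element({ Name => 'w:p' });
Truncator::Demo->start_element({ Name => 'w:r' });

my @after = Truncator::Demo->tags;
print "at load: ", scalar(@at_load), " tags; after events: @after\n";
```

In the script version the print comes after the parsefile call, so the handlers have already run by the time the array is printed.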

      Update:

      It looks like I was reinventing the wheel. xmllint will reliably put the correct ending tags on corrupt XML with the --recover option. I did find a case, though, where its truncation and ending-tag solutions didn't suit MS Word. So what I want to do now is figure out how to truncate an XML file a configurable number of characters before the first error, and then run xmllint --recover from the command line.
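A rough sketch of that last idea follows. It assumes the exception thrown by XML::Parser contains the text "byte N" giving the error position, which is how its error messages normally read but is not something the code can guarantee. The helper names and the .truncated and .recovered file suffixes are made up for illustration. Note that the margin is counted in bytes, not characters, so a cut may land inside a tag or a multi-byte character; the sketch leaves that for xmllint to clean up.

```perl
use strict;
use warnings;
use XML::Parser;

# Byte offset of the first error, or undef if the input parses cleanly.
# Assumption: the exception text contains "byte N".
sub first_error_byte {
    my ($xml) = @_;
    my $ok  = eval { XML::Parser->new->parse($xml); 1 };
    my $err = $@;
    return if $ok;
    return $err =~ /\bbyte (\d+)/ ? $1 : undef;
}

# Cut the input $margin bytes before the first error. The input is
# returned unchanged if no error position can be determined.
sub truncate_before_error {
    my ($xml, $margin) = @_;
    my $pos = first_error_byte($xml);
    return $xml unless defined $pos;
    my $keep = $pos - $margin;
    $keep = 0 if $keep < 0;
    return substr($xml, 0, $keep);
}

# Hypothetical driver: perl truncate.pl damaged.xml 20
if (@ARGV == 2) {
    my ($file, $margin) = @ARGV;
    open my $in, '<:raw', $file or die "Cannot read $file: $!";
    my $xml = do { local $/; <$in> };
    close $in;

    my $cut = "$file.truncated";
    open my $out, '>:raw', $cut or die "Cannot write $cut: $!";
    print {$out} truncate_before_error($xml, $margin);
    close $out;

    # Let xmllint add the end tags; its exit status is not inspected here
    system('xmllint', '--recover', '--output', "$file.recovered", $cut);
}
```

Making the margin a command-line argument keeps it configurable, so different cut points can be tried until Word accepts the result.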
