
Re^5: How to Truncate Corrupt Document.xml Files?

by educated_foo (Vicar)
on Feb 16, 2012 at 04:36 UTC ( #954140=note )

in reply to Re^4: How to Truncate Corrupt Document.xml Files?
in thread How to Truncate Corrupt Document.xml Files?

Can't I just define, say, a start handler with XML::Parser that adds non-self-closing tags to an array, and then define an end handler that removes tags from the same array? Then, at the end of parsing, all that would be left in the array would be the tags the end handler never saw, and these could be added to the end of the XML file in reverse order (last in, first out).
That's basically what I was trying to suggest. SAX is one common stream-based parsing interface that people coming from non-Perl backgrounds might know. XML::Parser is another stream-based parser which is, IMHO, easier to use.

Re^6: How to Truncate Corrupt Document.xml Files?
by socrtwo (Sexton) on Feb 16, 2012 at 18:08 UTC

    I constructed the beginnings of a script that is supposed to keep a running list of unclosed tags using XML::Parser. The problem is that XML::Parser errors out when the XML is defective, which is exactly when I want the rest of the script to work. So I'm assuming that I have to switch to SAX so the script will run as a stream, adding to and removing from the array until it hits the XML corruption, as I expect you were originally suggesting.

    So here's the script with XML::Parser, which stops running as soon as the XML is not well-formed. When the XML is well-formed, it reports nothing in the @tags array, which should be correct.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $xml_file = $ARGV[0];
    my @tags;    # tags opened but not yet closed

    my $parser = XML::Parser->new;
    $parser->setHandlers(
        Start => \&start_tag_handler,
        End   => \&end_tag_handler,
    );
    $parser->parsefile($xml_file);    # dies here if the XML is not well-formed

    sub start_tag_handler {
        my ($p, $element) = @_;
        my $parent = $p->current_element // '';
        # Braces keep Perl from reading "$parent::" as a package variable.
        push @tags, "${parent}::$element";
    }

    sub end_tag_handler {
        my ($p, $element) = @_;
        my $parent  = $p->current_element // '';
        my $realtag = "${parent}::$element";
        # Remove the first matching entry, if there is one.
        for my $index (0 .. $#tags) {
            next unless $tags[$index] eq $realtag;
            splice @tags, $index, 1;
            last;
        }
    }

    open my $out, '>', 'data.txt' or die "Cannot write data.txt: $!";
    print {$out} "Tags in the array are @tags\n";
    close $out;
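
    For what it's worth, XML::Parser is itself stream-based: the handlers fire as the input is read, and the parser only dies once it reaches the bad spot. A minimal sketch, assuming the goal is just to keep the open-tag list after that point, is to wrap the parse in eval. The sub name open_tags and the sample string are made up for illustration, and a plain stack replaces the parent::element bookkeeping used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Returns (array ref of elements still open, error text or '').
# open_tags is a hypothetical helper name, not part of XML::Parser.
sub open_tags {
    my ($xml) = @_;
    my @stack;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { my ($p, $el) = @_; push @stack, $el },
            End   => sub { pop @stack },
        },
    );
    # parse() dies on the first well-formedness error; eval traps that,
    # and @stack keeps whatever the handlers recorded up to that point.
    my $ok    = eval { $parser->parse($xml); 1 };
    my $error = $ok ? '' : $@;
    return (\@stack, $error);
}

# Illustrative input: a document cut off in the middle of a paragraph.
my $broken = '<doc><body><p>some text</p><p>cut off he';
my ($open, $error) = open_tags($broken);

# Close the remaining tags in reverse order (last in, first out).
my $closers = join '', map { "</$_>" } reverse @$open;

print "Still open: @$open\n";
print "Closing tags to append: $closers\n";
```

    Appending the closing tags only helps when the damage is a clean cut between tags or inside text; a break in the middle of a tag would still need trimming first.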


    On another subject that is crucial for me: why is an array declared outside a subroutine, like @tags above, available outside the subroutine in a script, but not in a module like the one below?

    package truncator;
    require 5.005_62;
    use strict;
    use XML::SAX::Base;
    our @ISA = ('XML::SAX::Base');
    our $VERSION = '0.01';

    my @tags;

    sub new {
        my ($type) = @_;
        return bless {}, $type;
    }

    my $current_element = '';

    sub start_element {
        my ($self, $element) = @_;
        $current_element = $element->{Name};
        push(@tags, $current_element);
    }

    print @tags;

    1;

    The print @tags line prints nothing when it sits outside the subroutine in the module, but it would print the tags if it were in a script.
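
    One likely explanation is timing rather than scoping: statements at the top level of a module run once, when the module is loaded, which is before any start_element call has pushed anything. The sketch below uses only core Perl to show this. The package name TagCollector, the tags accessor and $count_at_load are hypothetical stand-ins, not part of XML::SAX.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package TagCollector;    # hypothetical stand-in for the module above

    my @tags;                # file-scoped lexical, shared by the subs below

    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    sub start_element {
        my ($self, $element) = @_;
        push @tags, $element->{Name};
    }

    # Accessor so callers can read @tags after events have arrived.
    sub tags { return @tags }

    # Top-level statement: runs once, before any start_element call,
    # just like "print @tags" at the bottom of a module file.
    our $count_at_load = @tags;
}

my $handler = TagCollector->new;
$handler->start_element({ Name => 'w:document' });
$handler->start_element({ Name => 'w:body' });

print "tags seen at load time: $TagCollector::count_at_load\n";
print "tags seen after events: ", join(' ', TagCollector->tags), "\n";
```

    The array is filled in both cases; the difference is when the print runs. Exposing the array through a method and calling it after the parse finishes avoids the problem.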


    It looks like I was reinventing the wheel. xmllint will reliably put the correct ending tags on corrupt XML with its --recover option. I did find a case, though, where its truncation and ending-tag solutions didn't suit MS Word. So what I want to do now is figure out how to truncate an XML file a configurable number of characters before the first error, and then apply the command line xmllint --recover.
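
    A rough sketch of the truncation step, under two assumptions: the error position is scraped from the text of XML::Parser's error message ("... at line L, column C, byte B"), which is not a formal interface, and the margin is counted in bytes rather than characters. The sub names and the sample string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Byte offset of the first well-formedness error, or undef if none is found.
# Assumption: the offset is read out of the error message text.
sub first_error_byte {
    my ($xml) = @_;
    my $ok = eval { XML::Parser->new->parse($xml); 1 };
    return undef if $ok;
    return $@ =~ /\bbyte (\d+)/ ? $1 : undef;
}

# Cut $xml $margin bytes before the first error.
# Returns $xml unchanged when no error offset is available.
sub truncate_before_error {
    my ($xml, $margin) = @_;
    my $offset = first_error_byte($xml);
    return $xml unless defined $offset;
    my $keep = $offset - $margin;
    $keep = 0 if $keep < 0;
    return substr($xml, 0, $keep);
}

# Illustrative input: a bare ampersand makes this not well-formed.
my $xml = '<doc><p>ok</p><p>bad & text</p></doc>';
my $cut = truncate_before_error($xml, 5);

print "kept ", length($cut), " of ", length($xml), " bytes: $cut\n";
```

    The truncated text could then be written to a file and passed to xmllint --recover to have the closing tags added. For multi-byte encodings the cut point may land inside a character, so the margin would need adjusting.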
