I put together the beginnings of a script that is supposed to keep a running list of unclosed tags using XML::Parser. The problem is that XML::Parser errors out when the XML is defective, which is exactly when I want the rest of the script to work. So I'm assuming that I have to switch to SAX, so that the script processes the file as a stream and adds to and removes from the array until it hits the XML corruption, as I think you were originally suggesting.
So here's the script with XML::Parser, which doesn't run to completion when the XML is malformed. When the XML is fine, it prints nothing for the @tags array, which should be correct.
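#!/usr/bin/perl
use strict;
use XML::Parser;

my $xml_file = $ARGV[0];
my @tags;    # elements that are currently open, filled by the handlers below

my $parser = XML::Parser->new;
$parser->setHandlers(
    Start => \&start_tag_handler,
    End   => \&end_tag_handler,
);
# parsefile() dies at the first well-formedness error, so nothing below runs
$parser->parsefile($xml_file);

sub start_tag_handler
{
    my $p = shift;
    my $element = shift;
    my $parent = $p->current_element;
    # braces needed: "$parent::$element" would interpolate the variable $parent::
    my $realtag = "${parent}::$element";
    push(@tags, $realtag);
}

sub end_tag_handler
{
    my $p2 = shift;
    my $element2 = shift;
    my $parent2 = $p2->current_element;
    my $realtag2 = "${parent2}::$element2";
    my $index = 0;
    # stop at the end of the array instead of looping forever on a missing tag
    $index++ until ($index > $#tags || $tags[$index] eq $realtag2);
    splice(@tags, $index, 1) if $index <= $#tags;
}

open(my $out, '>', 'data.txt') or die "Cannot open data.txt: $!";
print $out "Tags in the array are @tags\n";
close($out);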
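One possible way to keep the script running after the parser gives up, without switching to SAX, is to trap the error with eval. The following is only a minimal sketch under some assumptions: the sample XML string is made up, and the stack here stores bare element names rather than the parent::element pairs used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @tags;    # stack of elements that are open right now

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($p, $element) = @_; push @tags, $element },
        End   => sub { pop @tags },
    },
);

# Made-up malformed sample: <b> is closed by </a>.
my $xml = '<doc><a><b>text</a></doc>';

# XML::Parser dies on the first well-formedness error; eval traps that
# so the rest of the script still runs.
my $ok    = eval { $parser->parse($xml); 1 };
my $error = $ok ? '' : $@;

print "Parse error: $error" unless $ok;
print "Tags still open: @tags\n";
```

Because the handlers run up to the point of failure, @tags should hold whatever was still open when the parser stopped.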
Update:
On another subject that is crucial for me: why is an array declared outside a subroutine, like @tags above, visible outside the subroutine in a script but apparently not in a module like the one below?
package truncator;
require 5.005_62;
use strict;
use XML::SAX::Base;
our @ISA = ('XML::SAX::Base');
our $VERSION = '0.01';

my @tags;

sub new {
    my ($type) = @_;
    return bless {}, $type;
}

my $current_element = '';

sub start_element {
    my ($self, $element) = @_;
    $current_element = $element->{Name};
    push(@tags, $current_element);
}

print @tags;
1;
The print @tags line prints nothing when it sits outside the subroutine, but it would print the tags if this were a script.
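A possible explanation is timing rather than scope: statements at the top level of a module run once, when the file is loaded, which is before any handler has been called. The sketch below tries to show that with a stand-in package in a single file. TagDemo and open_tags are made-up names for illustration, and the handler is called by hand instead of by a SAX parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package TagDemo;    # hypothetical stand-in for the truncator module

my @tags;           # file-scoped lexical, shared by the subs below

sub new { my ($class) = @_; return bless {}, $class }

sub start_element {
    my ($self, $element) = @_;
    push @tags, $element->{Name};
}

# Accessor so callers can read the list after parsing.
sub open_tags { return @tags }

# Top-level statement: runs once, before any handler has been called,
# so the array is still empty here.
my $count_at_load = scalar @tags;
print "At load time: $count_at_load tags\n";

package main;

my $handler = TagDemo->new;
$handler->start_element({ Name => 'doc' });
$handler->start_element({ Name => 'para' });

my @seen = TagDemo::open_tags();
print "After the callbacks: @seen\n";
```

If that is what is happening, reading the array through an accessor (or in an end_document handler) after the parse has run would show the tags.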
Update:
It looks like I was reinventing the wheel. xmllint will reliably put the correct closing tags on corrupt XML with its --recover option. I did find a case, though, where its truncation and closing-tag fixes didn't suit MS Word. So what I want to do now is figure out how to truncate an XML file a configurable number of characters before the first error, and then run xmllint --recover on the result from the command line.
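A rough sketch of the truncation step might look like the following. It rests on assumptions: that the parser's error message contains an offset in the form "byte N" (XML::Parser messages appear to), that the sample message and the helper names error_byte_offset and truncate_before are made up for illustration, and that bytes and characters coincide, which only holds for single-byte text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the byte offset out of an error message such as
# "mismatched tag at line 1, column 15, byte 15".
# Returns undef if the message carries no offset.
sub error_byte_offset {
    my ($message) = @_;
    return $message =~ /\bbyte (\d+)/ ? $1 : undef;
}

# Keep the text up to $margin characters before $offset.
sub truncate_before {
    my ($text, $offset, $margin) = @_;
    my $keep = $offset - $margin;
    $keep = 0 if $keep < 0;
    return substr($text, 0, $keep);
}

my $xml    = '<doc><a><b>text</a></doc>';
my $error  = 'mismatched tag at line 1, column 15, byte 15';   # sample message
my $offset = error_byte_offset($error);
my $short  = truncate_before($xml, $offset, 4);

print "$short\n";

# Next step (not run here): write $short to a file and repair it, e.g.
#   xmllint --recover truncated.xml --output repaired.xml
```

The margin would be the configurable part. Cutting in the middle of a tag is possible with this approach, so the result still depends on how well xmllint recovers from the cut.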