
Re: Bug in XML::Parser

by graff (Chancellor)
on Oct 23, 2013 at 04:24 UTC

in reply to Bug in XML::Parser

As pointed out previously, your xml input file is bad - there's either an "extra" <CallDetailByService> open tag at line 208, or else you're missing a second </CallDetailByService> close tag at line 258. One way or the other, it's an easy thing to fix.

Apart from that, your main "foreach" loop indicates that you don't have a proper understanding yet of how to use XML::Parser. You should not be reading an xml file one line at a time and passing certain lines to the parser. That is absolutely the wrong way.

Use the parser to read (and parse) the entire file, and use the various handler subroutines to do what needs to be done as you encounter the elements of interest in the data. For example, if you want to print the contents of <Amount> elements to STDOUT, you could do something like this (after you fix your xml file):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $current_element = my $current_amount = "";

    my $p = XML::Parser->new(
        Handlers => {
            Start => \&handle_start,
            Char  => \&handle_text,
            End   => \&handle_end,
        }
    );
    $p->parsefile( "org1.xml" );

    sub handle_start {
        my ( $xp, $element, %attr ) = @_;
        $current_element = $element;   # keep track of where we are
    }

    sub handle_end {
        my ( $xp, $element ) = @_;
        if ( $element eq 'Amount' ) {  # did we just close an "Amount" tag?
            print "$current_amount\n";
            $current_amount = "";
        }
        $current_element = "";
    }

    sub handle_text {
        my ( $xp, $string ) = @_;
        # do stuff here depending on where we are now:
        $current_amount .= $string if ( $current_element eq 'Amount' );
    }

Now, isn't that a lot simpler? That's the whole point of using an XML parser - to make things simpler.

Re^2: Bug in XML::Parser
by manunamu (Initiate) on Oct 23, 2013 at 05:03 UTC
    Thanks graff. Appreciate the help. However (and I should have explained this in my earlier post), I am deliberately trying to parse the XML one line at a time rather than the whole file at once. The reason is to find any mismatched tags and then replace the tags and make other corrections. I am still stumped by the ability of XML::Parser to detect badly formed XML even though I am not parsing the whole file at once. My assumption is that since I am parsing line by line, XML::Parser has no knowledge of what is coming next and therefore should not be able to detect badly formed XML. The fact that it does is indeed fantastic, albeit completely confounding.
      If you are trying "to find if there are any mismatched tags", that sounds like you are looking for errors that would cause an XML parser to fail (and it appears that the sample xml data you posted has this kind of problem, so I understand your goal now).

      But what that really means is that you can't really use an XML parser at all to solve this problem. As pointed out above, it's easy enough to check for xml errors using xmllint, although the error reports you get can sometimes be difficult to interpret, and the actual problem can still be hard to spot.

      I would be inclined to use a regex-based diagnosis - something like this:

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $infile = shift;   # get input file name from @ARGV
          open( my $fh, "<:utf8", $infile ) or die $!;
          local $/;             # slurp the whole file in the next line
          $_ = <$fh>;
          s/^<\?.*?\?>\s*//s;   # ditch the "<?xml...?>" declaration, if any

          my %open_tags;
          my %close_tags;
          for my $tkn ( split /(?<=>)|(?=<)/ ) {   # split on look-behind | look-ahead for brackets
              next if $tkn =~ m{^<[^>]*/>\z};      # self-closing element: nothing to balance
              if ( $tkn =~ m{^<(/?)([\w:.-]+)} ) { # tag name, allowing namespace prefixes
                  if ( $1 eq '' ) {
                      $open_tags{$2}++;
                  }
                  else {
                      $close_tags{$2}++;
                  }
              }
          }

          for my $tag ( sort keys %open_tags ) {
              if ( ! exists( $close_tags{$tag} )) {
                  warn sprintf( "%s: open tag %s is never closed in %d occurrence(s)\n",
                                $infile, $tag, $open_tags{$tag} );
              }
              else {
                  if ( $close_tags{$tag} != $open_tags{$tag} ) {
                      warn sprintf( "%s: element %s has %d open tags but %d close tag(s)\n",
                                    $infile, $tag, $open_tags{$tag}, $close_tags{$tag} );
                  }
                  delete $close_tags{$tag};
              }
          }
          for my $tag ( keys %close_tags ) {
              warn sprintf( "%s: close tag %s has no open tags in %d occurrence(s)\n",
                            $infile, $tag, $close_tags{$tag} );
          }

      That will at least give you a clear tally of imbalances (if any) in the open/close tag inventory for a given xml file. You should be able to use this information, together with the line numbers from the xmllint reports, to locate the problems.
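
      If the tally alone does not narrow things down enough, a stack-based pass over the same tokens can attach line numbers to each imbalance. What follows is only a minimal sketch, not something from the original post: check_nesting() is a hypothetical helper name, it assumes tags are not hidden inside comments or CDATA sections, and its line numbers refer to the text it is handed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# check_nesting() is a hypothetical helper, not part of any module.
# It takes XML text and returns a list of problem reports (empty if balanced).
sub check_nesting {
    my ($xml) = @_;
    my ( @stack, @problems );
    my $line = 1;

    # same tokenizing idea as the tally script: break before "<" and after ">"
    for my $tkn ( split /(?<=>)|(?=<)/, $xml ) {
        my $here = $line;                 # line on which this token starts
        $line += ( $tkn =~ tr/\n// );     # count newlines inside the token

        # skip text, comments, and <?...?> declarations
        next unless $tkn =~ m{^<(/?)([\w:.-]+)};
        my ( $closing, $name ) = ( $1, $2 );
        next if $tkn =~ m{/>\z};          # self-closing element needs no partner

        if ( !$closing ) {
            push @stack, [ $name, $here ];
            next;
        }

        # find the nearest element with this name that is still open
        my ($idx) = grep { $stack[$_][0] eq $name } reverse 0 .. $#stack;
        if ( defined $idx ) {
            # anything opened after it was never closed
            for my $open ( reverse splice @stack, $idx + 1 ) {
                push @problems,
                    "line $open->[1]: <$open->[0]> is not closed before </$name> at line $here";
            }
            pop @stack;
        }
        else {
            push @problems, "line $here: </$name> has no matching open tag";
        }
    }
    push @problems, "line $_->[1]: <$_->[0]> is never closed" for @stack;
    return @problems;
}

# when run with a file name, print one report per line
if (@ARGV) {
    my $infile = shift;
    open( my $fh, "<:utf8", $infile ) or die "$infile: $!";
    my $xml = do { local $/; <$fh> };
    print "$_\n" for check_nesting($xml);
}
```

      Unlike the tally, this also catches tags that balance in count but are nested in the wrong order. Treat its reports as hints to check against the xmllint output rather than as a verdict.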

      So, when you find these mismatched tags, isn't the next step to look at the process that is creating the xml files, and fix that? (These xml files aren't being created by manual editing, are they??)

      (Update: BTW, I forgot to mention... this new information in your reply makes your OP even more egregiously obtuse. If you had said at the beginning, "I have this xml file that has an error in the tags, and I need to figure out how to find the problem," then the discussion would have been more effective. I know, you already feel bad about the OP, and I shouldn't pile it on, but it needs to be said.)

        Thanks graff. You have been a great help. Yes, you are right to stress the point about the OP, and the point is well taken. I understand your solution. But what still confounds me completely is how XML::Parser is able to detect the badly formed XML file even if the program is parsing one line at a time. This could obviously be because my understanding of XML::Parser is limited. One unintended consequence of your reply is that it gave me an insight into a problem that I had earlier but never got around to asking the PerlMonks site about. Aren't unintended consequences great when they turn out to be good? Thanks much again!
