PerlMonks  

XML::Parser - Usage of &amp;

by sumeetgrover (Monk)
on Feb 20, 2013 at 10:46 UTC

sumeetgrover has asked for the wisdom of the Perl Monks concerning the following question:

Good Morning!

I am debugging some complex legacy code that uses the XML::Parser module, which I haven't used before.

In this code, one of the XML elements looks like this:

<Title>Company A&amp;B Information</Title>

The problem is that when the parser extracts the text of this element, I get only 'B Information'. The &amp; seems to be the cause: the initial part of the text, i.e. Company A&amp;, gets ignored by the parser.

Any ideas about why XML::Parser would do this? I'd appreciate any help. Thanks.

Replies are listed 'Best First'.
Re: XML::Parser - Usage of &amp;
by tobyink (Canon) on Feb 20, 2013 at 11:10 UTC

    XML::Parser shouldn't be ignoring the Company A&amp;; I think what you'll find is that it treats the title as three pieces of character data:

    1. Company A
    2. &
    3. B Information

    And it will treat these as three separate parse events. Quick demonstration:

    use 5.010;
    use strict;
    use warnings;
    use XML::Parser;

    my $in_title;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub { $in_title++ if $_[1] eq 'Title' },
            End   => sub { $in_title-- if $_[1] eq 'Title' },
            Char  => sub { say "CHAR: $_[1]" if $in_title },
        },
    );

    $parser->parse(<<'XML');
    <Document>
      <Title>Company A&amp;B Information</Title>
      <Abstract>Foo</Abstract>
    </Document>
    XML

    XML::Parser is very bare-bones, and sees the job of translating those parse events into a useful data structure as being very much your job.

    Personally I prefer DOM-based XML parsers, such as XML::LibXML, which parse the entire file into a tree and let you navigate and manipulate that tree using the same DOM interface that web browsers expose to JavaScript.
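    As a rough sketch of the DOM-based approach, assuming XML::LibXML is installed and reusing the sample document from the demonstration above (the variable names here are only illustrative), the title comes back as one string with the entity already decoded:

```perl
use strict;
use warnings;
use XML::LibXML;

# Same sample document as in the XML::Parser demonstration.
my $xml = <<'XML';
<Document>
  <Title>Company A&amp;B Information</Title>
  <Abstract>Foo</Abstract>
</Document>
XML

# Parse the whole document into a tree.
my $doc = XML::LibXML->load_xml(string => $xml);

# Collect the text of every Title element; textContent returns the
# full text of the node, so there are no separate pieces to reassemble.
my @titles = map { $_->textContent } $doc->findnodes('/Document/Title');

print "TITLE: $_\n" for @titles;
```

    The trade-off is that the whole document is held in memory, which may matter for very large files.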

    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

      You are right, the parser is indeed treating the title as:

      1. Company A
      2. &
      3. B Information

      Does that mean our code needs to put these three pieces back together itself and save them as a single title?

      Many thanks for your help!

        I'm guessing that right now the code (you haven't posted any, so the best I can do is guess!) in the Char handler is saving only the most recent piece of character data, overwriting the previous one, and then when the End handler sees the end of the Title element, it does something with that. Maybe something like this:

        use 5.010;
        use strict;
        use warnings;
        use XML::Parser;

        my ($got_title, $in_title);
        my $parser = XML::Parser->new(
            Handlers => {
                Start => sub { $in_title++ if $_[1] eq 'Title' },
                End   => sub { $in_title--, say "GOT TITLE: $got_title" if $_[1] eq 'Title' },
                Char  => sub { $got_title = $_[1] if $in_title },
            },
        );

        $parser->parse(<<'XML');
        <Document>
          <Title>Company A&amp;B Information</Title>
          <Abstract>Foo</Abstract>
          <Title>Company X&amp;Y Information</Title>
          <Abstract>Bar</Abstract>
        </Document>
        XML

        Instead you want the Char handler to accumulate the pieces of character data using either string appending, or pushing onto an array/arrayref, then use the Start and End handlers to signal when to start and end accumulating character data. For example:

        use 5.010;
        use strict;
        use warnings;
        use XML::Parser;

        my (@got_title, $in_title);
        my $parser = XML::Parser->new(
            Handlers => {
                # Entering a Title: start with an empty list of pieces.
                Start => sub { $in_title++, @got_title = () if $_[1] eq 'Title' },
                # Leaving a Title: join with '' so no spaces are added around the '&'.
                End   => sub { $in_title--, say "GOT TITLE: ", join('', @got_title) if $_[1] eq 'Title' },
                # Inside a Title: collect every piece of character data.
                Char  => sub { push @got_title, $_[1] if $in_title },
            },
        );

        $parser->parse(<<'XML');
        <Document>
          <Title>Company A&amp;B Information</Title>
          <Abstract>Foo</Abstract>
          <Title>Company X&amp;Y Information</Title>
          <Abstract>Bar</Abstract>
        </Document>
        XML
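        The string-appending alternative mentioned above might look roughly like this. It is only a sketch: `$buffer` and `@titles` are illustrative names, and it assumes Title elements are never nested inside one another.

```perl
use strict;
use warnings;
use XML::Parser;

my (@titles, $in_title);
my $buffer = '';

my $parser = XML::Parser->new(
    Handlers => {
        # Start of a Title: switch accumulation on and clear the buffer.
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'Title') { $in_title = 1; $buffer = '' }
        },
        # Character data may arrive in several pieces; append each one.
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if $in_title;
        },
        # End of a Title: the buffer now holds the complete text.
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'Title') { $in_title = 0; push @titles, $buffer }
        },
    },
);

$parser->parse(<<'XML');
<Document>
  <Title>Company A&amp;B Information</Title>
  <Abstract>Foo</Abstract>
  <Title>Company X&amp;Y Information</Title>
  <Abstract>Bar</Abstract>
</Document>
XML

print "GOT TITLE: $_\n" for @titles;
```

        Appending keeps the pieces exactly as the parser delivered them, so nothing is inserted between "Company A", "&" and "B Information".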
Re: XML::Parser - Usage of &amp;
by Anonymous Monk on Feb 20, 2013 at 11:09 UTC

    Any ideas about why XML::Parser would do this?

    XML::Parser wouldn't ignore anything unless instructed to; see the XML::Parser Tutorial.

Re: XML::Parser - Usage of &amp;
by sundialsvc4 (Abbot) on Feb 20, 2013 at 15:43 UTC
    ... and join(" ", elements...) makes short work of concatenating multiple elements into one string, no matter how many elements there may be.
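    One thing to watch when using join on the pieces from this thread: the separator ends up in the result. A small sketch, with `@pieces` standing in for whatever the Char handler collected:

```perl
use strict;
use warnings;

# Hypothetical pieces, in the order the Char handler would deliver them.
my @pieces = ('Company A', '&', 'B Information');

# A space separator adds spaces that were not in the source text.
my $spaced = join(' ', @pieces);

# An empty separator reproduces the original text exactly.
my $exact = join('', @pieces);

print "$spaced\n";   # Company A & B Information
print "$exact\n";    # Company A&B Information
```

    For rebuilding the text of a single element, the empty separator is probably what you want.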
