PerlMonks  

XML::Parser breaks on

by r.joseph (Hermit)
on Aug 16, 2001 at 11:59 UTC ( #105300=perlquestion )
r.joseph has asked for the wisdom of the Perl Monks concerning the following question:

Hello again everyone,

It's been a long time since my last post, but I am really stuck this time, and can't figure out why.

I am using XML::Parser to parse .RSS documents from Linux.com and Newsforge.net for a "live news feed" if you will. First, below is a snippet from Linux.com's RSS doc:

<item>
  <title>Big software companies lose their minds!</title>
  <link>http://linux.com/newsitem.phtml?sid=1&amp;aid=12492</link>
  <description>Linux.com corresponent Mark Miller has some views on big software companies.</description>
</item>

Now, if you look at the <link> tag, you see that there is a proper escape sequence, &amp;, to represent an ampersand. However, here is the problem. When XML::Parser encounters this chunk of data, it calls the Char handler, whatever you define it to be. Mine happens to be very simple, at least right now (BTW, I am using the Subs style for the parser, but that shouldn't matter):

sub found_char {
    my ($ex, $str) = @_;
    if ($ex->in_element('link') && $ex->within_element('item')) {
        print "\t\tLink: $str\n";
    }
}

So I should expect a simple string that has Link: and then the link, whatever that may be. However, it seems that XML::Parser instead, for some reason, splits on that escape sequence, so I get this output:

Link: http://linux.com/newsitem.phtml?sid=1
Link: &
Link: aid=12492

What I CANNOT figure out is why it seems to consider that string within the <link> element three strings!

Does anyone know how this can be fixed? I have seen this problem happen with other "non-element" data, and I just want it to grab all of the pertinent data at one time.

Thanks a ton!

r. j o s e p h
"Violence is a last resort of the incompetent" - Salvor Hardin, Foundation by Isaac Asimov

Re: XML::Parser breaks on
by mirod (Canon) on Aug 16, 2001 at 12:09 UTC

    This is documented behaviour of XML::Parser. Strictly speaking, XML::Parser only documents the fact that this "can" happen; in practice it happens for every entity, line break and expat input buffer boundary crossed. The review gives you a way to deal with it: basically you cannot use the data directly in the char handler, you just buffer it until you hit a tag (open _or_ close).
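The buffering approach described here can be sketched roughly as follows. This is only an illustration, not the code from the review: it uses plain Handlers rather than the Subs style from the question, the $buffer and @links variables and the wrapping <rss>/<channel> elements are made up for the example, and it assumes XML::Parser is installed.

```perl
use strict;
use warnings;
use XML::Parser;

# Sample document modelled on the snippet in the question (wrapper elements are assumed).
my $xml = <<'XML';
<rss><channel><item>
<title>Big software companies lose their minds!</title>
<link>http://linux.com/newsitem.phtml?sid=1&amp;aid=12492</link>
</item></channel></rss>
XML

my $buffer = '';   # text collected since the last tag (hypothetical name)
my @links;         # completed <link> values (hypothetical name)

my $parser = XML::Parser->new(
    Handlers => {
        # An opening tag ends the current run of text, so start with a clean buffer.
        Start => sub { $buffer = '' },

        # Char may fire several times for one element; only append here.
        Char => sub {
            my ($xp, $str) = @_;
            $buffer .= $str;
        },

        # By the time End fires, the element is no longer in the context,
        # so test the element name passed in rather than in_element().
        End => sub {
            my ($xp, $el) = @_;
            push @links, $buffer
                if $el eq 'link' && $xp->within_element('item');
            $buffer = '';
        },
    },
);

$parser->parse($xml);
print "\t\tLink: $_\n" for @links;
```

The point of the design is that nothing is printed from the Char handler; the text is only used once a tag shows that the run of character data is complete.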

    By the way, did you try XML::RSS? Maybe it would make it easier for you to process your data.

Re: XML::Parser breaks on
by blakem (Monsignor) on Aug 16, 2001 at 12:14 UTC
    I think you'll have to glue it all together. Here is a snippet that might help a bit.

    sub xml_char {
        my ($xp, $txt) = @_;
        my $el = $xp->current_element();
        $val{$el} .= $txt if $txt =~ /\S/;
    }

    Notice that %val is sort of a buffer area that will need to be cleaned up when you hit the end tag.

    -Blake

      I've used something like this before, and like the general technique, but I question:
      $val{$el}.=$txt if $txt =~ /\S/;
      Since the parser is actually allowed to break anywhere, you could lose intra-word spaces or newlines (if they're significant).
        You're probably right, and it's a piece of code I haven't looked at in a long time. I do remember there being a reason for it, but can't remember it right now. Anyone looking at using this should probably get rid of the if conditional.

        -Blake
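To make the concern about the /\S/ test concrete, here is a small sketch that needs no parser at all. The @chunks list is hand-written to imitate one text node being delivered in three pieces, with the line break arriving as its own chunk; the real split points depend on the parser and the input.

```perl
use strict;
use warnings;

# Hypothetical chunks: one text node delivered in three Char calls.
my @chunks = ("first line", "\n", "second line");

my ($filtered, $unfiltered) = ('', '');
for my $txt (@chunks) {
    $filtered   .= $txt if $txt =~ /\S/;   # skips the whitespace-only chunk
    $unfiltered .= $txt;                   # keeps every chunk
}

print "filtered:   [$filtered]\n";
print "unfiltered: [$unfiltered]\n";
```

With the filter in place the two lines are glued together with nothing between them, which is the loss described above. Appending unconditionally and trimming the finished string once the end tag arrives avoids that.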
