Re: XML parsing with XML::Rules

by jakeease (Friar)
on Jun 17, 2013 at 09:12 UTC


in reply to XML parsing with XML::Rules

I tried this with your problem area:

my $summary; # = $quake->get( "summary");
# print $summary;
while ($summary = $quake->get("summary")) {
    $parser->parse_chunk($summary);
}

and got this error message:

junk after document element at line 1, column 601, byte 601 at C:/strawberry/perl/site/lib/XML/Rules.pm line 933.

That is in XML::Rules, in sub _parse_or_filter_chunk, and for me it raises the question of whether you need to read chunks at all. The error message I got comes from an eval calling parse_more($string). A few lines up, near the beginning of the routine, is a line reading

croak "This parser is already busy parsing a full document!"

So the question is: have you already read in the whole document? And if so, is there another method, say parse, that should be used instead of parse_chunk?
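To make the parse versus parse_chunk distinction concrete, here is a minimal sketch. The tiny feed string, the rule set, and the helper name make_parser are made up for illustration and are not the original poster's code. It feeds the same document once through parse and once in two pieces through parse_chunk followed by last_chunk, and collects the summary text each time.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical stand-in for the feed; the real one is far larger.
my $xml = '<entry><title>M 4.9</title>'
        . '<summary><![CDATA[<p>PAGER</p><dl><dt>Depth</dt></dl>]]></summary>'
        . '</entry>';

# Hypothetical helper: builds a parser that pushes each <summary> body
# onto the array passed in by reference.
sub make_parser {
    my ($seen) = @_;
    return XML::Rules->new(
        rules => {
            title   => 'content',
            summary => sub {
                my ($tag, $attrs) = @_;
                push @$seen, $attrs->{_content};
                return;    # nothing is passed up to the parent rule
            },
            entry   => 'no content',
        },
    );
}

# 1. The whole document is already in memory: a single parse() call.
my @whole;
make_parser(\@whole)->parse($xml);

# 2. The document arrives in pieces: parse_chunk() per piece,
#    then last_chunk() to finish.
my @chunked;
my $chunk_parser = make_parser(\@chunked);
my $half = int(length($xml) / 2);
$chunk_parser->parse_chunk(substr($xml, 0, $half));
$chunk_parser->parse_chunk(substr($xml, $half));
$chunk_parser->last_chunk();

print "same summary either way\n" if $whole[0] eq $chunked[0];
```

In both cases the chunks are pieces of one well-formed document. parse_chunk is not a way to feed the parser an extracted fragment such as the summary text itself.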

UPDATE

I tried it again, this way using parse:

my $summary = $quake->get( "summary");
print $summary;
#while ($summary = $quake->get("summary")) {
    $parser->parse($summary);
#}
#my $data = $parser->last_chunk();
#my $dd = $data->get( "dd");
#print $dd, "\n";
with the result:
C:\Users\JKeys>perl \myperl\quake.pl
# This Quake file created by quake_parsing_9
# Matt Coblentz; Perl version unknown
# For more information, see the USGS website
# Last Updated: 1 17 5 2013, 4:34:55
#
junk after document element at line 1, column 601, byte 601 at C:/strawberry/perl/site/lib/XML/Rules.pm line 745.
<p class="quicksummary"><a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#pager" title="PAGER estimated impact alert level" class="pager-green">PAGER - <strong class="roman">GREEN</strong></a> <a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#shakemap" title="ShakeMap maximum estimated intensity" class="mmi-V">ShakeMap - <strong class="roman">V</strong></a> <a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#dyfi" class="mmi-IV" title="Did You Feel It? maximum reported intensity (5 reports)">DYFI? - <strong class="roman">IV</strong></a></p><dl><dt>Time</dt><dd>2013-06-16 21:39:09 UTC</dd><dd>2013-06-16 23:39:09 +02:00 at epicenter</dd><dt>Location</dt><dd>34.491&deg;N 25.087&deg;E</dd><dt>Depth</dt><dd>37.85 km (23.52 mi)</dd></dl>

So I'm still getting the "junk" message, this time from the parse method. I don't know whether that's the feed, your code, or my tweaks. But it's sleepy time now.

Replies are listed 'Best First'.
Re^2: XML parsing with XML::Rules
by mcoblentz (Scribe) on Jun 18, 2013 at 00:02 UTC
    Hi,

    I tried originally with 'parse' and several other XML methods. Because I had already read in the XML document and extracted the CDATA material, the "chunk" becomes an incorrectly formed XML document: it does not have a single root element wrapping all the other elements. The CDATA content starts with a paragraph tag, which stops in the middle, then picks up with some table elements. That's where the 'junk' error comes from - Perl is complaining about a poorly formed XML document.
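The missing-root-element problem can be reproduced in isolation. The sketch below uses XML::Parser directly rather than XML::Rules, with a shortened, made-up fragment in place of the real summary. It shows that two sibling top-level elements trigger the same "junk after document element" error, and that the error goes away once the fragment is wrapped in a single root element (the name root is arbitrary).

```perl
use strict;
use warnings;
use XML::Parser;

# Shortened, hypothetical stand-in for the extracted CDATA text:
# a <p> followed by a sibling <dl>, with no element wrapping both.
my $fragment = '<p>PAGER - GREEN</p>'
             . '<dl><dt>Depth</dt><dd>37.85 km (23.52 mi)</dd></dl>';

my $parser = XML::Parser->new;

# parse() dies on input that is not well-formed, so trap it with eval.
my $bare_ok    = eval { $parser->parse($fragment); 1 };
my $bare_error = $bare_ok ? '' : $@;

# The same text wrapped in one root element parses cleanly.
my $wrapped_ok = eval { $parser->parse("<root>$fragment</root>"); 1 };

print "bare fragment:    ", ($bare_ok    ? "ok" : "failed: $bare_error"), "\n";
print "wrapped fragment: ", ($wrapped_ok ? "ok" : "failed"), "\n";
```

Wrapping alone would probably not rescue the real summary, since it also contains HTML entities such as &deg; that XML does not predefine. That is another reason to treat the text as HTML rather than XML.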

    I was a little embarrassed that I thought "read_in_some_data" was a real method - but the conversation has been really helpful - I honestly thought the whole documentation was trying to trick me and just couldn't figure out where my error was (silly me). Jenda had alluded to just picking up the CDATA fragments and re-parsing, which led me down a whole weird dead end. He has the right approach, I just interpreted his comments incorrectly.

    This XML stuff is "fussy". I'm hoping to wrap my head around all of this because having scripts like this will make the code overall much easier to maintain. Just find the feed, parse, and go.

    I'm just having trouble with the recursive bits of the overall process. The data changes and thus the details have to change.

    Thanks to all for pitching in. This has been a useful discussion about feeds, XML, etc.

    Matt

      It had slipped my mind that the summary was CDATA as I didn't look back at the previous post. And you're right, it's the explanation for the junk message. If Perl is complaining about a poorly formed XML document, it's because we are trying to convince it that $summary is XML.

      It isn't, of course, it's HTML. And that's what Jenda meant when he said

      If you want to split that into pieces you have to pass that string to another HTML or XML parser.

      I was about to suggest parsing $summary with LWP or HTML::Parser when I read poj's post. I like how he has simplified it and shown HTML::TreeBuilder handling $summary.
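For readers without poj's post at hand, here is a rough sketch of the general idea. It is not poj's actual code, and the shortened $summary string and the %fields layout are assumptions made for illustration. It hands the summary to HTML::TreeBuilder and pairs each dt label with the dd values that follow it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened copy of the summary; the real one has more links in the <p>.
my $summary = '<p class="quicksummary"><a href="#pager">PAGER - GREEN</a></p>'
            . '<dl><dt>Time</dt><dd>2013-06-16 21:39:09 UTC</dd>'
            . '<dd>2013-06-16 23:39:09 +02:00 at epicenter</dd>'
            . '<dt>Location</dt><dd>34.491&deg;N 25.087&deg;E</dd>'
            . '<dt>Depth</dt><dd>37.85 km (23.52 mi)</dd></dl>';

my $tree = HTML::TreeBuilder->new_from_content($summary);

# Walk the children of the <dl>: each <dt> starts a new label and each
# <dd> after it is stored under that label.
my %fields;
my $label;
my $dl = $tree->look_down(_tag => 'dl');
for my $node ($dl->content_list) {
    next unless ref $node;    # skip bare text between elements
    if ($node->tag eq 'dt') {
        $label = $node->as_text;
    }
    elsif ($node->tag eq 'dd' && defined $label) {
        push @{ $fields{$label} }, $node->as_text;
    }
}
$tree->delete;    # release the tree once the data has been copied out

print "$_: ", join(' | ', @{ $fields{$_} }), "\n" for sort keys %fields;
```

Because this is an HTML parser, the missing root element and the &deg; entities are not a problem here.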
