PerlMonks  

XML parsing with XML::Rules

by mcoblentz (Scribe)
on Jun 16, 2013 at 00:23 UTC ( #1039168=perlquestion )

mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Continuing on the ATOM parsing question, I have extracted a snippet of the response from a CDATA section which contains more XML. Monk Jenda's very fine suggestion was to re-parse it. However, I am getting a:

Undefined subroutine &main::read_chunk_of_data called at quake_parsing_9 line 81.

response from the system. I copied the 'example' straight out of CPAN but it seems that the call I'm making is not valid. The entire program is included below and the issue seems to occur around line 82 (now marked with a bunch of asterisks).

    #!/usr/bin/perl -w
    use XML::FeedPP;
    use XML::LibXML;
    use XML::Rules;
    use LWP::Simple;
    use Time::Piece;
    use strict;
    use warnings;

    # -----------------------------------------------
    # open the quake marker file for writing
    # -----------------------------------------------
    open (quake_marker, '>/Users/coblem/xplanet/markers/quake.txt') or die $!;

    # -----------------------------------------------
    # get the Atom feed from the USGS
    # -----------------------------------------------
    #my $source = '/Users/coblem/xplanet/quake/quake.xml'; #use this line only for local testing
    my $source = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.atom'; #use this line for feed testing

    my $feed = XML::FeedPP->new( $source );

    # get the date and time.  While not needed right now, this will help later with managing expired
    # data.

    # -----------------------------------------------
    # get date and time, day of the year
    # -----------------------------------------------
    my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
    $year += 1900;
    my $cur_time = "$wday $mday $mon $year, $hour:$min:$sec";

    # -----------------------------------------------
    # update the output file header
    # -----------------------------------------------
    my $dayofYear = localtime[7];

    print (quake_marker "# This Quake file created by quake_parsing_9", "\n");
    print (quake_marker "# Matt Coblentz; Perl version unknown", "\n");
    print (quake_marker "# For more information, see the USGS website", "\n");
    print (quake_marker "# Last Updated: ", $cur_time, "\n", "\# \n");

    # Incorporating a rules section so that we can extract the information from the CDATA section
    # and any other specific fields we might need.  First attempt will be to pull the 'Updated'
    # field data from the Atom feed.  I tried to follow the example in the CPAN summary but I'm
    # clearly lost.
    # PerlMonk Jenda pointed out that I should use the standard rules previously defined.  There were
    # other helpful comments about the author and update rules as well, which are now deprecated.

    my @rules = (
        _default => 'content',
        dd       => 'content trim',
    );
    my $parser = XML::Rules->new(rules => \@rules);

    # Need to convert the feed object to a string.  Thanks to PerlMonk Frozenwithjoy
    my $atom = $feed->to_string();
    $parser->parse( $atom );

    # This section extracts the title field, then performs string manipulations to extract the
    # long location data and the magnitude.  Funny that USGS does not have a magnitude field
    # in this feed.

    foreach my $quake( $feed->get_item() ) {
        my $title = $quake->title();
        my $place = substr($title, 8);
        my $magnitude = substr($title, 2,3);

        # Thanks to PerlMonk Jakeease for fixing these extractions:
        my $id = $quake->get( "id");
        my $update = $quake->get( "updated" );
        my $location = $quake->get( "georss:point");

        # ***********************************************
        # ***********************************************
        # THIS IS THE PROBLEM AREA!
        # ***********************************************
        # ***********************************************
        my $summary = $quake->get( "summary");
        print $summary;

        while ($summary = read_chunk_of_data()) {
            $parser->parse_chunk($summary);
        }
        my $data = $parser->last_chunk();
        my $dd = $data->get( "dd");
        print $dd, "\n";

    #   print (quake_marker $location, "\"\" color=Yellow symbolsize=65", "\n");
    #   print (quake_marker $location, "\"", $magnitude, "\" color=Yellow align=Above", "\n");
    #   print (quake_marker "\n");

        print (quake_marker "\n", "Magnitude ", $magnitude, " ", $place, "\n");
        print (quake_marker $location, " \"\" color=Yellow symbolsize=65", "\n");
        print (quake_marker $location, " \"", $magnitude, "\" color=Yellow align=Above", "\n");
        print "\n";
    }
    close (quake_marker);

What am I doing wrong? I'm obviously invoking something that isn't there, but what should it be?

Replies are listed 'Best First'.
Re: XML parsing with XML::Rules
by Khen1950fx (Canon) on Jun 16, 2013 at 03:30 UTC
    Your problem area appears to be missing
    $parser->parse_chunk($summary)
    Adding that to the mix:
    my $summary = $quake->get("summary");
    print $summary;
    $parser->parse_chunk($summary);
    while ( $summary = read_chunk_of_data() ) {
        $parser->parse_chunk($summary);
    }
    my $data = $parser->last_chunk();
    my $dd = $data->get("dd");
    print $dd, "\n";
Re: XML parsing with XML::Rules
by kcott (Bishop) on Jun 16, 2013 at 09:17 UTC

    G'day mcoblentz,

    It would appear that read_chunk_of_data() is poorly documented and is either pseudocode or a function you're supposed to write yourself.

    I haven't used this module previously; however, something like this seems to be the intent:

    $ perl -Mstrict -Mwarnings -e '
        use XML::Rules;
        use Data::Dumper;
        my @xml = qw{<some_tag> some content </some_tag>};
        my $parser = XML::Rules::->new(
            rules => [ _default => sub {$_[0] => $_[1]->{_content}} ]
        );
        for (@xml) {
            $parser->parse_chunk($_);
        }
        my $data = $parser->last_chunk();
        print Dumper $data;
    '
    $VAR1 = {
              'some_tag' => 'somecontent'
            };

    Someone who's actually used the module before may have a better answer.

    Update: Minor text change: s/write it yourself/write yourself/

    -- Ken
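
    Following kcott's reading that read_chunk_of_data() is a function you supply yourself, here is one possible homemade version: a sketch that reads fixed-size chunks from a filehandle and returns undef at end-of-file so a while() loop terminates. The function name matches the documentation's placeholder; the in-memory filehandle and chunk size are purely illustrative, and a real caller would feed each chunk to $parser->parse_chunk() instead of concatenating.

    ```perl
    use strict;
    use warnings;

    # Demo input: an in-memory filehandle standing in for a file or socket.
    my $xml = '<root><a>1</a><b>2</b></root>';
    open my $xml_fh, '<', \$xml or die $!;

    # One possible implementation of the placeholder from the docs:
    # return the next chunk of raw data, or undef at EOF.
    sub read_chunk_of_data {
        my $chunk;
        return read($xml_fh, $chunk, 8) ? $chunk : undef;
    }

    # Reassemble the chunks to show the loop shape; a real caller would
    # call $parser->parse_chunk($chunk) here instead.
    my $joined = '';
    while (defined(my $chunk = read_chunk_of_data())) {
        $joined .= $chunk;
    }
    print $joined eq $xml ? "ok\n" : "not ok\n";
    ```

    The key point is only the contract: something that hands back successive pieces of the document and a false value when the data runs out.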

      or a function you're supposed to write it yourself.

      XML::Rules does show read_chunk_of_data() in documentation for parse_chunk()

        or a function you're supposed to write it yourself.

        XML::Rules does show read_chunk_of_data() in documentation for parse_chunk()

        I've got no idea why you wrote that. There is no suggestion by anyone that it doesn't appear in the documentation. Perhaps you could elaborate on what you meant.

        I have changed "write it yourself" to "write yourself" which is what I originally intended; however, I can't see that either form changes the meaning with respect to what appears in the documentation.

        -- Ken

Re: XML parsing with XML::Rules
by poj (Abbot) on Jun 16, 2013 at 18:15 UTC

    Here is a simplified version of your script which uses XML::FeedPP to get the XML, XML::Rules to extract the data and HTML::TreeBuilder::XPath to extract the times from the summary. I have also included a simple regex to extract the times should you not be able to install the XPath module. Adapt as you require.

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use XML::FeedPP;
    use XML::Rules;
    use HTML::TreeBuilder::XPath;

    # input
    my $source = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.atom';
    my $atom_xml = XML::FeedPP::Atom->new( $source )->to_string();

    # output
    my $outfile = "quake.txt";
    open my $fh,'>',$outfile or die "$!";

    # parser
    my @rules = (
        _default => 'content',
        title    => \&title,
        entry    => \&report_item,
    );
    my $parser = XML::Rules->new(rules => \@rules);

    # process
    report_header();
    $parser->parse( $atom_xml );
    close $fh;

    sub title {
        my $title = $_[1]->{'_content'};
        'magnitude' => substr($title,2,3),
        'place'     => substr($title,8);
    }

    sub report_item {
        my $summary = $_[1]->{summary};

        # extract time from summary using XPath
        my $tree = HTML::TreeBuilder::XPath->new_from_content($summary);
        my @dd = $tree->findvalues('//dd');

        # extract time using regex
        my $t1;
        my $t2;
        if ($summary =~ m!<dt>Time</dt>
            <dd>(.*)\ UTC</dd>
            <dd>(.*)\ at\ epicenter</dd>!x){
            $t1 = $1;
            $t2 = $2;
        }

        print $fh <<EOF
    Place     : $_[1]->{place}
    Magnitude : $_[1]->{magnitude}
    Updated   : $_[1]->{updated}
    Location  : $_[1]->{'georss:point'}
    Time Xpath: $dd[0] $dd[1]
    Time regex: $t1 : $t2
    Summary   : $summary
    EOF
    }

    sub report_header {
        my $cur_time = localtime;
        print $fh <<EOF
    # This Quake file created by quake_parsing_9
    # Matt Coblentz; Perl version unknown
    # For more information, see the USGS website
    # Last Updated: $cur_time
    EOF
    }
    HTH
    poj
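
      poj's time-extraction regex can be checked in isolation against a sample summary string, independent of the feed and of the XPath module. This is only a sketch: the $summary string below is a hand-abbreviated fragment in the shape of the USGS markup, not live data, and the pattern is the same one used in the script above (/x makes the literal whitespace in the pattern insignificant, hence the escaped spaces).

      ```perl
      use strict;
      use warnings;

      # Sample fragment shaped like the <dl> block inside the feed's summary.
      my $summary = '<dl><dt>Time</dt><dd>2013-06-16 21:39:09 UTC</dd>'
                  . '<dd>2013-06-16 23:39:09 +02:00 at epicenter</dd></dl>';

      # Capture the UTC time and the epicenter-local time.
      my ($t1, $t2);
      if ($summary =~ m!<dt>Time</dt>
                        <dd>(.*)\ UTC</dd>
                        <dd>(.*)\ at\ epicenter</dd>!x) {
          ($t1, $t2) = ($1, $2);
      }
      print "UTC  : $t1\n";
      print "local: $t2\n";
      ```

      Note that a regex like this is tied to the exact markup the feed emits today; the XPath version is the more robust of the two should USGS reformat the summary.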
      Here is a simpler script which uses the get method on XML::FeedPP items as an alternative to XML::Rules.
      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use XML::FeedPP;
      use HTML::TreeBuilder::XPath;

      # input
      my $source = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.atom';

      # output
      my $outfile = "quake.txt";
      open my $fh,">",$outfile or die "$!";

      # process
      report_header();
      my $feed = XML::FeedPP->new( $source );
      foreach my $quake( $feed->get_item() ) {
          my $title     = $quake->get('title');
          my $magnitude = substr($title,2,3);
          my $place     = substr($title,8);
          my $updated   = $quake->get('updated');
          my $locn      = $quake->get('georss:point');
          my $summary   = $quake->get('summary');
          my $id        = $quake->get('id');

          # extract time from summary using XPath
          my $tree = HTML::TreeBuilder::XPath->new_from_content($summary);
          my @dd = $tree->findvalues('//dd');

          # extract time using regex
          my $t1;
          my $t2;
          if ($summary =~ m!<dt>Time</dt>
              <dd>(.*)\ UTC</dd>
              <dd>(.*)\ at\ epicenter</dd>!x){
              $t1 = $1;
              $t2 = $2;
          }

          print $fh <<EOF
      ID        : $id
      Title     : $title
      Place     : $place
      Magnitude : $magnitude
      Updated   : $updated
      Location  : $locn
      Summary   : $summary
      Time Xpath: $dd[0] $dd[1]
      Time regex: $t1 : $t2
      EOF
      }
      close $fh;

      sub report_header {
          my $cur_time = localtime;
          print $fh <<EOF
      # This Quake file created by quake_parsing_9
      # Matt Coblentz; Perl version unknown
      # For more information, see the USGS website
      # Last Updated: $cur_time
      EOF
      }
      poj
Re: XML parsing with XML::Rules
by jakeease (Friar) on Jun 17, 2013 at 09:12 UTC

    I tried this with your problem area:

    my $summary; # = $quake->get( "summary");
    # print $summary;
    while ($summary = $quake->get("summary")) {
        $parser->parse_chunk($summary);
    }
    and got this error message:

    junk after document element at line 1, column 601, byte 601 at C:/strawberry/perl/site/lib/XML/Rules.pm line 933.

    That is in XML::Rules's sub _parse_or_filter_chunk, and for me it raises the question of whether you need to read chunks at all. The error message I got is from an eval calling parse_more($string). A few lines up, near the beginning of the routine, is a line reading

    croak "This parser is already busy parsing a full document!"

    So the question is: have you read in the whole document, and if so, is there another method, say parse(), that should be used instead of parse_chunk()?

    UPDATE

    I tried it again, this way using parse:

    my $summary = $quake->get( "summary");
    print $summary;
    #while ($summary = $quake->get("summary")) {
        $parser->parse($summary);
    #}
    #my $data = $parser->last_chunk();
    #my $dd = $data->get( "dd");
    #print $dd, "\n";
    with the result:
    C:\Users\JKeys>perl \myperl\quake.pl
    # This Quake file created by quake_parsing_9
    # Matt Coblentz; Perl version unknown
    # For more information, see the USGS website
    # Last Updated: 1 17 5 2013, 4:34:55
    #
    junk after document element at line 1, column 601, byte 601 at C:/strawberry/perl/site/lib/XML/Rules.pm line 745.
    <p class="quicksummary"><a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#pager" title="PAGER estimated impact alert level" class="pager-green">PAGER - <strong class="roman">GREEN</strong></a> <a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#shakemap" title="ShakeMap maximum estimated intensity" class="mmi-V">ShakeMap - <strong class="roman">V</strong></a> <a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hsdj#dyfi" class="mmi-IV" title="Did You Feel It? maximum reported intensity (5 reports)">DYFI? - <strong class="roman">IV</strong></a></p><dl><dt>Time</dt><dd>2013-06-16 21:39:09 UTC</dd><dd>2013-06-16 23:39:09 +02:00 at epicenter</dd><dt>Location</dt><dd>34.491&deg;N 25.087&deg;E</dd><dt>Depth</dt><dd>37.85 km (23.52 mi)</dd></dl>

    So I'm still getting the "junk" message, this time from the parse method. I don't know if that's the feed, your code, or my tweaks. But it's sleepy time now.

      Hi,

      I tried originally with 'parse' and several other XML methods. Because I had already read in the XML document and extracted the CDATA material, the "chunk" becomes an incorrectly formed XML document: it does not have a single root element wrapping all the other elements (the CDATA content starts with a paragraph tag, which stops in the middle, then picks up with some table elements). That's where the 'junk' error comes from - Perl is complaining about a poorly formed XML document.
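
      One common workaround for that "junk after document element" symptom is to wrap the extracted fragment in a synthetic root element before handing it back to an XML parser. This is only a sketch of the idea: the fragment below is a made-up stand-in for the CDATA content, the <wrapper> tag name is arbitrary, and the closing regex check merely demonstrates that the wrapped string now has one top-level element (a real script would pass $wrapped to XML::Rules or XML::LibXML instead).

      ```perl
      use strict;
      use warnings;

      # A fragment with no single root, like the extracted CDATA summary.
      my $fragment = '<p>intro</p><dl><dt>Time</dt><dd>2013-06-16 21:39:09 UTC</dd></dl>';

      # Wrapping restores well-formedness at the top level; the tag name
      # is arbitrary as long as it does not clash with the content.
      my $wrapped = "<wrapper>$fragment</wrapper>";

      # Minimal check: the whole string is now one element from start to end.
      my ($root) = $wrapped =~ m{\A<(\w+)>.*</\1>\z}s;
      print defined $root ? "root: $root\n" : "still malformed\n";
      ```

      This only fixes the missing-root problem; if the fragment is really HTML rather than XML (unclosed tags, entities like &deg;), an HTML parser is still the safer tool, as discussed below in this thread.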

      I was a little embarrassed that I thought "read_in_some_data" was a real method - but the conversation has been really helpful - I honestly thought the whole documentation was trying to trick me and just couldn't figure out where my error was (silly me). Jenda had alluded to just picking up the CDATA fragments and re-parsing, which led me down a whole weird dead end. He has the right approach, I just interpreted his comments incorrectly.

      This XML stuff is "fussy". I'm hoping to wrap my head around all of this because having scripts like this will make the code overall much easier to maintain. Just find the feed, parse, and go.

      I'm just having trouble with the recursive bits of the overall process. The data changes and thus the details have to change.

      Thanks to all for pitching in. This has been a useful discussion about feeds, XML, etc.

      Matt

        It had slipped my mind that the summary was CDATA as I didn't look back at the previous post. And you're right, it's the explanation for the junk message. If Perl is complaining about a poorly formed XML document, it's because we are trying to convince it that $summary is XML.

        It isn't, of course, it's HTML. And that's what Jenda meant when he said

        If you want to split that into pieces you have to pass that string to another HTML or XML parser.

        I was about to suggest parsing $summary with LWP or HTML::Parser when I read poj's post. I like how he has simplified it and shown HTML::TreeBuilder handling $summary.

Node Type: perlquestion [id://1039168]
Approved by Jim