PerlMonks  

One more parsing ATOM question

by mcoblentz (Scribe)
on Jun 10, 2013 at 23:24 UTC ( [id://1038173]=perlquestion )

mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I'm trying to parse an Atom feed and only want the data from a few fields: updated, georss:point (the location in +/- format), and the time of the occurrence, which is inside a CDATA section.

I was thinking of using XML::Rules to parse out the georss:point field and from there decide how to extract and parse the CDATA time stamp, but I can't get past the initial run. I get a:

"not well-formed (invalid token) at line 1, column 25, byte 25 at quake_parsing_4a1.pl line 22" response when I add in the rules and try to parse the XML. What am I doing wrong?

I welcome suggestions on how to debug this. Thanks for your suggestions!

A snippet of the feed:

<entry>
  <id>urn:earthquake-usgs-gov:us:c000hkbe</id>
  <title>M 4.9 - 71km SW of Paita, Peru</title>
  <updated>2013-06-10T16:39:37.373Z</updated>
  <link rel="alternate" type="text/html" href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hkbe"/>
  <link rel="alternate" type="application/cap+xml" href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hkbe.cap"/>
  <summary type="html">
    <![CDATA[
    <p class="quicksummary"><a href="http://earthquake.usgs.gov/earthquakes/eventpage/usc000hkbe#dyfi" class="mmi-I" title="Did You Feel It? maximum reported intensity (0 reports)">DYFI? - <strong class="roman">I</strong></a></p><dl><dt>Time</dt><dd>2013-06-10 14:21:17 UTC</dd><dd>2013-06-10 09:21:17 -05:00 at epicenter</dd><dt>Location</dt><dd>5.545&deg;S 81.572&deg;W</dd><dt>Depth</dt><dd>47.92 km (29.78 mi)</dd></dl>
    ]]>
  </summary>
  <georss:point>-5.5453 -81.5723</georss:point>
  <georss:elev>-47920</georss:elev>
  <category label="Age" term="Past Day"/>
  <category label="Magnitude" term="Magnitude 4"/>
</entry>

and then my code:

#!/usr/bin/perl -w

use XML::FeedPP;
use XML::LibXML;
use XML::Rules;
use LWP::Simple;
use strict;
use warnings;

# get the Atom feed from the USGS
my $source = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.atom';
my $feed = XML::FeedPP->new( $source );

# Incorporating a rules section so that we can extract the information from the CDATA section
# and any other specific fields we might need.  First attempt will be to pull the 'Updated'
# field data from the Atom feed.  I tried to follow the example in the CPAN summary but I'm
# clearly lost.

my @rules = (
    _default => sub {$_[0] => $_[1]->{_content}},
    id => sub {$_[0] => $_[1]->{_content}},
    author => undef, # ignoring the author because I already know it's the USGS
    updated => sub {print "$_[1]->{updated}\n"; },
);

my $parser = XML::Rules->new(rules => \@rules);
$parser->parse( $feed );

# This section extracts the title field, then performs string manipulations to extract the
# long location data and the magnitude.  Funny that USGS does not have a magnitude field
# in this feed.

foreach my $quake( $feed->get_item() ) {
    my $title = $quake->title();
    my $place = substr($title, 8);
    my $magnitude = substr($title, 2,3);
    # my $id = $quake->get( $id );  # note that my attempt to get field data did not work.
    # my $update = $quake->updated();
    # my $location = $quake->'georss:point'();
    print "Magnitude ", $magnitude, " about ", $place, "\n";
}

thoughts?

Replies are listed 'Best First'.
Re: One more parsing ATOM question
by frozenwithjoy (Priest) on Jun 11, 2013 at 02:30 UTC

    I've not used these modules, but from looking at them, my $feed = XML::FeedPP->new( $source ); creates a feed object, whereas $parser->parse( $feed ); expects a string or an IO handle. You are therefore getting an error because you are giving the parser something it doesn't expect.

    Converting the feed object to a string solves your issue:

    my $atom = $feed->to_string();
    $parser->parse( $atom );

    Output following this fix:

    Magnitude 4.9 about Galapagos Triple Junction region
    Magnitude 5.0 about 75km ESE of Pangai, Tonga
    Magnitude 4.9 about 71km SW of Paita, Peru
    Magnitude 4.7 about 68km E of Sarangani, Philippines
    Magnitude 4.6 about 60km WNW of Lata, Solomon Islands
    Magnitude 4.8 about 170km ESE of City of Saint Paul, Alaska
    Magnitude 4.6 about South of the Fiji Islands
    Magnitude 4.6 about 98km NNW of Yunaska Island, Alaska

    P.S. I get some Use of uninitialized value in concatenation (.) or string at ... warnings in your rules section.
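
A possible reading of the exact position in the original error ("line 1, column 25"): when an object is used where a string is expected, Perl stringifies it as Class=HASH(0x...), and "=" is not a legal character at that point in an XML document. The sketch below uses a hypothetical blessed hash as a stand-in for the feed object and assumes its class is XML::FeedPP::Atom::Atom10; the thread does not show the class name, so treat that as an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the feed object: an empty hash blessed into
# the class name XML::FeedPP is ASSUMED to use for Atom 1.0 feeds.
my $fake_feed = bless {}, 'XML::FeedPP::Atom::Atom10';

# This is the text a parser sees when it is handed the object itself
# instead of the XML, e.g. XML::FeedPP::Atom::Atom10=HASH(0x...)
my $as_string = "$fake_feed";

# Position of the first character that cannot be part of an XML name.
my $offset = index $as_string, '=';

print "parser input: $as_string\n";
print "'=' found at offset $offset\n";
```

If the assumption about the class name holds, the offset matches the column reported in the error, which is consistent with the parser having been given the stringified object rather than the feed's XML.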

      Oh! I didn't realize I was switching types. Thank you!
Re: One more parsing ATOM question
by jakeease (Friar) on Jun 11, 2013 at 08:38 UTC

    Inspired by frozenwithjoy, I played with it a little, arriving at

    #!/usr/bin/perl -w

    use XML::FeedPP;
    use XML::LibXML;
    use XML::Rules;
    use LWP::Simple;
    use strict;
    use warnings;

    # get the Atom feed from the USGS
    my $source = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.atom';
    my $feed = XML::FeedPP->new( $source );

    # Incorporating a rules section so that we can extract the information from the CDATA section
    # and any other specific fields we might need.  First attempt will be to pull the 'Updated'
    # field data from the Atom feed.  I tried to follow the example in the CPAN summary but I'm
    # clearly lost.

    my @rules = (
        _default => sub {$_[0] => $_[1]->{_content}},
        id => sub {$_[0] => $_[1]->{_content}},
        # author => undef, # ignoring the author because I already know it's the USGS
        author => sub {$_[0] => $_[1]->{_content}},
        updated => sub {print "$_[1]->{updated}\n"; },
    );

    my $parser = XML::Rules->new(rules => \@rules);
    my $atom = $feed->to_string();
    $parser->parse( $atom );

    # This section extracts the title field, then performs string manipulations to extract the
    # long location data and the magnitude.  Funny that USGS does not have a magnitude field
    # in this feed.

    foreach my $quake( $feed->get_item() ) {
        my $title = $quake->title();
        my $place = substr($title, 8);
        my $magnitude = substr($title, 2,3);
        my $id = $quake->get( "id");
        my $update = $quake->get( "updated" );
        my $location = $quake->get( "georss:point");
        print "Magnitude ", $magnitude, " about ", $place, "\n";
        print "id is $id, updated at $update, georss:point is $location \n";
    }

    Running it,

    C:\Users\jkeys>perl c:\myperl\usgsfeed_norules.pl
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Use of uninitialized value in concatenation (.) or string at c:\myperl\usgsfeed_norules.pl line 25.
    Magnitude 4.9 about 67km SW of Painan, Indonesia
    id is urn:earthquake-usgs-gov:us:c000hl79, updated at 2013-06-11T03:00:47.746Z, georss:point is -1.8063 100.1636
    Magnitude 4.7 about 131km NNE of Calama, Chile
    id is urn:earthquake-usgs-gov:us:c000hl5v, updated at 2013-06-11T03:13:17.731Z, georss:point is -21.3693 -68.4452
    Magnitude 4.9 about Galapagos Triple Junction region
    id is urn:earthquake-usgs-gov:us:c000hl36, updated at 2013-06-11T06:39:02.261Z, georss:point is 1.3674 -101.2936
    Magnitude 5.0 about 75km ESE of Pangai, Tonga
    id is urn:earthquake-usgs-gov:us:c000hl2w, updated at 2013-06-11T06:31:30.385Z, georss:point is -20.0901 -173.6971
    Magnitude 4.9 about 71km SW of Paita, Peru
    id is urn:earthquake-usgs-gov:us:c000hkbe, updated at 2013-06-10T22:23:52.275Z, georss:point is -5.5453 -81.5723
    Magnitude 4.7 about 68km E of Sarangani, Philippines
    id is urn:earthquake-usgs-gov:us:c000hk85, updated at 2013-06-10T19:54:48.457Z, georss:point is 5.2973 126.0703
    Magnitude 4.6 about 60km WNW of Lata, Solomon Islands
    id is urn:earthquake-usgs-gov:us:c000hk60, updated at 2013-06-10T16:28:39.451Z, georss:point is -10.4972 165.3268

    This provides a few examples of accessors. I didn't attempt to dive into the rules, other than to comment out the author => undef rule, which quieted some of the uninitialized value warnings.

    Hope this helps.
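
As a small follow-on to the accessor examples above, here is a minimal sketch of pulling the magnitude, place, latitude, and longitude out of the strings those accessors return. It uses a regular expression instead of the fixed substr offsets, on the assumption that titles always look like "M <magnitude> - <place>" (that holds for the sample entry; titles without the " - " separator would need separate handling). The sample values are copied from the feed snippet in the question.

```perl
use strict;
use warnings;

# Sample values copied from the feed snippet in the question.
my $title = 'M 4.9 - 71km SW of Paita, Peru';
my $point = '-5.5453 -81.5723';

# ASSUMED title layout: "M <magnitude> - <place>"; capture both parts.
my ($magnitude, $place) = $title =~ /^M\s+(\d+(?:\.\d+)?)\s+-\s+(.+)$/;

# georss:point holds "<latitude> <longitude>" separated by whitespace.
my ($lat, $lon) = split ' ', $point;

print "Magnitude $magnitude about $place at lat $lat, lon $lon\n";
```

The regex keeps working if the magnitude ever has a different number of digits, which fixed substr offsets would not.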

Re: One more parsing ATOM question
by Jenda (Abbot) on Jun 11, 2013 at 13:11 UTC

    With the other problems solved, let's look at the rules.

    1. The _default => sub {$_[0] => $_[1]->{_content}}, is better written as _default => 'content',. There are several builtin rules, so if what you need to do with a tag matches one of them, it's better to use the builtin instead of a custom rule.

    2. If the rule for a tag matches the _default rule, you don't need the tag-specific rule.

    3. The rule for the <updated> tag should be updated => sub {print "$_[1]->{_content}\n";},. You want to print the contents of the tag, not the value of its (nonexistent) attribute named "updated".

    4. Once you need to work with more of the data you'll probably replace the rule for <updated> with the builtin rule "content" and specify a custom rule for the <entry> tag. In that rule all the contents of the child tags will be available in the $_[1] hashref as $_[1]->{childtagname}.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.
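
The four points above could be sketched roughly as follows. This is an illustration under stated assumptions, not the original poster's final code: the XML is a trimmed-down sample shaped like the snippet in the question, the @quakes array and the hash keys stored in it are hypothetical names, and it assumes XML::Rules is used without its namespaces option, so that the tag name stays "georss:point".

```perl
use strict;
use warnings;
use XML::Rules;

# Trimmed-down sample in the shape of the feed snippet from the question.
my $xml = <<'XML';
<feed>
  <entry>
    <id>urn:earthquake-usgs-gov:us:c000hkbe</id>
    <title>M 4.9 - 71km SW of Paita, Peru</title>
    <updated>2013-06-10T16:39:37.373Z</updated>
    <georss:point>-5.5453 -81.5723</georss:point>
  </entry>
</feed>
XML

my @quakes;    # hypothetical collector, filled in by the entry rule

my $parser = XML::Rules->new(
    rules => [
        # Builtin rule: pass each tag's text up to its parent as tagname => text.
        _default => 'content',

        # Custom rule: when </entry> is reached, the children's contents
        # are available in the $_[1] hashref under their tag names.
        entry => sub {
            my ($tag, $data) = @_;
            push @quakes, {
                id      => $data->{id},
                updated => $data->{updated},
                point   => $data->{'georss:point'},
            };
            return;    # pass nothing up to <feed>
        },
    ],
);

$parser->parse($xml);

print "$_->{id} updated $_->{updated} at $_->{point}\n" for @quakes;
```

No tag-specific rules are needed for id, updated, or georss:point here, because the _default rule already covers them.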

      Jenda,

      Thanks for your suggestions on the rules. I've gotten to the point where I'm trying to extract values out of the CDATA field. I've tried a lot of different ideas (HTML tables, simple HTML extraction, stripping tags, regular expressions, etc.) but I think that using the ::Rules engine would simply be the most straightforward. I've read your CPAN writeup on ::Rules (are you the author? Very cool) and studied it, but I'm not quite sure how best to proceed.

      I can extract the CDATA content and end up with a resultant set of tags and values. Your comment leads me to believe that I can create a hash of the tags and values then pick the ones I want. That seems to be the exact discussion in the ::Rules section about addresses, streets, Larry Wall, multiple tags and hashrefs. But I don't understand the discussion in that section, can you expand further?

      Your XML::Rules section, quoted below, would seem to be the relevant part.

      our %states = (
          AL => 'Alabama',
          AK => 'Alaska',
          ...
      );
      ...
      state => sub {return 'state' => $states{$_[1]->{_content}}; }

      or

      address => sub {
          if (exists $_[1]->{id}) {
              $sthFetchAddress->execute($_[1]->{id});
              my $addr = $sthFetchAddress->fetchrow_hashref();
              $sthFetchAddress->finish();
              return 'address' => $addr;
          } else {
              return 'address' => $_[1];
          }
      }

        In XML, these two are equivalent: <foo>&lt;bar/&gt;</foo> and <foo><![CDATA[<bar/>]]></foo>. Thus the content of the <summary> tag is the "<p class="quicksummary"><a href="http://earthquake.usgs...". If you want to split that into pieces you have to pass that string to another HTML or XML parser. It's like a box that, apart from other things, contains another box so after you've opened the outer box, you have to extract the inner box and open it as well.

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.
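
To make the "box inside a box" idea concrete, here is a minimal sketch of opening the inner box once the outer parser has handed over the content of <summary> as a plain string. Note the swap: instead of a second HTML or XML parser, as suggested above, it uses regular expressions as a shortcut, which only works while the summary keeps the simple <dt>/<dd> layout seen in the question. The %field hash is a hypothetical name, and the sample string is shortened from the feed snippet.

```perl
use strict;
use warnings;

# Decoded content of <summary>, shortened from the snippet in the question.
my $summary = '<p class="quicksummary">DYFI? - <strong class="roman">I</strong></p>'
            . '<dl><dt>Time</dt><dd>2013-06-10 14:21:17 UTC</dd>'
            . '<dd>2013-06-10 09:21:17 -05:00 at epicenter</dd>'
            . '<dt>Location</dt><dd>5.545&deg;S 81.572&deg;W</dd>'
            . '<dt>Depth</dt><dd>47.92 km (29.78 mi)</dd></dl>';

# Walk the definition list: each <dt> label is followed by one or more <dd>s.
my %field;    # hypothetical: label => array ref of its <dd> values
while ($summary =~ m{<dt>(.*?)</dt>((?:\s*<dd>.*?</dd>)+)}gs) {
    my ($label, $values) = ($1, $2);
    $field{$label} = [ $values =~ m{<dd>(.*?)</dd>}gs ];
}

print "UTC time:   $field{Time}[0]\n";
print "Local time: $field{Time}[1]\n";
print "Depth:      $field{Depth}[0]\n";
```

For anything beyond this fixed layout, a real HTML parser is the safer choice, since the summary is HTML (for example, it contains &deg;, which a strict XML parser would not accept without an entity declaration).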

Node Type: perlquestion [id://1038173]
Approved by frozenwithjoy