PerlMonks  

Re^5: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

by rowdog (Curate)
on Jun 28, 2010 at 19:07 UTC ( #846960=note )


in reply to Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

You still have $rss so you can find the item by searching. Maybe something like

    sub find_item {
        my $link = shift;
        # $rss->{items} is an array ref, so dereference it to walk the items
        for my $item ( @{ $rss->{items} } ) {
            return $item if $item->{link} eq $link;
        }
        return undef;    # no item has that link
    }
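If you end up looking up many links, a hash keyed on the link saves rescanning the list each time. This is only a sketch: it assumes `$rss->{items}` is an array ref of hash refs that each carry a `link` key, as the code above implies. The sample data and the `index_items_by_link` name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed feed, shaped like $rss->{items}.
my $rss = {
    items => [
        { title => 'First story',  link => 'http://example.com/1' },
        { title => 'Second story', link => 'http://example.com/2' },
    ],
};

# Build a link => item lookup table in one pass over the items.
sub index_items_by_link {
    my ($feed) = @_;
    my %by_link;
    for my $item ( @{ $feed->{items} } ) {
        next unless defined $item->{link};    # skip items without a link
        $by_link{ $item->{link} } = $item;
    }
    return \%by_link;
}

my $by_link = index_items_by_link($rss);
my $item    = $by_link->{'http://example.com/2'};
print $item->{title}, "\n" if $item;
```

The index has to be rebuilt after each parse, so it only pays off when you do several lookups per poll.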

On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use XML::RSS::LibXML;

    my $url          = 'http://www.usnews.com/rss/health-news/index.rss';
    my $rss_file     = '/tmp/.rss_download_file';
    my $website_name = "usnews";

    my %seen;
    my $client = LWP::UserAgent->new;
    my $rss    = XML::RSS::LibXML->new;

    while ( 1 ) {
        print "polling: $website_name url: $url\n";
        $client->mirror($url, $rss_file);    # be nice to the server
        $rss->parsefile($rss_file) or die $!;

        if ( !%seen ) {
            print "first listing\n";
        }

        foreach my $item ( @{ $rss->{items} } ) {
            $seen{ $item->{link} }++ and next;    # already saw this item

            # do stuff with the new item
            print $item->{title}, "\n";
            print "$item->{pubDate}\n";
            #$client->get() ...
        }

        sleep 15 * 60;    # 15 minutes, play nice
    }

As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
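To make use of that, you can look at the response `mirror` hands back. A rough sketch, assuming you only care about three outcomes; the `poll_feed` name and the 'updated'/'unchanged' return values are my own invention, not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Mirror $url into $file and report what happened.
# Returns 'updated' when a fresh copy was saved, 'unchanged' on a 304.
sub poll_feed {
    my ( $client, $url, $file ) = @_;
    my $response = $client->mirror( $url, $file );

    return 'updated'   if $response->is_success;     # new copy written to $file
    return 'unchanged' if $response->code == 304;    # Not Modified, old copy kept
    die 'mirror failed: ' . $response->status_line . "\n";
}
```

In the polling loop you could then skip the parse whenever `poll_feed($client, $url, $rss_file)` returns 'unchanged', and just sleep until the next round.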


Re^6: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 28, 2010 at 19:41 UTC
    Finding Item:

    Some RSS sites do not provide a link for each new item; the items are embedded as headlines and stories. In that case I need to parse the item tag right from there.

    LWP::UserAgent

    Is it reliable for me to use the modification time? Does the modification time change when the page changes, i.e. is that a requirement for RSS feeds?

      Is it reliable for me to use the modification time? Does the modification time change when the page changes, i.e. is that a requirement for RSS feeds?

      Yes and no. There are recommendations and requirements (see the RFC), but web servers and sites frequently ignore them.

      Finding Item: Some RSS sites do not provide a link for each new item; the items are embedded as headlines and stories. In that case I need to parse the item tag right from there.

      I imagine there's a better way to do this but you can look at the structure of that particular file and figure out what you need to pull out. I see $rss->{channel}->{link} as being the kind of thing you're asking about, but there's no item there, just a link (and other elements of the channel).
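One way to look at the structure is to dump it with Data::Dumper. The sketch below uses a hand-built stand-in for the parsed feed, since the real layout depends on the feed and on what XML::RSS::LibXML makes of it. The `item_key` helper is hypothetical: it falls back to title plus pubDate when an item has no link, which is only an assumption about what identifies an item in such a feed.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical stand-in for a parsed feed; a real one comes from parsefile().
my $rss = {
    channel => { title => 'Example feed', link => 'http://example.com/' },
    items   => [
        { title => 'Story with link', link    => 'http://example.com/a' },
        { title => 'Headline only',   pubDate => 'Mon, 28 Jun 2010 19:00:00 GMT' },
    ],
};

$Data::Dumper::Sortkeys = 1;    # stable key order makes the dump easier to read
print Dumper( $rss->{channel}, $rss->{items}[1] );

# Pick something to identify an item by: the link if present,
# otherwise the title and pubDate joined together.
sub item_key {
    my ($item) = @_;
    return $item->{link} if defined $item->{link};
    return join '|', map { defined $_ ? $_ : '' } @{$item}{qw(title pubDate)};
}

print item_key($_), "\n" for @{ $rss->{items} };
```

A key like this could stand in for `$item->{link}` when tracking seen items, as long as the title and pubDate stay stable between polls.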

      LWP::UserAgent: Is it reliable for me to use the modification time? Does the modification time change when the page changes, i.e. is that a requirement for RSS feeds?

      Yes and no, like Anonymous Monk said.
