http://www.perlmonks.org?node_id=846960


in reply to Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

You still have $rss so you can find the item by searching. Maybe something like

sub find_item {
    my $link = shift;
    # $rss->{items} is an array reference, so dereference it to walk the items
    for my $item ( @{ $rss->{items} } ) {
        $item->{link} eq $link and return $item;
    }
    return;    # no item with that link
}
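
If you end up looking up many links, searching the whole list each time gets slow. Here is a minimal sketch of indexing the items by link once instead. It assumes each item is a hash reference with a link key, like the entries in $rss->{items}; the name index_by_link and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Build a lookup table from link to item so each lookup is a hash access.
# Items without a link are skipped; if two items share a link, the first wins.
sub index_by_link {
    my ($items) = @_;
    my %by_link;
    for my $item (@$items) {
        my $link = $item->{link};
        next unless defined $link;
        $by_link{$link} //= $item;
    }
    return \%by_link;
}

# Hypothetical items standing in for @{ $rss->{items} }.
my @items = (
    { title => 'First story',  link => 'http://example.com/1' },
    { title => 'Second story', link => 'http://example.com/2' },
);

my $by_link = index_by_link(\@items);
my $item    = $by_link->{'http://example.com/2'};
print $item->{title}, "\n" if $item;
```

The trade-off is a little extra memory for the hash in exchange for not rescanning the item list on every lookup.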

On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use XML::RSS::LibXML;

my $url          = 'http://www.usnews.com/rss/health-news/index.rss';
my $rss_file     = '/tmp/.rss_download_file';
my $website_name = "usnews";

my %seen;

my $client = LWP::UserAgent->new;
my $rss    = XML::RSS::LibXML->new;

while ( 1 ) {
    print "polling: $website_name url: $url\n";

    $client->mirror($url, $rss_file);    # be nice to the server
    $rss->parsefile($rss_file) or die $!;

    if ( !%seen ) {
        print "first listing\n";
    }

    foreach my $item ( @{ $rss->{items} } ) {
        $seen{ $item->{link} }++ and next;    # already saw this item

        # do stuff with the new item
        print $item->{title}, "\n";
        print "$item->{pubDate}\n";
        #$client->get() ...
    }

    sleep 15 * 60;    # 15 minutes, play nice
}
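
To see the %seen idea on its own, without any network traffic, here is a small sketch that runs two pretend polls through the same hash. The sub name new_items and the example.com links are hypothetical; only the hash trick comes from the script above.

```perl
use strict;
use warnings;

my %seen;

# Return only the items whose link has not been seen before,
# and remember every link we are handed.
sub new_items {
    my ($items) = @_;
    my @new;
    for my $item (@$items) {
        next if $seen{ $item->{link} }++;    # already saw this link
        push @new, $item;
    }
    return @new;
}

# Hypothetical data for two successive polls of the same feed.
my @first_poll  = ( { link => 'http://example.com/a' }, { link => 'http://example.com/b' } );
my @second_poll = ( { link => 'http://example.com/b' }, { link => 'http://example.com/c' } );

my @new_first  = new_items(\@first_poll);     # both items are new
my @new_second = new_items(\@second_poll);    # only .../c is new

print scalar(@new_first), " new, then ", scalar(@new_second), " new\n";
```

The post-increment returns the old count, so the first time a link shows up the test is false and the item is kept; every later sighting is skipped.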

As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
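
Since mirror returns an HTTP::Response, you can also skip the parse entirely when nothing changed. The following is only a sketch: the helper name feed_state is made up, and hand-built HTTP::Response objects stand in for what the server would send, so it runs without touching the network.

```perl
use strict;
use warnings;
use HTTP::Response;
use HTTP::Status qw(:constants);

# Decide what to do with the response returned by $client->mirror(...).
sub feed_state {
    my ($response) = @_;
    return 'unchanged' if $response->code == HTTP_NOT_MODIFIED;    # 304
    return 'updated'   if $response->is_success;                   # 2xx
    return 'error';
}

# In the polling loop this would be:
#   my $response = $client->mirror($url, $rss_file);
# Here hand-built responses stand in for the server's replies.
for my $code ( 200, 304, 500 ) {
    my $response = HTTP::Response->new($code);
    print "$code: ", feed_state($response), "\n";
}
```

In the loop you would only call parsefile when the state is 'updated', and just sleep otherwise.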

Re^6: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 28, 2010 at 19:41 UTC
    Finding Item:

    There are some RSS websites that do not have links as new items; they are embedded as headlines and stories. In that case I need to parse the Item tag right from there.

    LWP::UserAgent

    Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

      Finding Item: There are some RSS websites that do not have links as new items; they are embedded as headlines and stories. In that case I need to parse the Item tag right from there.

      I imagine there's a better way to do this but you can look at the structure of that particular file and figure out what you need to pull out. I see $rss->{channel}->{link} as being the kind of thing you're asking about, but there's no item there, just a link (and other elements of the channel).

      LWP::UserAgent: Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

      Yes and no, like Anonymous Monk said.

      Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

      Yes and no. There are recommendations and requirements (see RFC), but web servers and sites frequently ignore them.