PerlMonks  

Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

by mr_p (Scribe)
on Jun 27, 2010 at 02:38 UTC


in reply to Re^3: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

I keep polling the same RSS feed and compare its links against the ones from the last poll. Whenever new links or items have been added, I want to parse the Item tag of only the new items.


Re^5: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by rowdog (Curate) on Jun 28, 2010 at 19:07 UTC

    I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

    You still have $rss so you can find the item by searching. Maybe something like

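    # Return the item whose link matches $link, or nothing if there is none.
    # Relies on $rss still being in scope.
    sub find_item {
        my $link = shift;
        for my $item ( @{ $rss->{items} } ) {
            $item->{link} eq $link and return $item;
        }
        return;
    }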
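If you need to do that lookup many times per poll, a hash keyed by link avoids rescanning the list each time. This is only a minimal sketch: `index_by_link` is a hypothetical helper, and the items are plain hashes standing in for `$rss->{items}`, with made-up titles and URLs.

```perl
use strict;
use warnings;

# Build a link => item lookup table from an array ref of item hashes.
sub index_by_link {
    my ($items) = @_;
    my %by_link;
    for my $item ( @{$items} ) {
        next unless defined $item->{link};     # skip items without a link
        $by_link{ $item->{link} } //= $item;   # keep the first item per link
    }
    return \%by_link;
}

# Stand-in data shaped like $rss->{items} (assumption, not a real feed).
my $items = [
    { title => 'First story',  link => 'http://example.com/1' },
    { title => 'Second story', link => 'http://example.com/2' },
];

my $by_link = index_by_link($items);
my $item    = $by_link->{'http://example.com/2'};
print $item->{title}, "\n" if $item;
```

With the real feed you would presumably pass `$rss->{items}` instead of the stand-in array.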

    On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use XML::RSS::LibXML;

    my $url          = 'http://www.usnews.com/rss/health-news/index.rss';
    my $rss_file     = '/tmp/.rss_download_file';
    my $website_name = "usnews";
    my %seen;

    my $client = LWP::UserAgent->new;
    my $rss    = XML::RSS::LibXML->new;

    while ( 1 ) {
        print "polling: $website_name url: $url\n";
        $client->mirror($url, $rss_file);       # be nice to the server
        $rss->parsefile($rss_file) or die $!;

        if ( !%seen ) {
            print "first listing\n";
        }

        foreach my $item ( @{ $rss->{items} } ) {
            $seen{ $item->{link} }++ and next;  # already saw this item

            # do stuff with the new item
            print $item->{title}, "\n";
            print "$item->{pubDate}\n";
            #$client->get() ...
        }

        sleep 15 * 60;                          # 15 minutes, play nice
    }

    As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
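Since `mirror` returns the response object, you could also check it and skip the reparse when the server says nothing changed. A rough sketch, assuming the server honors "If-Modified-Since" and answers 304: `mirror_status` and `poll_once` are hypothetical helpers, not part of LWP.

```perl
use strict;
use warnings;

# Classify the response returned by LWP::UserAgent's mirror method.
sub mirror_status {
    my ($response) = @_;
    return 'unchanged' if $response->code == 304;   # Not Modified
    return 'updated'   if $response->is_success;    # any 2xx: file was rewritten
    return 'error';
}

# Fetch $url into $file and report what happened (hypothetical wrapper).
sub poll_once {
    my ($url, $file) = @_;
    require LWP::UserAgent;                         # loaded only when polling
    my $response = LWP::UserAgent->new->mirror($url, $file);
    return mirror_status($response);
}
```

In the polling loop you might call `poll_once($url, $rss_file)` and only run `parsefile` when it returns 'updated'. Servers that ignore the header will simply report 'updated' every time, so the `%seen` hash is still needed.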

      Finding Item:

      There are some RSS websites that do not have links as new items; they are embedded as headlines and stories. In that case I need to parse the Item tag right from there.

      LWP::UserAgent

      Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

        Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

        Yes and no. There are recommendations and requirements (see the RFC), but web servers and sites frequently ignore them.

        Finding Item: There are some RSS websites that do not have links as new items; they are embedded as headlines and stories. In that case I need to parse the Item tag right from there.

        I imagine there's a better way to do this but you can look at the structure of that particular file and figure out what you need to pull out. I see $rss->{channel}->{link} as being the kind of thing you're asking about, but there's no item there, just a link (and other elements of the channel).
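For feeds whose items carry no link, one option is to key the "seen" hash on whatever identifying field the item does have. The sketch below is an assumption-heavy illustration: `item_key` and `new_items` are hypothetical helpers, the items are plain hashes, and the fallback order (link, then guid, then title plus pubDate) is a guess that should be checked against the structure of the actual feed.

```perl
use strict;
use warnings;

# Pick an identifier for an item: link, then guid, then title + pubDate.
sub item_key {
    my ($item) = @_;
    for my $field (qw(link guid)) {
        my $value = $item->{$field};
        return "$field:$value" if defined $value && length $value;
    }
    return join '|', 'title', map { $_ // '' } @{$item}{qw(title pubDate)};
}

# Return only the items whose key has not been recorded in %$seen yet,
# recording each key as a side effect.
sub new_items {
    my ($seen, $items) = @_;
    return grep { !$seen->{ item_key($_) }++ } @{$items};
}

# Stand-in data for two successive polls (made-up values).
my %seen;
my @first  = new_items( \%seen, [ { title => 'A', pubDate => 'Mon' },
                                  { title => 'B', link => 'http://example.com/1' } ] );
my @second = new_items( \%seen, [ { title => 'A', pubDate => 'Mon' },
                                  { title => 'C', pubDate => 'Tue' } ] );
print scalar(@first), " new, then ", scalar(@second), " new\n";
```

Two different stories with the same title and date would collide under the last fallback, so treat that branch as a last resort.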

        LWP::UserAgent: Is it reliable for me to use the modification time? Does the modification time change when the page changes, meaning is that an RSS feed requirement?

        Yes and no, like Anonymous Monk said.
