PerlMonks  

Re^3: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

by rowdog (Curate)
on Jun 26, 2010 at 18:12 UTC (#846700)


in reply to Re^2: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

So, you're parsing an RSS file and then, for each item, fetching the link. What you get back is HTML, not RSS, so no, I don't think you'll get far trying to process the links with XML::RSS::LibXML.

I'm not sure what you plan to do with the HTML documents but you already have XML::LibXML loaded into RAM so you could use it to parse the HTML:

    use XML::LibXML;

    my $dom = XML::LibXML->load_html(
        location => $fileName,
        recover  => 1,    # handle marginal HTML
    );
    print $dom->toString;

The parser options for load_html are documented in XML::LibXML::Parser.
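As a rough illustration of what could come next, the sketch below parses an HTML string instead of a file and pulls out the page title and the link targets with XPath. The sample markup and the helper name extract_links are made up for the example; the string, recover, and suppress_errors options are among those described in XML::LibXML::Parser.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return the title and all href values found in an HTML string.
sub extract_links {
    my ($html) = @_;
    my $dom = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,    # keep going on marginal HTML
        suppress_errors => 1,    # do not print parser complaints
    );
    my $title = $dom->findvalue('//title');
    my @links = map { $_->value } $dom->findnodes('//a/@href');
    return ( $title, @links );
}

# Made-up sample document, only for illustration.
my $sample = '<html><head><title>Health News</title></head>'
           . '<body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>';
my ( $title, @links ) = extract_links($sample);
print "$title: @links\n";
```

Once the page is a DOM, any other XPath expression can be used the same way to reach the part of the document you care about.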


Reaped: Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by NodeReaper (Curate) on Jun 27, 2010 at 02:36 UTC
Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 27, 2010 at 02:38 UTC
    I just keep polling the same RSS feed and compare its links with those from the last poll. Whenever a new link or item has been added, I only want to parse the item tag of the new item.

      I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

      You still have $rss so you can find the item by searching. Maybe something like

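      # Return the item whose link matches $link, or nothing if there is none.
      # Relies on $rss being in scope.
      sub find_item {
          my $link = shift;
          for my $item ( @{ $rss->{items} } ) {    # walk the items, not the array ref
              return $item if defined $item->{link} and $item->{link} eq $link;
          }
          return;
      }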

      On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

      #!/usr/bin/perl
      use strict;
      use warnings;

      use LWP::UserAgent;
      use XML::RSS::LibXML;

      my $url          = 'http://www.usnews.com/rss/health-news/index.rss';
      my $rss_file     = '/tmp/.rss_download_file';
      my $website_name = "usnews";

      my %seen;
      my $client = LWP::UserAgent->new;
      my $rss    = XML::RSS::LibXML->new;

      while ( 1 ) {
          print "polling: $website_name url: $url\n";
          $client->mirror($url, $rss_file);    # be nice to the server
          $rss->parsefile($rss_file) or die $!;

          if ( !%seen ) {
              print "first listing\n";
          }

          foreach my $item ( @{ $rss->{items} } ) {
              $seen{ $item->{link} }++ and next;    # already saw this item

              # do stuff with the new item
              print $item->{title}, "\n";
              print "$item->{pubDate}\n";
              #$client->get() ...
          }

          sleep 15 * 60;    # 15 minutes, play nice
      }

      As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
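Since mirror returns the HTTP::Response object, the loop could also skip re-parsing when nothing changed. The following is only a sketch under that assumption: the helper name refresh_feed is made up, and it treats a 304 status as "unchanged" and any other non-success status as fatal, which may be stricter than you want.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: mirror $url into $file and report whether the
# local copy was refreshed. Returns 1 if new content was written,
# 0 if the server answered 304 Not Modified, and dies on anything else.
sub refresh_feed {
    my ( $ua, $url, $file ) = @_;
    my $response = $ua->mirror( $url, $file );
    return 0 if $response->code == 304;    # unchanged since the last fetch
    return 1 if $response->is_success;     # file was (re)written
    die 'fetch failed: ' . $response->status_line . "\n";
}

# Possible use inside the polling loop, with the names from the script above:
#   if ( refresh_feed( $client, $url, $rss_file ) ) {
#       $rss->parsefile($rss_file);
#       # ... look for unseen items ...
#   }
```

Whether a 304 ever comes back depends on the server honouring "If-Modified-Since", so the %seen check is still needed either way.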

        Finding Item:

        Some RSS sites do not provide links for new items; the headlines and stories are embedded in the feed itself. In that case I need to parse the item tag right there.
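For feeds like that, one possible approach (an assumption, not something proposed in the thread) is to key the seen-hash on whatever identifier the item does have and fall back to a digest of its text. In this sketch the helper name item_key is hypothetical, the items are plain made-up hashes standing in for the parsed feed, and the guid, title, and description field names are assumed to be present the way link is.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: pick a stable key for an item hash.
sub item_key {
    my ($item) = @_;
    for my $field (qw(link guid)) {
        my $value = $item->{$field};
        return "$field:$value"
            if defined $value and !ref $value and length $value;
    }
    # No usable identifier: fall back to a digest of the visible content.
    my $text = join "\n", map { $item->{$_} // '' } qw(title description);
    return 'md5:' . md5_hex( encode_utf8($text) );
}

my %seen;

# Made-up items standing in for @{ $rss->{items} }.
my @items = (
    { title => 'Story one', description => 'First body' },
    { title => 'Story one', description => 'First body' },    # repeat from a later poll
    { title => 'Story two', link => 'http://example.com/2' },
);

for my $item (@items) {
    next if $seen{ item_key($item) }++;    # already saw this item
    print "new item: $item->{title}\n";
}
```

The trade-off is that an item whose text is edited after publication gets a new digest and will show up as new again.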

        LWP::UserAgent

        Is it reliable for me to use the modification time? Does the modification time change whenever the page changes; that is, is it a requirement for RSS feeds?
