
Re^3: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

by rowdog (Curate)
on Jun 26, 2010 at 18:12 UTC (#846700)

in reply to Re^2: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

So, you're parsing an RSS file and then fetching the link for each item. What you get back is HTML, not RSS, so no, I don't think you'll get far trying to process the links with XML::RSS::LibXML.

I'm not sure what you plan to do with the HTML documents but you already have XML::LibXML loaded into RAM so you could use it to parse the HTML:

    use XML::LibXML;

    my $dom = XML::LibXML->load_html(
        location => $fileName,
        recover  => 1,    # handle marginal HTML
    );
    print $dom->toString;

The parser options for load_html are documented in XML::LibXML::Parser.
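
Once the HTML is parsed, the usual XML::LibXML XPath methods are available on the resulting document. The following is a minimal sketch of that idea, not a recipe for any particular site: the markup in `$html` is made up for illustration, and it is loaded from a string rather than from a file so the example stands alone.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup standing in for a fetched page (invented for this example).
my $html = <<'HTML';
<html><body>
<h1>Headline</h1>
<p>See <a href="http://example.com/a">A</a> and
<a href="http://example.com/b">B</a>.</p>
</body></html>
HTML

my $dom = XML::LibXML->load_html(
    string  => $html,
    recover => 1,    # keep going on marginal HTML
);

# Text content of the first <h1>, and the href of every <a> element.
my $title = $dom->findvalue('//h1');
my @hrefs = map { $_->getAttribute('href') } $dom->findnodes('//a');

print "$title\n";
print "$_\n" for @hrefs;
```

Real pages will need XPath expressions that match their own structure; the two used here are only placeholders.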

Re^4: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by mr_p (Scribe) on Jun 27, 2010 at 02:38 UTC
    I just keep polling the same RSS feed and comparing its links with those from the last poll. Whenever new links or items have been added, I only want to parse the Item tag of the new item.

      I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

      You still have $rss, so you can find the item by searching. Maybe something like:

      sub find_item {
          my ($link) = @_;
          # $rss->{items} is an array reference, so dereference it to loop
          for my $item ( @{ $rss->{items} } ) {
              return $item if $item->{link} eq $link;
          }
          return;    # no item has this link
      }
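
      If many links need to be looked up, a hash keyed by link avoids rescanning the list each time. This is only a sketch of that alternative: `$items` is sample data standing in for `$rss->{items}`, and `%item_for` is a name chosen here for illustration.

```perl
use strict;
use warnings;

# Stand-in for $rss->{items}: an array reference of item hashes (sample data).
my $items = [
    { title => 'First story',  link => 'http://example.com/1' },
    { title => 'Second story', link => 'http://example.com/2' },
];

# Build the index once: link => item. Items without a link are skipped.
my %item_for = map { ( $_->{link} => $_ ) }
               grep { defined $_->{link} } @$items;

# Each lookup is now a single hash access.
my $item  = $item_for{'http://example.com/2'};
my $title = $item ? $item->{title} : 'not found';
print "$title\n";
```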

      On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

      #!/usr/bin/perl
      use strict;
      use warnings;

      use LWP::UserAgent;
      use XML::RSS::LibXML;

      my $url          = '';
      my $rss_file     = '/tmp/.rss_download_file';
      my $website_name = "usnews";

      my %seen;
      my $client = LWP::UserAgent->new;
      my $rss    = XML::RSS::LibXML->new;

      while ( 1 ) {
          print "polling: $website_name url: $url\n";
          $client->mirror($url, $rss_file);    # be nice to the server
          $rss->parsefile($rss_file) or die $!;

          if ( !%seen ) {
              print "first listing\n";
          }

          foreach my $item ( @{ $rss->{items} } ) {
              $seen{ $item->{link} }++ and next;    # already saw this item

              # do stuff with the new item
              print $item->{title}, "\n";
              print "$item->{pubDate}\n";
              #$client->get() ...
          }

          sleep 15 * 60;    # 15 minutes, play nice
      }

      As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
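
      The response that mirror returns can also tell you whether anything was downloaded: when the server honors "If-Modified-Since" and the file is unchanged, the status is 304 Not Modified. A minimal sketch of checking for that, where `feed_changed` is a hypothetical helper name and not part of LWP:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true when mirror() stored a fresh copy in $file,
# false when the server answered 304 Not Modified, dies on other failures.
sub feed_changed {
    my ($ua, $url, $file) = @_;

    my $response = $ua->mirror($url, $file);
    return 0 if $response->code == 304;    # unchanged since the last fetch

    die "fetch failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return 1;
}

# Possible use inside a polling loop (the variables are the caller's own):
#   if ( feed_changed($client, $url, $rss_file) ) { ... parse the file ... }
```

      Whether a given server actually sends 304 responses is up to that server, so treat an unchanged file as a hint to skip parsing, not a guarantee that every change will be reported this way.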

        Finding Item:

        Some RSS feeds do not provide links for new items; the content is embedded as headlines and stories. In that case I need to parse the Item tag right there.


        Is it reliable for me to use the modification time? Does the modification time change whenever the page changes, that is, is that an RSS feed requirement?
