PerlMonks  

Re^2: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

by mr_p (Scribe)
on Jun 25, 2010 at 17:07 UTC (#846554)


in reply to Re: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
in thread Parsing ITEM Tag from RSS feed using XML::RSS::LibXML

Here is my code.

The TODO part is where I have to put my code, but nothing I have read about XML::RSS::LibXML shows me how to pull out the "item" tag.

    #!/usr/bin/perl
    use File::Path;
    use Data::Dumper;
    use LWP::UserAgent;
    use XML::RSS::LibXML;
    use POSIX qw(strftime);
    use Time::HiRes qw(gettimeofday tv_interval);

    my $client = LWP::UserAgent->new();
    my ($fh, $feed, $feed_title, $count, $node);
    my $rss = XML::RSS::LibXML->new;
    my $website_name = "usnews";
    my $url = "http://www.usnews.com/rss/health-news/index.rss";

    $firstListing = 1;

    while (1) {
        if ( $website_name eq "" ) { next; };

        print "polling: $website_name url: $url\n";
        $capture = $client->get("$url", ":content_file" => "/tmp/.rss_download_file") || die"$!\n";

        $rss->parsefile('/tmp/.rss_download_file');
        print "channel: $rss->{channel}->{title}\n";

        @curListOfItems = ();
        foreach my $item (@{ $rss->{items} }) {
            my $node_link = $item->{link};
            if (defined $node_link) {
                $curItem=$node_link ."\n";
                push (@curListOfItems, $curItem);
            }
        }

        if ($#prevListOfItems != -1 ) {
            # @newlyAddedLinks will be latest in curListOfItems and not in @prevListOfItems
            @newlyAddedLinks=grep!${{map{$_,1}@prevListOfItems}}{$_},@curListOfItems;
            foreach my $l (@newlyAddedLinks) {
                my $fileName=getFileName();
                $fileName="/tmp/.$website_name\_${fileName}";
                my $capture = $client->get("$l", ":content_file" => "$fileName");
                # TODO: Pull out the current Item tag ( <item> .....</item> )
            }
            print "Getting1 $filename\n";
        } elsif ( $firstListing == 1) {
            print "Getting2 $filename\n";
            foreach my $l (@curListOfItems) {
                my $fileName=getFileName();
                $fileName="/tmp/.$website_name\_${fileName}";
                my $capture = $client->get("$l", ":content_file" => "$fileName");
                # TODO: Pull out the current Item tag ( <item> .....</item> )
            }
            $firstListing = 0;
        }

        @prevListOfItems = @curListOfItems;
        open OUT_FILE, "> /tmp/.$website_name" || die "could not open file $!";
        print OUT_FILE "@prevListOfItems";
        close OUT_FILE;
        sleep 1;
    }

    sub getFileName {
        my ($seconds, $microseconds) = gettimeofday();
        my $padded_usecs = sprintf ('%06d', $microseconds);
        my ($logType, $str1, $str2) = split ('\|',$LogElement);
        $todaysDate = strftime "%d", localtime;
        $currentDateTime = strftime "%Y:%m:%d:%H:%M:%S", localtime;
        ($Year,$Month,$Date,$Hour,$Minute,$Seconds) = split /:/, $currentDateTime;
        $curYear = sprintf ('%04d', $Year);
        $curMonth = sprintf ('%02d', $Month);
        $curHour = sprintf ('%02d', $Hour);
        $curMinute = sprintf ('%02d', $Minute);
        $curDate = sprintf ('%02d', $Date);
        $curSec = sprintf ('%02d', $Seconds);
        my $fname = "${curYear}${curMonth}${curMonth}${curHour}${curMinute}${curSec}.html";
        return "$fname";
    }


Re^3: Parsing ITEM Tag from RSS feed using XML::RSS::LibXML
by rowdog (Curate) on Jun 26, 2010 at 18:12 UTC

    So, you're parsing an RSS file and then, for each item, you are fetching the link. What you get back is HTML not RSS so no, I don't think you'll get far trying to process the links with XML::RSS::LibXML.

    I'm not sure what you plan to do with the HTML documents but you already have XML::LibXML loaded into RAM so you could use it to parse the HTML:

        use XML::LibXML;

        my $dom = XML::LibXML->load_html(
            location => $fileName,
            recover  => 1,    # handle marginal HTML
        );
        print $dom->toString;

    The parser options for load_html are documented in XML::LibXML::Parser.
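
Once the page is parsed, XPath can pull individual pieces out of the DOM. The following is only a minimal sketch: the HTML string is a made-up stand-in for one of the downloaded article pages, and the two XPath expressions are illustrative, not anything the original poster asked for.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for one of the downloaded article pages.
my $html = <<'HTML';
<html>
  <head><title>Sample article</title></head>
  <body>
    <p>See <a href="http://example.com/a">first</a>
    and <a href="http://example.com/b">second</a>.</p>
  </body>
</html>
HTML

my $dom = XML::LibXML->load_html(
    string  => $html,
    recover => 1,    # tolerate marginal HTML, as in the reply above
);

# findvalue returns the string value of the XPath expression.
my $title = $dom->findvalue('//title');

# findnodes in list context returns the matching nodes; here, href attributes.
my @hrefs = map { $_->value } $dom->findnodes('//a/@href');

print "title: $title\n";
print "link:  $_\n" for @hrefs;
```

For a file on disk, `string => $html` would be swapped for `location => $fileName` as in the snippet above.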

      I just keep polling the same RSS feed and compare its links with those from the last poll. Whenever a new link or item has been added, I only want to parse the item tag of the new item.

        I think I see now. You're pushing the links for each item onto an array and then reducing to the set of new links. Once there, you want to get back to the $rss->item that the link came from.

        You still have $rss so you can find the item by searching. Maybe something like

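        sub find_item {    # no "()" prototype, since the sub takes an argument
            my $link = shift;
            # $rss->{items} is an array reference; dereference it to loop over the items
            for my $item ( @{ $rss->{items} } ) {
                $item->{link} eq $link and return $item;
            }
            return undef;
        }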
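
If many links have to be looked up per poll, a hash keyed on the link avoids rescanning the item list each time. This is a sketch under assumptions: the `$rss` structure below is a hand-built stand-in that only mimics the `$rss->{items}` layout used in this thread, and `index_by_link` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed feed: the thread's code reads
# $rss->{items} as an array reference of hash references.
my $rss = {
    items => [
        { title => 'First story',  link => 'http://example.com/1' },
        { title => 'Second story', link => 'http://example.com/2' },
    ],
};

# Build the index once per poll: link => item.
sub index_by_link {
    my ($feed) = @_;
    my %by_link;
    for my $item ( @{ $feed->{items} } ) {
        next unless defined $item->{link};
        $by_link{ $item->{link} } = $item;
    }
    return \%by_link;
}

my $by_link = index_by_link($rss);

# The original script stores each link with a trailing "\n", so strip it
# before looking the link up.
my $stored = "http://example.com/2\n";
chomp( my $link = $stored );

my $item = $by_link->{$link};
print defined $item ? "found: $item->{title}\n" : "not found\n";
```

The `chomp` matters with the original script because a link with a trailing newline never compares equal to `$item->{link}`.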

        On reflection, I don't really care for the way you're keeping track of seen items. All that map grep stuff can be replaced with a simple hash. Maybe you'll have more luck if you restructure things a bit. Here's my skeletal example.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use XML::RSS::LibXML;

        my $url          = 'http://www.usnews.com/rss/health-news/index.rss';
        my $rss_file     = '/tmp/.rss_download_file';
        my $website_name = "usnews";
        my %seen;

        my $client = LWP::UserAgent->new;
        my $rss    = XML::RSS::LibXML->new;

        while ( 1 ) {
            print "polling: $website_name url: $url\n";
            $client->mirror($url, $rss_file);    # be nice to the server
            $rss->parsefile($rss_file) or die $!;

            if ( !%seen ) {
                print "first listing\n";
            }

            foreach my $item ( @{ $rss->{items} } ) {
                $seen{ $item->{link} }++ and next;    # already saw this item

                # do stuff with the new item
                print $item->{title}, "\n";
                print "$item->{pubDate}\n";
                #$client->get() ...
            }

            sleep 15 * 60;    # 15 minutes, play nice
        }
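
To see the `%seen` idiom in isolation, here is a small offline sketch with no network access. The two "polls" are made-up item lists and `new_items` is a hypothetical helper; only the `$seen{...}++` test is taken from the example above.

```perl
use strict;
use warnings;

my %seen;

# Return only the items whose link has not been seen on an earlier call.
# The post-increment yields the old count (0/undef the first time), so the
# grep keeps an item exactly once.
sub new_items {
    my (@items) = @_;
    return grep { !$seen{ $_->{link} }++ } @items;
}

# Two hypothetical polls of the same feed; the second adds one item.
my @poll1 = ( { link => 'http://example.com/1' }, { link => 'http://example.com/2' } );
my @poll2 = ( { link => 'http://example.com/3' }, @poll1 );

my @first  = new_items(@poll1);
my @second = new_items(@poll2);

printf "poll 1: %d new\n", scalar @first;
printf "poll 2: %d new (%s)\n", scalar @second, join ', ', map { $_->{link} } @second;
```

Because the filter hands back the item hashes themselves, there is no need to search for the item again afterwards.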

        As an aside, fetching the RSS file every second is a good way to convince the server that you're attacking it. 15 minutes is probably okay but you should check the Terms of Service to be sure. On that same note, I like LWP::UserAgent's mirror method because it sends the "If-Modified-Since" header so you don't fetch the file if it hasn't changed.
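
The response returned by `mirror` can also be used to skip the parse when nothing changed. The sketch below assumes that an unchanged file comes back as a 304 Not Modified response; `feed_changed` is a hypothetical helper, and the responses are built by hand so the example runs without a network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Decide whether a response returned by mirror() means there is new data.
# 304 Not Modified is not a "success" status, so test for it first.
sub feed_changed {
    my ($response) = @_;
    return 0 if $response->code == 304;
    die 'fetch failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return 1;
}

# Offline demonstration with hand-built responses (no network needed).
my $unchanged = HTTP::Response->new( 304, 'Not Modified' );
my $fresh     = HTTP::Response->new( 200, 'OK' );

print "304 => ", feed_changed($unchanged), "\n";
print "200 => ", feed_changed($fresh),     "\n";

# In the polling loop this might look something like:
#   my $response = $client->mirror( $url, $rss_file );
#   if ( feed_changed($response) ) { $rss->parsefile($rss_file); ... }
```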
