PerlMonks  

Writing a simple RSS aggregator.

by DigitalKitty (Parson)
on Dec 06, 2003 at 02:11 UTC ( #312708=perlquestion )
DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I am in the process of writing an RSS aggregator for a professor and he is not willing to install modules himself or permit anyone else to do so ( annoying indeed ). Therefore, I am relegated to using regexes. In the process of testing the tool on thraxil.org, I noticed there is no output displayed. My code ( thus far ):

#!c:\perl\bin\perl.exe -w
use strict;
use LWP::Simple;
use CGI qw( :standard );

print "Content-type: text/html\n\n";
print start_html;

my $data = get("http://thraxil.org/rss");
my $scalar;

open (F, ">test.txt") or die $!;
print F $data;
close F;

open( F2, "<test.txt" ) or die "Error : $!\n";
while(<F2>) {
    if ( /<title>\s*(.*?)\s*<\/title><link>(.*?)<\/link>/m ) {
        print "<a href=$2>$1</a><br><br>";
    }
}
close F2;

print end_html();


The rss source I am trying to parse can be seen at http://thraxil.org/rss.
I need to capture the title, link, and description data then display each group of three with the <link> info as a hyperlink to the article / node.

I feel as though I am quite close but a little assistance would be quite beneficial.

Thanks,
-Katie.

Re: Writing a simple RSS aggregator.
by Anonymous Monk on Dec 06, 2003 at 02:26 UTC
    You must use the CPAN!!!!

      If you had engaged your brain before hitting the create button, you would have seen that she said:

      I am in the process of writing an RSS aggregator for a professor and he is not willing to install modules himself or permit anyone else to do so ( annoying indeed ). Therefore, I am relegated to using regexes.

      It is for that reason that I --'d the response (one of my few times to ever do so), although responding as Anonymonk you will likely never feel anything from it. (I am responding here because, were the positions reversed, I would prefer to know why someone would do so to me, so common courtesy requires that I behave in turn.)

      Having been in the CB when she has discussed this before, it has been quite clear that she would much rather be using standard CPAN modules (and would probably have it done by now, if so), but that the professor in question is the problem preventing doing so.

      While I disagree with the professor, sometimes one has to work around the requirements for a project with what is provided.

Re: Writing a simple RSS aggregator.
by thraxil (Prior) on Dec 06, 2003 at 02:33 UTC

    your regexp is never matching. the <link> and <title> are never on the same line in the feed.

    that's your immediate problem i think. parsing the feed with regexes, you're likely to run into plenty of other problems that i'm sure the other monks will point out.
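A minimal sketch of the mismatch described above, assuming a made-up two-item feed in which `<title>` and `<link>` sit on separate lines (the sample data and the `example.invalid` URLs are hypothetical). Note that the pattern here adds `\s*` between the two elements, which the original regex does not have; without it the newline blocks the match even against the whole string.

```perl
use strict;
use warnings;

# Hypothetical feed fragment laid out the way the reply describes:
# <title> and <link> are never on the same line.
my $feed = <<'XML';
<item>
<title>First post</title>
<link>http://example.invalid/1</link>
</item>
<item>
<title>Second post</title>
<link>http://example.invalid/2</link>
</item>
XML

# \s* between </title> and <link> lets the pattern span the line break.
my $pair = qr{<title>\s*(.*?)\s*</title>\s*<link>(.*?)</link>}s;

# Line at a time: no single line holds both elements, so nothing matches.
my $line_hits = grep { /$pair/ } split /^/, $feed;

# Whole string: the pattern can cross line boundaries.
my @links;
while ($feed =~ /$pair/g) {
    push @links, [ $1, $2 ];
}

print "line-by-line matches: $line_hits\n";              # 0
print "whole-string matches: ", scalar(@links), "\n";    # 2
```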

Re: Writing a simple RSS aggregator.
by Zaxo (Archbishop) on Dec 06, 2003 at 02:43 UTC

    The XML in the feed is spread over several lines, but you're reading only one line at a time, so no single line matches your whole regex.

    Try setting local $/ = '</item>'; before reading. The alternative is to forget the intermediate file, rely on the linebreaks, and do global matching a la,

    my $regex = qr/<title>(.*?)<\/title>\n<link>(.*?)<\/link>\n<description> +(.*?)\n<\/description>/;
    while ($data =~ /$regex/g) {
        # $1 is the title, $2 the link, $3 the description
        #...
    }
    That is pretty fragile, however. I suspect you're doing this as a favor and it seems odd that you have to rewrite the good xml modules to do it.

    LWP::Simple is just as optional as the XML modules, which you should be able to use. There is even one for rss.

    After Compline,
    Zaxo

      Here are a couple of other methods that were inspired by Zaxo:

      Method #1

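      #!/usr/bin/perl -w
      use strict;
      use LWP::Simple;
      use CGI qw( :standard );
      require 5.8.0;   # opening a filehandle on a scalar ref needs 5.8

      print "Content-type: text/html\n\n";
      print start_html;

      my $RSS = get("http://thraxil.org/rss");

      {
          # Read one <item> per record by making </item> the record separator.
          local $/ = "</item>";
          open my $rss, "<", \$RSS or die "Aaiiigh - $!";
          while (<$rss>) {
              # Drop anything before <item> (such as the channel's own title
              # and link); skip records that contain no item at all.
              s/^.*<item\b[^>]*>//s or next;
              my ($title) = m!<title>(.*?)</title>!is;
              my ($link)  = m!<link>(.*?)</link>!is;
              my ($desc)  = m!<description>(.*?)</description>!is;
              next unless $title && $link && $desc;
              print "Title: $title\nLink: $link\nDescription: $desc\n\n";
          }
          close $rss;
      }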

      Method #2

      #!/usr/bin/perl -w
      use strict;
      use LWP::Simple;
      use CGI qw( :standard );

      print "Content-type: text/html\n\n";
      print start_html;

      my $RSS = get("http://thraxil.org/rss");

      my @items = $RSS =~ m!<item.*?>(.*?)</item>!gis;
      for (@items) {
          my ($title) = m!<title>(.*?)</title>!is;
          my ($link)  = m!<link>(.*?)</link>!is;
          my ($desc)  = m!<description>(.*?)</description>!is;
          next unless $title && $link && $desc;
          print "Title: $title\nLink: $link\nDescription: $desc\n\n";
      }

      Each of these has its own merits, but if you want to do it right, use a real parser from CPAN. :-)
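Both methods above print plain text, while the original question asked for the link to be shown as a hyperlink. The following is a rough sketch of that last step, not a definitive implementation. The helper names `extract_items` and `render_item` are made up for illustration. It assumes the field text is ordinary XML-escaped character data with no CDATA sections, so it is passed through unchanged and only literal double quotes in the link are escaped for the attribute.

```perl
use strict;
use warnings;

# Split the feed into items and pull out the three fields, in the spirit
# of Method #2. Returns a list of hashrefs (title, link, description).
sub extract_items {
    my ($xml) = @_;
    my @items;
    for my $chunk ($xml =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my %item;
        for my $field (qw(title link description)) {
            ($item{$field}) = $chunk =~ m{<$field>\s*(.*?)\s*</$field>}is;
        }
        # Skip items that lack any of the three fields.
        next if grep { !defined $item{$_} || $item{$_} eq '' }
                qw(title link description);
        push @items, \%item;
    }
    return @items;
}

# Render one item as an HTML paragraph with the title as a hyperlink.
# Assumption: the text is already XML-escaped, which is also valid in HTML,
# so only a literal double quote in the link needs escaping for the attribute.
sub render_item {
    my ($item) = @_;
    (my $href = $item->{link}) =~ s/"/&quot;/g;
    return qq{<p><a href="$href">$item->{title}</a><br>\n$item->{description}</p>\n};
}
```

With the feed text in `$RSS`, as in the methods above, `print render_item($_) for extract_items($RSS);` would emit one paragraph per item. Quoting the `href` value also avoids the unquoted `<a href=$2>` from the original script, which breaks on links containing spaces or `>`.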

Re: Writing a simple RSS aggregator.
by thraxil (Prior) on Dec 06, 2003 at 02:50 UTC

    also, aside from parsing issues, i'd like to point out that an RSS aggregator is an HTTP client and should be polite by properly supporting HTTP response codes and using things like ETag and If-Modified-Since headers so it doesn't overload the server (especially when it's mine ;)
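A hedged sketch of what that politeness might look like as a conditional GET. It uses HTTP::Tiny, which is an assumption on my part (the thread itself uses LWP); HTTP::Tiny ships with Perl 5.14 and later, so nothing needs installing there. The `%cache` layout and the helper names `conditional_headers` and `fetch_if_changed` are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Tiny;   # core since Perl 5.14; an assumption, the thread uses LWP

# Build conditional-request headers from the validators cached last time.
sub conditional_headers {
    my ($cache) = @_;
    my %headers;
    $headers{'If-None-Match'}     = $cache->{etag}
        if defined $cache->{etag};
    $headers{'If-Modified-Since'} = $cache->{last_modified}
        if defined $cache->{last_modified};
    return \%headers;
}

# Fetch $url only if it changed. $cache is a hypothetical per-feed hashref
# (etag, last_modified, content) that the caller keeps between runs.
sub fetch_if_changed {
    my ($http, $url, $cache) = @_;
    my $res = $http->get($url, { headers => conditional_headers($cache) });

    # 304 Not Modified: the server sent no body, so reuse the cached copy.
    return $cache->{content} if $res->{status} == 304;

    die "GET $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};

    # HTTP::Tiny lower-cases response header names.
    $cache->{etag}          = $res->{headers}{etag};
    $cache->{last_modified} = $res->{headers}{'last-modified'};
    $cache->{content}       = $res->{content};
    return $cache->{content};
}
```

A caller would do something like `my %cache; my $xml = fetch_if_changed(HTTP::Tiny->new, "http://thraxil.org/rss", \%cache);` and persist `%cache` between polls. Passing the client object in keeps the function easy to exercise without touching the network.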

    for my RSS gathering and parsing, i actually prefer to use Mark Pilgrim's ultra-liberal feed parser, since he's even more anal about that stuff than i am. it doesn't require anything beyond the *cough* python core library, so if the server has python installed, that may be an option...

      Incidentally, Spidering Hacks has Perl code for doing all the friendly ETag, If-Modified-Since stuff. Which at some point I'll be building into WWW::Mechanize::Cached, along with Expires awareness.

Re: Writing a simple RSS aggregator.
by DigitalKitty (Parson) on Dec 06, 2003 at 03:33 UTC
    Thanks Zaxo and Anders.

    I'm quite the rss neophyte (as if that wasn't already obvious). *wink*

    -Katie.
Re: Writing a simple RSS aggregator.
by demerphq (Chancellor) on Dec 06, 2003 at 10:45 UTC

    Personally I think you should direct your Professor here to one of the many, many different threads discussing the "you can't install modules" meme. A professor should know better, and certainly shouldn't be asking you to reinvent wheels to satisfy his perverse reluctance to stay up to date. Especially since, as far as pure-Perl modules go, he doesn't have a leg to stand on. (And he should really be made to know it.)
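To illustrate the pure-Perl point: such a module is just a set of `.pm` files, so it can be copied into a directory next to the script and loaded from there without any system-wide install. A minimal sketch, assuming a private `lib` directory beside the script; `My::FeedParser` is a placeholder name, not a real CPAN module, and `have_module` is a made-up helper.

```perl
use strict;
use warnings;
use FindBin qw($Bin);   # core module: directory containing this script
use lib "$Bin/lib";     # hypothetical private module directory

# Return true if $module can be loaded from @INC (including ./lib above).
sub have_module {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# My::FeedParser is a placeholder, not a real CPAN module.
my $parser_available = have_module('My::FeedParser');
print $parser_available
    ? "using the parser module from $Bin/lib\n"
    : "no parser module found; falling back to regexes\n";
```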


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi


Re: Writing a simple RSS aggregator.
by mtve (Chaplain) on Dec 07, 2003 at 08:19 UTC

    try my aggregator

Node Type: perlquestion [id://312708]
Approved by Zaxo