PerlMonks  

RSS Headline Sucker

by radixzer0 (Beadle)
on Apr 27, 2000 at 22:30 UTC ( [id://9443]=sourcecode )
Category: XML
Author/Contact Info radixzer0
Description:

Make your own Portal! Amaze your friends! Confuse your enemies!

This quick hack takes a list of RSS feeds and pulls the links into a local database. If you don't know what RSS is, it's the cool XML headline standard used by My Netscape. Lots of sites provide RSS feeds that you can use to make headline links on your own site (like Slashdot.org, ZDNet, etc.). I used this to make a little headline scroller using DHTML on our company Intranet. This script works best with a scheduler (like cron) to update on a periodic basis.

For a comprehensive list of available feeds, take a look at http://www.xmltree.com.

Comments/improvements more than welcome ;)

#!c:\perl\bin\perl

use LWP::UserAgent;
use XML::RSS;

#We're running this off of a Windows machine, connecting to a M$SQL server
# although any old SQL server would do (e.g. MySQL) 

use Win32::ODBC;

$DSN = "TESTSERVER";


#Create a new UserAgent to pull the XML data down
$ua = new LWP::UserAgent;
$ua->agent("HeadlineAgent/0.1 ".$ua->agent);

#connect via ODBC to the SQL server
if(!($db = new Win32::ODBC($DSN))){
    print "Error connecting to $DSN\n";
    print "Error: " . Win32::ODBC::Error() . "\n";
   exit;
}

# We'll be pulling in RSS files from various sources, 
# their URLs are stored in the SQL database

my %sources;

if($db->Sql("SELECT * FROM ExternalNewsSources"))
{
    print "SQL failed.\n";
    print "Error: " . $db->Error() . "\n";
    $db->Close();
    exit;
}

while($db->FetchRow()){
    my(%data) = $db->DataHash();
#    ...process the data...
#    Add to hash of hashes
    $sources{$data{'ExternalNewsSourceID'}} =  $data{'Source'};
}

#Create the RSS object to parse the RSS files retrieved...
my $rss = new XML::RSS;

($sec,$min,$hour,$mday,$mon,$year) = localtime(time);
# preformatted string compatible with SQLServer's timestamp field
$nowstring = sprintf("%02i/%02i/%i %02i:%02i:%02i",($mon+1),$mday,($year+1900),$hour,$min,$sec);

#Walk through each of the XML sources
foreach $sourceid(keys %sources)
{
# fetch RSS file from the source's URL
    my $request = new HTTP::Request GET => $sources{$sourceid};
    my $result = $ua->request($request);

    if($result->is_success)
    {
#   grok the RSS file retrieved
        $rss->parse($result->content);

#   Step through all the links in the RSS
        for my $i (@{$rss->{items}})
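        {
#       Check to see if we've already seen this link from this source before...
#       (single quotes are doubled so the link can't break the SQL string literal)
            (my $seen_link = defined $i->{'link'} ? $i->{'link'} : '') =~ s/'/''/g;
            $db->Sql("SELECT * FROM ExternalNews WHERE SourceID=".$sourceid." AND Link = '".$seen_link."'");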
            if($db->FetchRow())
            {
            #skip it - it's here already...
            }
            #Sometimes the RSS mis-parses and gives us an empty item
            elsif(length($i->{'title'}) <= 0)
            {
            #skip it - it's empty...
            }
            else
            {
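            #Plunk it into the database, doubling any single quotes so that
            #apostrophes in the text can't break the SQL string literals
                my ($title, $link, $desc) = @{$i}{qw(title link description)};
                for ($title, $link, $desc) { $_ = '' unless defined $_; s/'/''/g; }
                if($db->Sql("INSERT INTO ExternalNews (SourceID,PostDate,Title,Link,Description) VALUES ($sourceid,'$nowstring','$title','$link','$desc')"))
                {
                    print "Insert failed: " . $db->Error() . "\n";
                }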
            }
# Nuke the current values in the object, it appears that the XML lib recycles the variables without clearing them...
            $i->{'title'} = '';
            $i->{'link'} = '';
            $i->{'description'} = '';
        }
    }
    else
    {
        # LWP does not set $!, so report the HTTP status line instead
        print "Doh! couldn't get ".$sources{$sourceid}.": ".$result->status_line."\n";
    }
}

#clean up
$db->Close();
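
As an aside on the timestamp step: the sprintf call above assembles a MM/DD/YYYY HH:MM:SS string by hand. A minimal sketch of the same formatting with strftime from the core POSIX module follows. The helper name sql_timestamp is hypothetical and not part of the original script, and whether a given database accepts this layout still depends on its date settings.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not in the original script): format an epoch time
# as MM/DD/YYYY HH:MM:SS in local time, the same layout the sprintf call builds.
sub sql_timestamp {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;
    return strftime("%m/%d/%Y %H:%M:%S", localtime($epoch));
}

print sql_timestamp(), "\n";
```

Calling sql_timestamp() with no argument uses the current time, which matches how $nowstring is computed once per run in the script.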
Replies are listed 'Best First'.
RE: RSS Headline Sucker
by merlyn (Sage) on Oct 01, 2000 at 00:11 UTC
RE: RSS Headline Sucker
by cei (Monk) on May 03, 2000 at 10:54 UTC
    Can you post the fields of the first table? I figured the second table from the insert statement, but I wanted to see how you set up your sources to pull from.

    Thanks.

      Sorry for taking so long to respond, didn't see your request till now.
      I was running this out of Access (ick), although the idea is the same on any database (e.g. MySQL):
      ExternalNewsSourceID -> autoNumber (internal unique id)
      Title -> text (name of the site)
      Link -> text (link to the site, not the source)
      Description -> text (description of the site)
      Source -> text (URL for the actual feed)

      I populated this list by hand by selecting sources from xmltree.com.
      Let me know if you need more detail. -r0

        Trying this again in HTML...Wow, your code is exactly what I need.  Cannot get it to run on Access though.
         
        This is the content of my first table named ExternalNewsSources
         
         
        ExternalNewsSourceID Title Link Description Source
        1 SlashDot http://slashdot.org/ Headlines http://slashdot.org/slashdot.rdf
         
        Format of the first table is as suggested
         
        ExternalNewsSourceID  ->  autoNumber (internal unique id)
        Title  ->  text  (name of the site)
        Link  ->  text  (link to the site, not the source)
        Description  ->  text  (description of the site)
        Source  ->  text  (URL for the actual feed)
         
        My second table is named ExternalNews and has the following columns:
         
        SourceID          -text
        PostDate          -text
        Title             -text
        Link              -text
        Description       -text
         
        I have a System DSN named TESTSERVER as in the example and I'm running your example code as published. 
         
        I don't get any errors.... just no data in the tables.  Do I have something simple wrong?
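
One thing worth ruling out here (a guess, not something confirmed in this thread): the script as posted never checks the result of its INSERT, so a statement the database rejects fails silently. Headlines containing an apostrophe are a likely trigger, because the title is pasted straight into a quoted SQL string. The sketch below shows the usual workaround of doubling single quotes before building the statement. The helper names sql_quote and build_insert are hypothetical, and the sample item is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a value in single quotes for use as a SQL
# string literal, doubling any embedded single quotes.
sub sql_quote {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/'/''/g;
    return "'" . $value . "'";
}

# Hypothetical helper: build the script's INSERT statement from one RSS item.
# $sourceid is assumed to be numeric, as the original script writes it unquoted.
sub build_insert {
    my ($sourceid, $nowstring, $item) = @_;
    my @values = map { sql_quote($_) }
                 $nowstring, @{$item}{qw(title link description)};
    return "INSERT INTO ExternalNews (SourceID,PostDate,Title,Link,Description) VALUES ("
         . join(",", $sourceid + 0, @values)
         . ")";
}

# Made-up sample item containing apostrophes.
my %item = (
    title       => "Perl's new release",
    link        => "http://example.com/story",
    description => "It's out",
);
print build_insert(1, "04/27/2000 22:30:00", \%item), "\n";
```

Printing the statement before running it makes it easy to paste into the database's own query tool. The script's SELECT check treats a true return value from $db->Sql as failure, so testing the INSERT the same way and printing $db->Error() should show whether the database is rejecting the statement.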
RE: RSS Headline Sucker
by perlcgi (Hermit) on Apr 28, 2000 at 00:49 UTC
    Way to go radixzer0! Nice post. I saw a 6 month Perl contract recently, looking for someone to, among other things, craft a similar headline sucker. Kudos man.
      thanks :)
      BTW, if anybody wants the DHTML end of it, let me know...
      radix0 at yahoo dot com -r0
