http://www.perlmonks.org?node_id=485882

xdg has asked for the wisdom of the Perl Monks concerning the following question:

I tried Super Search to see if this had been discussed, but most of the deluge of RSS questions seem to consist of "I'm trying to scrape RSS and I'm clueless, please help" so I gave up in frustration. Apologies if I missed something obvious somewhere.

I'm not clueless and I've been working with RSS for a while now (cf. Code for Perlmonks XML to RSS), and I'm a little frustrated with the various incompatibilities and breakage that I encounter dealing with people's feeds. I'm currently using combinations of XML::RSS and XML::RAI -- though largely because that's what I started with. So my questions are these:

  1. What modules for RSS parsing have people found to be the most robust and stable (given unreliable, non-standard input feeds)?

  2. What modules best parse all the various feed standards? (E.g. XML::RSS docs are inconsistent about RSS 2.0 support)

  3. What modules best produce all the various feed standards?

  4. What pre-processing have people found helpful in cleaning up non-standard feeds to keep XML::Parser and the like from giving up on errors?

On that last point, I'll share my own helpful snippet. I'm currently doing a rather hackish bit with a regex and HTML::Entities::Numbered to fix up some of the broken encodings that I commonly find in feeds and that were breaking XML::Parser. The regex adds the missing trailing semicolon to decimal entities, and name2decimal_xml then turns named entities into numeric ones. YMMV.

$content =~ s/(&#\d+);?/$1;/g;
$content = name2decimal_xml( $content );
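For anyone who wants to try the idea without installing anything, here is a minimal, core-Perl-only sketch of the same two-step cleanup. It is not the HTML::Entities::Numbered implementation: the `%NAMED` table, the `decimal_ref` helper and the `clean_feed_entities` name are all hypothetical stand-ins made up for illustration, and the table covers only four entities. It also assumes, as the real name2decimal_xml is understood to do, that the five XML-predefined entities should be left untouched.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stand-in for the entity table that
# HTML::Entities::Numbered provides. A real feed needs the full HTML set.
my %NAMED = (
    nbsp   => 160,
    copy   => 169,
    eacute => 233,
    mdash  => 8212,
);

# Entities that XML itself predefines; these must survive unchanged.
my %XML_PREDEFINED = map { $_ => 1 } qw(amp lt gt quot apos);

# Return the decimal reference for a known named entity,
# or the entity unchanged if it is XML-predefined or unknown.
sub decimal_ref {
    my ($name) = @_;
    return '&' . $name . ';'
        if $XML_PREDEFINED{$name} || !exists $NAMED{$name};
    return '&#' . $NAMED{$name} . ';';
}

sub clean_feed_entities {
    my ($content) = @_;

    # Step 1: same regex as above. Consume an optional semicolon and
    # always write one back, so "&#8217" becomes "&#8217;".
    $content =~ s/(&#\d+);?/$1;/g;

    # Step 2: rewrite named entities that a plain XML parser would reject.
    $content =~ s/&([A-Za-z][A-Za-z0-9]*);/decimal_ref($1)/ge;

    return $content;
}

print clean_feed_entities(
    "Tom &amp; Jerry&#8217s caf&eacute; &mdash; &copy; 2005\n"
);
# prints: Tom &amp; Jerry&#8217;s caf&#233; &#8212; &#169; 2005
```

Note that, like the original one-liner, this only repairs decimal references; hexadecimal ones such as `&#x2019` without a semicolon are not touched.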

Thanks,

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Replies are listed 'Best First'.
Re: Best RSS modules and techniques?
by Hero Zzyzzx (Curate) on Aug 23, 2005 at 14:09 UTC

    Not answering your question directly, but: I'm the (sorta new) maintainer of XML::RSS and I'd be happy to get some help improving the module. It needs a documentation update because it does support 2.0 pretty well. It needs Atom support, too.

    If you're willing to pitch in (at least testing it), I'd be happy to send you pre-release tarballs of the new version.

    -Any sufficiently advanced technology is
    indistinguishable from doubletalk.

    My Biz

      Sure, I'm willing to help test. Email a tarball to my cpan email (DAGOLDEN). Or if you have a subversion server, send me the URL.

      -xdg

      Well, why should XML::RSS support Atom? There is already XML::Atom.

        One stop shopping? With Atom support, XML::RSS could create and parse pretty much all RSS formats of note.

        -Any sufficiently advanced technology is
        indistinguishable from doubletalk.

        My Biz

Re: Best RSS modules and techniques?
by spatterson (Pilgrim) on Aug 24, 2005 at 15:07 UTC
    I've had a fair bit of success with XML::RSSLite before, though it doesn't handle entity-encoded characters (&lt; etc.) too well.