I tried Super Search to see if this had been discussed, but most of the deluge of RSS questions seem to consist of "I'm trying to scrape RSS and I'm clueless, please help" so I gave up in frustration. Apologies if I missed something obvious somewhere.
I'm not clueless and I've been working with RSS for a while now (cf. Code for Perlmonks XML to RSS), and I'm a little frustrated with the various incompatibilities and breakage that I encounter when dealing with people's feeds. I'm currently using a combination of XML::RSS and XML::RAI, though largely because that's what I started with. So my questions are these:
What modules for RSS parsing have people found to be the most robust and stable (given unreliable, non-standard input feeds)?
What modules best parse all the various feed standards? (E.g. XML::RSS docs are inconsistent about RSS 2.0 support)
What modules best produce all the various feed standards?
What pre-processing have people found helpful in cleaning up non-standard feeds to keep XML::Parser and the like from giving up on errors?
On that last point, I'll share my own helpful snippet. I'm currently doing a rather hackish bit with a regex and HTML::Entities::Numbered to fix up some of the broken encodings that I commonly find in various feeds and that were breaking XML::Parser. YMMV.
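use HTML::Entities::Numbered;    # exports name2decimal_xml
$content =~ s/(&#\d+);?/$1;/g;    # terminate decimal character references that lack a trailing semicolon
$content = name2decimal_xml( $content );    # convert named HTML entities to decimal references, leaving XML's five built-in entities alone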
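For anyone who wants to try the fix-up in isolation, here is a minimal sketch that wraps the two steps in a sub and checks the result against XML::Parser. The helper names clean_feed and is_well_formed and the sample feed string are my own assumptions for illustration, not part of any module. The regex only handles decimal references, so hex references such as &#x2019 without a semicolon are left untouched.

```perl
use strict;
use warnings;
use XML::Parser;
use HTML::Entities::Numbered;    # exports name2decimal_xml by default

# Hypothetical helper: apply the two fix-ups from the snippet above.
sub clean_feed {
    my ($content) = @_;

    # Make sure every decimal character reference ends in a semicolon.
    $content =~ s/(&#\d+);?/$1;/g;

    # Turn named HTML entities (&nbsp;, &eacute;, ...) into decimal
    # references; the XML built-ins (&amp; &lt; &gt; &quot; &apos;) are kept.
    return name2decimal_xml($content);
}

# Hypothetical helper: true if XML::Parser accepts the string.
# XML::Parser dies on malformed input, so the parse is wrapped in eval.
sub is_well_formed {
    my ($xml) = @_;
    return eval { XML::Parser->new->parse($xml); 1 } ? 1 : 0;
}

# Made-up sample with an HTML-only entity and an unterminated reference.
my $raw   = '<rss><channel><title>Caf&eacute; &#8217 news&nbsp;</title></channel></rss>';
my $fixed = clean_feed($raw);

print 'raw:   ', ( is_well_formed($raw)   ? 'parses' : 'fails' ), "\n";
print 'fixed: ', ( is_well_formed($fixed) ? 'parses' : 'fails' ), "\n";
```

Checking well-formedness before handing the string to XML::RSS or XML::RAI makes it easier to tell whether a failure comes from the feed itself or from the RSS layer. One caveat: a reference that is immediately followed by another digit (say "&#8217" directly before "5") would be merged into one longer number, so this remains a heuristic.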
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.