I tried Super Search to see whether this had been discussed, but most of the deluge of RSS questions seems to consist of "I'm trying to scrape RSS and I'm clueless, please help", so I gave up in frustration. Apologies if I missed something obvious somewhere.

I'm not clueless and I've been working with RSS for a while now (cf. Code for Perlmonks XML to RSS), and I'm a little frustrated with the various incompatibilities and breakage that I encounter when dealing with people's feeds. I'm currently using a combination of XML::RSS and XML::RAI -- though largely because that's what I started with. So my questions are these:

  1. What modules for RSS parsing have people found to be the most robust and stable (given unreliable, non-standard input feeds)?

  2. What modules best parse all the various feed standards? (E.g., the XML::RSS docs are inconsistent about RSS 2.0 support.)

  3. What modules best produce all the various feed standards?

  4. What pre-processing have people found helpful in cleaning up non-standard feeds to keep XML::Parser and the like from giving up on errors?

On that last point, I'll share my own helpful snippet. I'm currently doing a rather hackish bit with a regex and HTML::Entities::Numbered to fix up some of the broken encodings that I commonly find in various feeds and that were breaking XML::Parser. YMMV.

$content =~ s/(&#\d+);?/$1;/g;
$content = name2decimal_xml( $content );
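
To make the two steps easier to follow, here is a minimal, core-Perl-only sketch of the same idea. It is not the HTML::Entities::Numbered implementation: the clean_feed_entities and _numeric_ref subs and the four-entry %ENTITY_CODE table are hypothetical stand-ins for illustration, and the real module covers the full HTML entity set. The first step is the regex above; the second rewrites known named entities as decimal references while leaving the five XML-predefined entities alone, which is my understanding of what name2decimal_xml does.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny map of named HTML entities to code points.
# HTML::Entities::Numbered knows the full set; this is only for illustration.
my %ENTITY_CODE = (
    nbsp   => 160,
    copy   => 169,
    eacute => 233,
    mdash  => 8212,
);

# The five entities that XML itself predefines must be left untouched.
my %XML_PREDEFINED = map { $_ => 1 } qw(amp lt gt quot apos);

# Return a decimal character reference for a known named entity,
# or the original entity text if it is XML-predefined or unknown.
sub _numeric_ref {
    my ($name) = @_;
    return "&$name;" if $XML_PREDEFINED{$name} || !exists $ENTITY_CODE{$name};
    return sprintf '&#%d;', $ENTITY_CODE{$name};
}

sub clean_feed_entities {
    my ($content) = @_;

    # Step 1: make sure every decimal character reference ends in a semicolon.
    $content =~ s/(&#\d+);?/$1;/g;

    # Step 2: rewrite named entities as decimal references where we can.
    $content =~ s/&([A-Za-z][A-Za-z0-9]*);/_numeric_ref($1)/ge;

    return $content;
}

print clean_feed_entities('caf&eacute; &#8217 &amp; &copy;'), "\n";
```

Unknown names are passed through unchanged in this sketch, so a parser will still complain about them; whether to drop, escape, or log those is a separate decision.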

Thanks,

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.


In reply to Best RSS modules and techniques? by xdg
