PerlMonks  

How do I clean RSS feeds to make them usable?

by ajt (Prior)
on Apr 25, 2002 at 09:11 UTC ( [id://161895]=perlquestion )

ajt has asked for the wisdom of the Perl Monks concerning the following question:

I have started to play with RSS/RDF feeds. I read a few articles on the topic, and found that RSS is very easy to use. Grab the file via HTTP (HTTP::GHTTP or LWP), use the XML::RSS parser to convert it to a consistent format, and then use XML::LibXSLT to transform it into a piece of usable HTML.

I've come up against a few problems (e.g. Why so slow from CGI, but not command line?), the most annoying of which is that some people provide RSS feeds that are not valid XML. The commonest problems I've seen are entities used without the entity definition being included, and ampersands that are left unescaped.

As per any good XML application this causes the system to fail with no usable output.

Without wishing to encourage sloppy XML, as some companies encouraged sloppy HTML, what methods are there available to wash the file before passing it to RSS or LibXSLT?

I've considered using a RegEx, but I know that's not a place I wish to go, though it could be a way of locating an isolated & and replacing it with &amp;.
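For illustration only, here is a minimal sketch of the regex approach mentioned above, using core Perl and no extra modules. The sub name wash_ampersands and the small %html_entity table are hypothetical, made up for this example; a real table would need far more entries. It works on the raw text, so it would also alter anything inside a CDATA section, which is one reason to treat it as a stopgap rather than a fix.

```perl
use strict;
use warnings;

# The five entities every XML parser accepts without a DTD.
my %predefined = map { $_ => 1 } qw(amp lt gt quot apos);

# Hypothetical, deliberately tiny table of HTML entities that tend to
# leak into feeds, mapped to their numeric code points.
my %html_entity = (nbsp => 160, copy => 169, pound => 163, eacute => 233);

sub wash_ampersands {
    my ($xml) = @_;

    # Step 1: escape any & that does not start a named or numeric reference.
    $xml =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;

    # Step 2: rewrite named entities that XML does not predefine.
    # Known HTML entities become numeric references; unknown names have
    # their ampersand escaped so the parser sees plain text.
    $xml =~ s{&([A-Za-z][A-Za-z0-9]*);}{
        $predefined{$1}         ? "&$1;"
      : exists $html_entity{$1} ? "&#$html_entity{$1};"
      :                           "&amp;$1;"
    }ge;

    return $xml;
}
```

Calling wash_ampersands on the fetched feed text before handing it to XML::RSS would be the natural place to use it. References that are already correct, such as &amp; or &#38;, are left alone, so running the wash twice does no further damage.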

Yesterday I saw Matts mention xmllint on his use Perl; column. I don't know anything about it or if there is a Perl interface for it that works on Windows. I see there is also XML::Clean and I could always try Tidy in XML mode.

Questions:

  1. Am I being unwise in accepting sloppy XML? Is this the thin end of the wedge?
  2. Should I accept the sloppy XML and wash it myself, given that I'm too small to get big companies to clean their act up?
  3. If I am to clean up the XML, what Perl ways of doing this are there?

Humble thanks in advance...

Replies are listed 'Best First'.
Re: How do I clean RSS feeds to make them usable?
by tomhukins (Curate) on Apr 25, 2002 at 10:32 UTC
    Parsing badly formed RSS or XML might help you find an answer to your first two questions. I use the code I described in that discussion and find it copes with most common RSS errors, although I'm well aware of its faults and wouldn't use it in any serious application where robustness or accuracy is required.
Re: How do I clean RSS feeds to make them usable?
by Fletch (Bishop) on Apr 25, 2002 at 12:30 UTC

    Don't think that you're too small to get people to clean up. I mailed the editors at The Register after getting fed up with having to clean up their RSS for my personal homepage one time too many. I explained that I appreciated the service they're providing, but that they weren't generating valid RSS per the specs (and provided URLs to same).

    Got a nice reply back from (IMSMR) their technical editor that they couldn't do much as it was generated straight from their content management system or some such, but that he'd send a note to the authors asking them to be careful what they used in titles. There's still the occasional glitch, but I have noticed that I don't see as many errors as there used to be.

    As for fixing things, I use HTML::Entities' decode_entities routine to check. If that doesn't pass on something I run it through s/&(\S*)/&amp;$1/

Re: How do I clean RSS feeds to make them usable?
by Matts (Deacon) on Apr 25, 2002 at 12:34 UTC
    Try my rssmirror script. I use that on axkit.org to download feeds. It doesn't do a whole lot, but it should be a good start.

Node Type: perlquestion [id://161895]
Approved by rob_au