PerlMonks
How do I clean RSS feeds to make them usable?
by ajt (Prior) on Apr 25, 2002 at 09:11 UTC ( [id://161895]=perlquestion )
ajt has asked for the wisdom of the Perl Monks concerning the following question:
I have started to play with RSS/RDF feeds. I read a few articles on the topic, and found that RSS is very easy to use: grab the file via HTTP (HTTP::GHTTP or LWP), use the XML::RSS parser to convert it to a consistent format, and then use XML::LibXSLT to transform it into a piece of usable HTML.
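For readers who have not seen that pipeline, here is a minimal sketch of the fetch, normalise and transform steps, assuming LWP::UserAgent, XML::RSS, XML::LibXML and XML::LibXSLT are installed. The sub names, the script name and the command-line arguments are hypothetical, and this is not the poster's actual code. Note that XML::RSS dies on malformed input, which is the failure described in this question.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use XML::RSS;
use XML::LibXML;
use XML::LibXSLT;

# Fetch the raw feed over HTTP; dies with the HTTP status line on failure.
sub fetch_feed {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($url);
    die "Fetch failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->content;
}

# Parse the feed and re-serialise it as RSS 1.0, so the stylesheet only
# has to cope with one format. parse() dies if the XML is not well-formed.
sub normalise_rss {
    my ($xml) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($xml);
    $rss->{output} = '1.0';
    return $rss->as_string;
}

# Apply an XSLT stylesheet (passed as a string) to an XML string.
sub transform {
    my ($xml, $xsl) = @_;
    my $parser = XML::LibXML->new;
    my $style  = XML::LibXSLT->new->parse_stylesheet($parser->parse_string($xsl));
    my $result = $style->transform($parser->parse_string($xml));
    return $style->output_string($result);
}

# Hypothetical usage: perl rss2html.pl <feed-url> <stylesheet.xsl>
if (@ARGV == 2) {
    my ($url, $xsl_file) = @ARGV;
    open my $fh, '<', $xsl_file or die "Cannot read $xsl_file: $!\n";
    my $xsl = do { local $/; <$fh> };
    close $fh;
    print transform(normalise_rss(fetch_feed($url)), $xsl);
}
```

Because the output is RSS 1.0, the stylesheet would need to match elements in the http://purl.org/rss/1.0/ namespace.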
I've come up against a few problems (e.g. Why so slow from CGI, but not command line?), the most annoying of which is that some people provide RSS feeds that are not well-formed XML. The commonest problems I've seen are entities used without the entity definition being included, and ampersands left unescaped. As with any good XML application, this causes the system to fail with no usable output.
Without wishing to encourage sloppy XML, as some companies encouraged sloppy HTML, what methods are available to wash the file before passing it to XML::RSS or XML::LibXSLT? I've considered using a regex, but I know that's not a place I wish to go, though it could be a way of locating an isolated & and replacing it with &amp;.
Yesterday I saw Matts mention xmllint on his use Perl; column. I don't know anything about it, or whether there is a Perl interface for it that works on Windows. I see there is also XML::Clean, and I could always try Tidy in XML mode.
Questions:
Humble thanks in advance...
Some Useful Resources:
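The regex idea raised in the question, locating an isolated & and replacing it with &amp;, might look something like the sketch below. It uses only core Perl. The sub name is hypothetical, and the approach is an assumption about what "isolated" should mean: any & that is not followed by something shaped like a named entity or a numeric character reference. It does nothing about named entities that lack a definition, and it would wrongly escape ampersands inside CDATA sections, so it is a stopgap rather than a real cleaner.

```perl
use strict;
use warnings;

# Escape every "&" that does not start something shaped like an entity
# reference (&name;) or a character reference (&#38; or &#x26;).
# Assumption: text that merely looks like a reference, e.g. "&T;", is kept.
sub escape_bare_ampersands {
    my ($xml) = @_;
    $xml =~ s/&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $xml;
}

print escape_bare_ampersands('<title>Fish & Chips</title>'), "\n";
# prints: <title>Fish &amp; Chips</title>
```

Running this over the fetched feed before handing it to the parser would deal with stray ampersands only. References that are already escaped, such as &amp; or &#38;, pass through unchanged.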